* [RFC v11][PATCH 00/13] Kernel based checkpoint/restart
@ 2008-12-05 17:31 ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linus Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Checkpoint-restart (c/r): fixed races in file handling (comments
from Al Viro). Updated and tested against v2.6.28-rc7 (feaf384...)

We'd like these to make it into -mm. This version addresses the
last of the known bugs. Please pull at least the first 11 patches,
as they are similar to before.

Patches 1-11 are stable, providing self- and external- c/r of a
single process.
Patches 12 and 13 are newer, adding support for c/r of multiple
processes.

The git tree tracking v11, branch 'ckpt-v11' (and older versions):
	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

Restarting multiple processes requires 'mktree' userspace tool:
	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

Oren.


--
Why do we want it?  It allows containers to be moved between physical
machines' kernels in the same way that VMware can move VMs between
physical machines' hypervisors.  There are currently at least two
out-of-tree implementations of this in the commercial world (IBM's
Metacluster and Parallels' OpenVZ/Virtuozzo) and several in the academic
world like Zap.

Why do we need it in mainline now?  Because we already have plenty of
out-of-tree ones, and we want to know what an in-tree one will be like.  :)
What *I* want right now is the extra review and scrutiny that comes with
a mainline submission to make sure we're not going in a direction
contrary to the community.

This only supports pretty simple apps.  But, I trust Ingo when he says:
>> > > Generally, if something works for simple apps already (in a robust, 
>> > > compatible and supportable way) and users find it "very cool", then 
>> > > support for more complex apps is not far in the future.  but if you
>> > > want to support more complex apps straight away, it takes forever and
>> > > gets ugly.

We're *certainly* going to be changing the ABI (which is the format of
the checkpoint).  I'd like to follow the model that we used for
ext4-dev, which is to make it very clear that this is a development-only
feature for now.  Perhaps we do that by making the interface only
available through debugfs or something similar for now.  Or by reserving
the syscall numbers but requiring some runtime switch to be thrown before
they can be used.  I'm open to suggestions here.
--

--
Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Handle multiple namespaces in a container (e.g. save the filesystem
  namespaces state with the file descriptors)
- Security (without CAP_SYS_ADMIN file restore may fail)

Changelog:

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen,
    life span of the object hash, references to objects in the hash
 
[2008-Nov-26] v10:
  - Grab vfs root of container init, rather than current process
  - Acquire dcache_lock around call to __d_path() in cr_fill_name()
  - Force end-of-string in cr_read_string() (fix possible DoS)
  - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()

[2008-Nov-10] v9:
  - Support multiple processes c/r
  - Extend checkpoint header with architecture-dependent header
  - Misc bug fixes (see individual changelogs)
  - Rebase to v2.6.28-rc3.

[2008-Oct-29] v8:
  - Support "external" checkpoint
  - Include Dave Hansen's 'deny-checkpoint' patch
  - Split docs in Documentation/checkpoint/..., and improve contents

[2008-Oct-17] v7:
  - Fix save/restore state of FPU
  - Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
  - Add assumptions and what's-missing to documentation
  - Misc fixes and cleanups

[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename header files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use standard list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.
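
The v10 entry about forcing end-of-string when reading a string from the
image is worth a small illustration: a string field read from an
untrusted image may lack a terminator, so the reader has to add one
itself.  The sketch below is a userspace analogy only.  It is not the
body of cr_read_string(), and demo_copy_string() is a hypothetical name
chosen for this example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch series): copy a string field
 * of src_len bytes taken from an untrusted image into dst, and force a
 * terminator so later string operations cannot run past the buffer.
 * Returns the length of the resulting string.
 */
static size_t demo_copy_string(char *dst, size_t dst_len,
			       const void *src, size_t src_len)
{
	size_t n;

	if (dst_len == 0)
		return 0;

	/* Leave room for the terminator we are about to force. */
	n = src_len < dst_len - 1 ? src_len : dst_len - 1;
	memcpy(dst, src, n);

	/* Force end-of-string regardless of what the image contained. */
	dst[n] = '\0';

	return strlen(dst);
}
```

With a 4-byte destination and the 6 unterminated bytes "abcdef" as input,
the helper stores "abc" and returns 3.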

--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.
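
To make "serialize state and write it out to a file descriptor" a bit
more concrete, here is a minimal userspace sketch of the idea: each piece
of state goes out as a small header followed by its payload, as soon as
it is produced.  The record layout (struct demo_rec_hdr) and the function
names are assumptions made up for this example; the actual image format
is whatever the patches define.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical record header, for illustration only. */
struct demo_rec_hdr {
	uint32_t type;	/* what kind of state follows */
	uint32_t len;	/* payload length in bytes */
};

/* Write exactly len bytes, retrying on short writes and EINTR. */
static int demo_write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Push one record (header, then payload) to the image descriptor. */
static int demo_write_record(int fd, uint32_t type,
			     const void *payload, uint32_t len)
{
	struct demo_rec_hdr h = { .type = type, .len = len };

	if (demo_write_all(fd, &h, sizeof(h)) < 0)
		return -1;
	return demo_write_all(fd, payload, len);
}
```

The writer never needs the whole image in memory at once: it emits one
record per object as it walks the state, and the reader can consume the
stream in the same order.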
--


^ permalink raw reply	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 01/13] Create syscalls: sys_checkpoint, sys_restart
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 02/13] Checkpoint/restart: initial documentation Oren Laadan
                     ` (16 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linus Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a file descriptor (for the image file) and flags as
arguments. For sys_checkpoint the first argument identifies the target
container; for sys_restart it will identify the checkpoint image.

A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.

By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.
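
As a sketch of how a caller might act on that three-way return value
(positive identifier after a checkpoint, 0 when resuming from a restart,
negative on error), consider the helper below.  The enum and function
names are hypothetical and not part of this patch; the helper only
classifies a value that has already been returned and does not invoke
the syscall itself.

```c
/* Hypothetical names, for illustration only. */
enum cr_outcome {
	CR_OUTCOME_ERROR,		/* negative: the call failed */
	CR_OUTCOME_RESTARTED,		/* 0: execution resumed from a restart */
	CR_OUTCOME_CHECKPOINTED		/* positive: checkpoint identifier */
};

/* Classify the value returned by a checkpoint call, fork()-style. */
static enum cr_outcome cr_classify(long ret)
{
	if (ret < 0)
		return CR_OUTCOME_ERROR;
	if (ret == 0)
		return CR_OUTCOME_RESTARTED;
	return CR_OUTCOME_CHECKPOINTED;
}
```

A task that checkpoints itself would branch on this result much as it
would on the return value of fork(): one path continues after taking the
checkpoint, the other runs only in the restarted instance.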

We don't use copyin()/copyout() because it requires holding the entire
image in user space, and does not make sense for restart.  Also, we
don't use a pipe, pseudo-fs file and the like, because they work by
generating data on demand as the user pulls it (unless the entire
image is buffered in the kernel) and would require more complex logic.
They also would significantly complicate checkpoint that includes self.

Changelog[v5]:
  - Config is 'def_bool n' by default

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/unistd_32.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 checkpoint/Kconfig                 |   11 +++++++++
 checkpoint/Makefile                |    5 ++++
 checkpoint/sys.c                   |   41 ++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h           |    2 +
 init/Kconfig                       |    2 +
 kernel/sys_ni.c                    |    4 +++
 8 files changed, 69 insertions(+), 0 deletions(-)
 create mode 100644 checkpoint/Kconfig
 create mode 100644 checkpoint/Makefile
 create mode 100644 checkpoint/sys.c

diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..a5f9e09 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,8 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_checkpoint		333
+#define __NR_restart		334
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..ffaa635
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,11 @@
+config CHECKPOINT_RESTART
+	prompt "Enable checkpoint/restart (EXPERIMENTAL)"
+	def_bool n
+	depends on X86_32 && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..07d018b
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..375129c
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,41 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which to dump the checkpoint image
+ * @flags: checkpoint operation flags
+ *
+ * Returns positive identifier on success, 0 when returning from restart
+ * or negative value on error
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+	pr_debug("sys_checkpoint not implemented yet\n");
+	return -ENOSYS;
+}
+/**
+ * sys_restart - restart a container
+ * @crid: checkpoint image identifier
+ * @fd: file from which to read the checkpoint image
+ * @flags: restart operation flags
+ *
+ * Returns negative value on error, or otherwise returns in the realm
+ * of the original checkpoint
+ */
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
+{
+	pr_debug("sys_restart not implemented yet\n");
+	return -ENOSYS;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 04fb47b..9750393 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -621,6 +621,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
diff --git a/init/Kconfig b/init/Kconfig
index f763762..57364fe 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -814,6 +814,8 @@ config MARKERS
 
 source "arch/Kconfig"
 
+source "checkpoint/Kconfig"
+
 endmenu		# General setup
 
 config HAVE_GENERIC_DMA_COHERENT
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e14a232..fcd65cc 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,3 +174,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 01/13] Create syscalls: sys_checkpoint, sys_restart
  2008-12-05 17:31 ` Oren Laadan
  (?)
@ 2008-12-05 17:31   ` Oren Laadan
  -1 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linus Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a file descriptor (for the image file) and flags as
arguments. For sys_checkpoint the first argument identifies the target
container; for sys_restart it will identify the checkpoint image.

A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.

By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.
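
A minimal userspace sketch of how a caller might tell the three
outcomes apart, assuming the convention documented for sys_checkpoint
in this patch (positive identifier on success, 0 when returning from
restart, negative value on error). The enum and the helper
classify_checkpoint_ret() are hypothetical names used only for
illustration; they are not part of the patch.

```c
/* Hypothetical names, for illustration only (not part of the patch). */
enum ckpt_outcome {
	CKPT_ERROR,	/* ret < 0: the call failed */
	CKPT_RESTARTED,	/* ret == 0: execution resumed from a restart */
	CKPT_TAKEN	/* ret > 0: checkpoint taken, ret is its identifier */
};

/*
 * Classify the return value of a checkpoint call, much as a caller
 * of fork() tells the parent, the child and a failure apart.
 */
static enum ckpt_outcome classify_checkpoint_ret(long ret)
{
	if (ret < 0)
		return CKPT_ERROR;
	if (ret == 0)
		return CKPT_RESTARTED;
	return CKPT_TAKEN;
}
```

A self-checkpointing task would pass the syscall result to such a
helper and branch on the outcome, taking the CKPT_RESTARTED path when
it wakes up inside a restarted image.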

We don't use copyin()/copyout() because it requires holding the entire
image in user space, and does not make sense for restart.  Also, we
don't use a pipe, pseudo-fs file and the like, because they work by
generating data on demand as the user pulls it (unless the entire
image is buffered in the kernel) and would require more complex logic.
They also would significantly complicate checkpoint that includes self.

Changelog[v5]:
  - Config is 'def_bool n' by default

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/unistd_32.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 checkpoint/Kconfig                 |   11 +++++++++
 checkpoint/Makefile                |    5 ++++
 checkpoint/sys.c                   |   41 ++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h           |    2 +
 init/Kconfig                       |    2 +
 kernel/sys_ni.c                    |    4 +++
 8 files changed, 69 insertions(+), 0 deletions(-)
 create mode 100644 checkpoint/Kconfig
 create mode 100644 checkpoint/Makefile
 create mode 100644 checkpoint/sys.c

diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..a5f9e09 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,8 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_checkpoint		333
+#define __NR_restart		334
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..ffaa635
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,11 @@
+config CHECKPOINT_RESTART
+	prompt "Enable checkpoint/restart (EXPERIMENTAL)"
+	def_bool n
+	depends on X86_32 && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..07d018b
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..375129c
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,41 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which to dump the checkpoint image
+ * @flags: checkpoint operation flags
+ *
+ * Returns positive identifier on success, 0 when returning from restart
+ * or negative value on error
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+	pr_debug("sys_checkpoint not implemented yet\n");
+	return -ENOSYS;
+}
+/**
+ * sys_restart - restart a container
+ * @crid: checkpoint image identifier
+ * @fd: file from which to read the checkpoint image
+ * @flags: restart operation flags
+ *
+ * Returns negative value on error, or otherwise returns in the realm
+ * of the original checkpoint
+ */
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
+{
+	pr_debug("sys_restart not implemented yet\n");
+	return -ENOSYS;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 04fb47b..9750393 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -621,6 +621,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
diff --git a/init/Kconfig b/init/Kconfig
index f763762..57364fe 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -814,6 +814,8 @@ config MARKERS
 
 source "arch/Kconfig"
 
+source "checkpoint/Kconfig"
+
 endmenu		# General setup
 
 config HAVE_GENERIC_DMA_COHERENT
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e14a232..fcd65cc 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,3 +174,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
-- 
1.5.4.3


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 02/13] Checkpoint/restart: initial documentation
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-12-05 17:31   ` [RFC v11][PATCH 01/13] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart Oren Laadan
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.
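
As an illustration of the record framing described in internals.txt (a
'struct cr_hdr' pre-header followed by 'len' bytes of payload), here
is a minimal userspace sketch that walks a buffer of such records. It
assumes host byte order, no padding between records, and a stream made
up only of framed records (the raw page data of the memory dump is not
handled). cr_count_records() is a hypothetical helper, not part of
this patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Mirrors the layout of 'struct cr_hdr' from internals.txt, with
 * fixed-width userspace types standing in for __s16 and __u32.
 */
struct cr_hdr {
	int16_t type;		/* type of the payload */
	int16_t len;		/* payload length in bytes */
	uint32_t parent;	/* owner object instance */
};

/*
 * Hypothetical helper: count the records in buf[0..size).
 * Returns the number of complete records, or -1 if a header or a
 * payload is truncated or 'len' is negative.
 */
static long cr_count_records(const unsigned char *buf, size_t size)
{
	size_t off = 0;
	long n = 0;

	while (off < size) {
		struct cr_hdr h;

		if (size - off < sizeof(h))
			return -1;	/* truncated pre-header */
		/* memcpy avoids unaligned access into the byte buffer */
		memcpy(&h, buf + off, sizeof(h));
		off += sizeof(h);

		if (h.len < 0 || (size_t)h.len > size - off)
			return -1;	/* bad or truncated payload */
		off += (size_t)h.len;	/* skip the payload */
		n++;
	}
	return n;
}
```

A real reader would dispatch on 'type' instead of skipping the
payload; the sketch only shows how 'len' delimits one record from the
next.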

Changelog[v8]:
  - Split into multiple files in Documentation/checkpoint/...
  - Extend documentation, fix typos and comments from feedback

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 Documentation/checkpoint/ckpt.c        |   32 ++++++
 Documentation/checkpoint/internals.txt |  133 +++++++++++++++++++++++++
 Documentation/checkpoint/readme.txt    |  105 +++++++++++++++++++
 Documentation/checkpoint/rstr.c        |   20 ++++
 Documentation/checkpoint/security.txt  |   38 +++++++
 Documentation/checkpoint/self.c        |   57 +++++++++++
 Documentation/checkpoint/test.c        |   48 +++++++++
 Documentation/checkpoint/usage.txt     |  171 ++++++++++++++++++++++++++++++++
 8 files changed, 604 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint/ckpt.c
 create mode 100644 Documentation/checkpoint/internals.txt
 create mode 100644 Documentation/checkpoint/readme.txt
 create mode 100644 Documentation/checkpoint/rstr.c
 create mode 100644 Documentation/checkpoint/security.txt
 create mode 100644 Documentation/checkpoint/self.c
 create mode 100644 Documentation/checkpoint/test.c
 create mode 100644 Documentation/checkpoint/usage.txt

diff --git a/Documentation/checkpoint/ckpt.c b/Documentation/checkpoint/ckpt.c
new file mode 100644
index 0000000..094408c
--- /dev/null
+++ b/Documentation/checkpoint/ckpt.c
@@ -0,0 +1,32 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int ret;
+
+	if (argc != 2) {
+		fprintf(stderr, "usage: ckpt PID\n");
+		exit(1);
+	}
+
+	pid = atoi(argv[1]);
+	if (pid <= 0) {
+		fprintf(stderr, "invalid pid\n");
+		exit(1);
+	}
+
+	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+
+	if (ret < 0)
+		perror("checkpoint");
+	else
+		fprintf(stderr, "checkpoint id %d\n", ret);
+
+	return (ret > 0 ? 0 : 1);
+}
+
diff --git a/Documentation/checkpoint/internals.txt b/Documentation/checkpoint/internals.txt
new file mode 100644
index 0000000..b363e83
--- /dev/null
+++ b/Documentation/checkpoint/internals.txt
@@ -0,0 +1,133 @@
+
+	===== Internals of Checkpoint-Restart =====
+
+
+(1) Order of state dump
+
+The order of operations, both save and restore, is as follows:
+
+* Header section: header, container information, etc.
+
+* Global section: [TBD] global resources such as IPC, UTS, etc.
+
+* Process forest: [TBD] tasks and their relationships
+
+* Per task data (for each task):
+  -> task state: elements of task_struct
+  -> thread state: elements of thread_struct and thread_info
+  -> CPU state: registers etc, including FPU
+  -> memory state: memory address space layout and contents
+  -> filesystem state: [TBD] filesystem namespace state, chroot, cwd, etc
+  -> files state: open file descriptors and their state
+  -> signals state: [TBD] pending signals and signal handling state
+  -> credentials state: [TBD] user and group state, statistics
+
+
+(2) Checkpoint image format
+
+The checkpoint image format is composed of records consisting of a
+pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by "struct cr_hdr" as follows:
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+'type' identifies the type of the payload, 'len' tells its length in
+bytes, and 'parent' identifies the owner object instance. The meaning
+of 'parent' varies depending on the type. For example, for CR_HDR_MM,
+'parent' identifies the task to which this MM belongs. The payload
+also varies depending on the type, for instance, the data describing a
+task_struct is given by a 'struct cr_hdr_task' (type CR_HDR_TASK) and
+so on.
+
+The format of the memory dump is as follows: for each VMA, there is a
+'struct cr_vma'; if the VMA is file-mapped, it is followed by the file
+name. Following comes the actual contents, in one or more chunks: each
+chunk begins with a header that specifies how many pages it holds,
+then the virtual addresses of all the dumped pages in that chunk,
+followed by the actual contents of all the dumped pages. A header with
+zero number of pages marks the end of the contents for a particular
+VMA. Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+cr_hdr + cr_hdr_head
+cr_hdr + cr_hdr_task
+	cr_hdr + cr_hdr_mm
+		cr_hdr + cr_hdr_vma + cr_hdr + string
+			cr_hdr_pgarr (nr_pages = 2)
+			addr1, addr2
+			page1, page2
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_hdr_vma
+			cr_hdr_pgarr (nr_pages = 3)
+			addr3, addr4, addr5
+			page3, page4, page5
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_mm_context
+	cr_hdr + cr_hdr_thread
+	cr_hdr + cr_hdr_cpu
+cr_hdr + cr_hdr_tail
+
+
+(3) Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects and whether they were already saved.  Shared
+objects are stored in a hash table as they appear, indexed by their
+kernel address. (The hash table itself is not saved as part of the
+checkpoint image: it is constructed dynamically during both checkpoint
+and restart, and discarded at the end of the operation).
+
+Each shared object that is found is first looked up in the hash table.
+On the first encounter, the object will not be found, so its state is
+dumped, and the object is assigned a unique identifier and also stored
+in the hash table. Subsequent lookups of that object in the hash table
+will yield that entry, and then only the unique identifier is saved,
+as opposed to the entire state of the object.
+
+During restart, shared objects are seen by their unique identifiers as
+assigned during the checkpoint. Each shared object that is read in is
+first looked up in the hash table. On the first encounter it will not
+be found, meaning that the object needs to be created and its state
+read in and restored. Then the object is added to the hash table, this
+time indexed by its unique identifier. Subsequent lookups of the same
+unique identifier in the hash table will yield that entry, and then
+the existing object instance is reused instead of creating another one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+The interface for the hash table is the following:
+
+cr_obj_get_by_ptr() - find the unique object reference (objref)
+  of the object that is pointed to by ptr [checkpoint]
+
+cr_obj_add_ptr() - add the object pointed to by ptr to the hash table
+  if not already there, and fill its unique object reference (objref)
+
+cr_obj_get_by_ref() - return the pointer to the object whose unique
+  object reference is equal to objref [restart]
+
+cr_obj_add_ref() - add the object pointed to by ptr to the hash table
+  under the given unique object reference (objref) [restart]
+
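The checkpoint-side behaviour described in section (3) of
internals.txt (full state dumped on the first encounter, only the
identifier saved afterwards) can be sketched in userspace as follows.
This is a sketch under stated assumptions, not the kernel
implementation: 'struct objtab', objtab_init() and objtab_add_ptr()
are hypothetical names, the table is a small fixed-size
open-addressing hash keyed by object address, and the reference
counting that the real hash performs is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define OBJTAB_SLOTS 64	/* power of two; kept small for illustration */

/* Hypothetical types and names, not taken from the patch. */
struct objtab_entry {
	const void *ptr;	/* key: object address, NULL marks a free slot */
	int objref;		/* unique identifier saved in the image */
};

struct objtab {
	struct objtab_entry slot[OBJTAB_SLOTS];
	int next_objref;	/* next identifier to hand out */
};

static void objtab_init(struct objtab *t)
{
	size_t i;

	for (i = 0; i < OBJTAB_SLOTS; i++) {
		t->slot[i].ptr = NULL;
		t->slot[i].objref = 0;
	}
	t->next_objref = 1;
}

/*
 * Look up ptr (which must not be NULL) and insert it if absent.
 * Stores the identifier in *objref and returns:
 *    1  first encounter: the caller dumps the full object state
 *    0  already known: the caller saves only the identifier
 *   -1  the table is full
 */
static int objtab_add_ptr(struct objtab *t, const void *ptr, int *objref)
{
	size_t i = ((uintptr_t)ptr >> 4) & (OBJTAB_SLOTS - 1);
	size_t probes;

	for (probes = 0; probes < OBJTAB_SLOTS; probes++) {
		struct objtab_entry *e = &t->slot[i];

		if (e->ptr == ptr) {
			*objref = e->objref;
			return 0;
		}
		if (e->ptr == NULL) {
			e->ptr = ptr;
			e->objref = t->next_objref++;
			*objref = e->objref;
			return 1;
		}
		i = (i + 1) & (OBJTAB_SLOTS - 1);	/* linear probing */
	}
	return -1;
}
```

On restart the same idea applies with the roles swapped: the table is
keyed by the identifier read from the image and yields the object
instance that was already re-created.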
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
new file mode 100644
index 0000000..344a551
--- /dev/null
+++ b/Documentation/checkpoint/readme.txt
@@ -0,0 +1,105 @@
+
+	===== Checkpoint-Restart support in the Linux kernel =====
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+
+Reviewers:	Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
+		Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
+
+Application checkpoint/restart [C/R] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. C/R can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relatively opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time.  The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial C/R products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide C/R: sys_checkpoint()
+and sys_restart(). The checkpoint code basically serializes internal
+kernel state and writes it out to a file descriptor, and the resulting
+image is stream-able. More specifically, it consists of 5 steps:
+
+1. Pre-dump
+2. Freeze the container
+3. Dump
+4. Thaw (or kill) the container
+5. Post-dump
+
+Steps 1 and 5 are an optimization to reduce application downtime. In
+particular, "pre-dump" works before freezing the container, e.g. the
+pre-copy for live migration, and "post-dump" works after the container
+resumes execution, e.g. write-back the data to secondary storage.
+
+The restart code basically reads the saved kernel state from a file
+descriptor, and re-creates the tasks and the resources they need to
+resume execution. The restart code is executed by each task that is
+restored in a new container to reconstruct its own state.
+
+
+=== Current Implementation
+
+* How useful is this code as it stands in real-world usage?
+
+Right now, the application must be a single process that does not
+share any resources with other processes. The only file descriptors
+that may be open are simple files and directories; they may not
+include devices, sockets or pipes.
+
+For an "external" checkpoint, the caller must first freeze (or stop)
+the target process. For "self" checkpoint, the application must be
+specifically written to use the new system calls. The restart does not
+yet preserve the pid of the original process, but will use whatever
+pid it was given by the kernel.
+
+What this means in practice is that it is useful for a simple
+application doing computational work and input/output from/to files.
+
+Currently, namespaces are not saved or restored. They will be treated
+as a class of shared object. In particular, it is assumed that the
+task's file system namespace is the "root" for the entire container.
+It is also assumed that the same file system view is available for the
+restart task(s). Otherwise, a file system snapshot is required.
+
+* What additional work needs to be done to it?
+
+We know this design can work.  We have two commercial products and a
+horde of academic projects doing it today using this basic design.
+We're early in this particular implementation because we're trying to
+release early and often.
+
diff --git a/Documentation/checkpoint/rstr.c b/Documentation/checkpoint/rstr.c
new file mode 100644
index 0000000..288209d
--- /dev/null
+++ b/Documentation/checkpoint/rstr.c
@@ -0,0 +1,20 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+	if (ret < 0)
+		perror("restart");
+
+	printf("should not reach here !\n");
+
+	return 1;
+}
+
diff --git a/Documentation/checkpoint/security.txt b/Documentation/checkpoint/security.txt
new file mode 100644
index 0000000..e5b4107
--- /dev/null
+++ b/Documentation/checkpoint/security.txt
@@ -0,0 +1,38 @@
+
+	===== Security considerations for Checkpoint-Restart =====
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it. Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+read mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image.  When restoration of credentials
+becomes supported, the ability of the task that calls sys_restart()
+to setresuid/setresgid to those values must definitely be checked.
+
+Keeping the restart procedure within the limits of the caller's
+credentials means that there are various scenarios that cannot
+be supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+
diff --git a/Documentation/checkpoint/self.c b/Documentation/checkpoint/self.c
new file mode 100644
index 0000000..febb888
--- /dev/null
+++ b/Documentation/checkpoint/self.c
@@ -0,0 +1,57 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	FILE *file;
+	int i, ret;
+	float a;
+
+	close(0);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+
+		if (i == 2) {
+			ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+			if (ret < 0) {
+				fprintf(file, "ckpt: %s\n", strerror(errno));
+				exit(2);
+			}
+			fprintf(file, "checkpoint ret: %d\n", ret);
+			fflush(file);
+		}
+	}
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/test.c b/Documentation/checkpoint/test.c
new file mode 100644
index 0000000..1183655
--- /dev/null
+++ b/Documentation/checkpoint/test.c
@@ -0,0 +1,48 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	FILE *file;
+	float a;
+	int i;
+
+	close(0);
+	close(1);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+	}
+
+	fprintf(file, "world, hello (%.2f) !\n", a);
+	fflush(file);
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..1b42d6b
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,171 @@
+
+	===== How to use Checkpoint-Restart =====
+
+The API consists of two new system calls:
+
+* int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+
+    Checkpoint a container whose init task is identified by pid, to
+    the file designated by fd. 'flags' will have future meaning (must
+    be 0 for now).
+
+    Returns: a positive checkpoint identifier (crid) upon success, 0
+    if it returns from a restart, and -1 if an error occurs.
+
+    'crid' uniquely identifies a checkpoint image. For each checkpoint
+    the kernel allocates a unique 'crid', that remains valid for as
+    long as the checkpoint is kept in the kernel (for instance, when a
+    checkpoint, or a partial checkpoint, may reside in kernel memory).
+
+* int sys_restart(int crid, int fd, unsigned long flags);
+
+    Restart a container from a checkpoint image that is read from the
+    blob stored in the file designated by fd. 'crid' will have future
+    meaning (must be 0 for now). 'flags' will have future meaning
+    (must be 0 for now).
+
+    The role of 'crid' is to identify the checkpoint image in the case
+    that it remains in kernel memory, so that a restart can later be
+    performed directly from that in-kernel image.
+
+    Returns: -1 if an error occurs, 0 on success when restarting from
+    a "self" checkpoint, and return value of system call at the time
+    of the checkpoint when restarting from an "external" checkpoint.
+
+    If restarting from an "external" checkpoint, tasks that were
+    executing a system call will observe the return value of that
+    system call (as it was when interrupted for the act of taking the
+    checkpoint), and tasks that were executing in user space will be
+    ready to return there.
+
+    Upon successful "external" restart, the container will end up in a
+    frozen state.
+
+The granularity of a checkpoint is usually a whole container. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 in the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+If the caller passes a pid which does not refer to a container's init
+task, then sys_checkpoint() returns -EINVAL. (This is because with
+nested containers a task may belong to more than one container).
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases,
+if there are other tasks possibly sharing state with the container,
+they must not modify it during the operation. It is the responsibility
+of the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g. not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+Here is a code snippet that illustrates how a checkpoint is initiated
+by a process in a container - the logic is similar to fork():
+	...
+	crid = checkpoint(1, ...);
+	switch (crid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(crid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
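The return-value conventions above can be sketched in userspace C. This
is only an illustration, not part of the patch: cr_checkpoint(),
cr_restart(), cr_classify() and enum cr_outcome are hypothetical names,
and __NR_checkpoint / __NR_restart are assumed to come from the headers
of a kernel that carries these patches. Where they are missing, the
sketch falls back to an invalid syscall number, so the wrappers fail
(with ENOSYS on Linux) instead of invoking an unrelated syscall.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for the syscall() prototype */
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: the real numbers exist only with patched kernel headers. */
#ifdef __NR_checkpoint
#define CR_NR_CHECKPOINT __NR_checkpoint
#else
#define CR_NR_CHECKPOINT (-1L)	/* invalid number: the call just fails */
#endif
#ifdef __NR_restart
#define CR_NR_RESTART __NR_restart
#else
#define CR_NR_RESTART (-1L)
#endif

/* Hypothetical names for the three outcomes described in usage.txt. */
enum cr_outcome {
	CR_FAILED,	/* -1: an error occurred, errno is set */
	CR_RESTARTED,	/*  0: execution resumed from a restart */
	CR_TAKEN	/* >0: checkpoint taken, the value is the crid */
};

/* Thin wrapper: checkpoint the container whose init task is 'pid'. */
long cr_checkpoint(pid_t pid, int fd, unsigned long flags)
{
	assert(fd >= 0);
	return syscall(CR_NR_CHECKPOINT, pid, fd, flags);
}

/* Thin wrapper: restart from the image read from 'fd'. */
long cr_restart(int crid, int fd, unsigned long flags)
{
	assert(fd >= 0);
	return syscall(CR_NR_RESTART, crid, fd, flags);
}

/* Map the return value of cr_checkpoint() to one of the three cases. */
enum cr_outcome cr_classify(long ret)
{
	if (ret < 0)
		return CR_FAILED;
	return ret == 0 ? CR_RESTARTED : CR_TAKEN;
}
```

A caller would typically switch on cr_classify(cr_checkpoint(pid, fd, 0))
and handle the three cases the same way the fork()-like snippet above
does.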
+Note that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships
+of the task with other tasks, or any shared resources. It is useful
+for applications that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+To illustrate how the API works, refer to these sample programs:
+
+* ckpt.c: accepts a 'pid' argument and checkpoints that task to stdout
+* rstr.c: restarts a checkpoint image from stdin
+* self.c: a simple test program doing self-checkpoint
+* test.c: a simple test program to checkpoint
+
+"External" checkpoint:
+---------------------
+To do "external" checkpoint, you need to first freeze that other task
+either using the freezer cgroup, or by sending SIGSTOP.
+
+Restart does not preserve the original PID yet (because we haven't
+yet solved the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new namespace, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+	$ ./test &
+	[1] 3493
+
+	$ kill -STOP 3493
+	$ ./ckpt 3493 > ckpt.image
+
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ kill -CONT 3493
+
+	$ ./rstr < ckpt.image
+Now compare the output of the two output files.
+
+"Self" checkpoint:
+------------------
+To do "self" checkpoint, you can incorporate the code from ckpt.c into
+your application.
+
+Here is how to test the "self" checkpoint:
+	$ ./self > self.image &
+	[1] 3512
+
+	$ sleep 3
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ cat /tmp/cr-test.out
+	hello, world (85.46)!
+	count 0 (85.46)!
+	count 1 (85.56)!
+	count 2 (85.76)!
+	count 3 (86.46)!
+
+	$ sed -i 's/count/xxxx/g' /tmp/cr-test.out
+
+	$ ./rstr < self.image &
+Now compare the output of the two output files.
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "rstr" changed its name
+to "test" or "self", as expected.
+
-- 
1.5.4.3


* [RFC v11][PATCH 02/13] Checkpoint/restart: initial documentation
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.

Changelog[v8]:
  - Split into multiple files in Documentation/checkpoint/...
  - Extend documentation, fix typos and comments from feedback

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 Documentation/checkpoint/ckpt.c        |   32 ++++++
 Documentation/checkpoint/internals.txt |  133 +++++++++++++++++++++++++
 Documentation/checkpoint/readme.txt    |  105 +++++++++++++++++++
 Documentation/checkpoint/rstr.c        |   20 ++++
 Documentation/checkpoint/security.txt  |   38 +++++++
 Documentation/checkpoint/self.c        |   57 +++++++++++
 Documentation/checkpoint/test.c        |   48 +++++++++
 Documentation/checkpoint/usage.txt     |  171 ++++++++++++++++++++++++++++++++
 8 files changed, 604 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint/ckpt.c
 create mode 100644 Documentation/checkpoint/internals.txt
 create mode 100644 Documentation/checkpoint/readme.txt
 create mode 100644 Documentation/checkpoint/rstr.c
 create mode 100644 Documentation/checkpoint/security.txt
 create mode 100644 Documentation/checkpoint/self.c
 create mode 100644 Documentation/checkpoint/test.c
 create mode 100644 Documentation/checkpoint/usage.txt

diff --git a/Documentation/checkpoint/ckpt.c b/Documentation/checkpoint/ckpt.c
new file mode 100644
index 0000000..094408c
--- /dev/null
+++ b/Documentation/checkpoint/ckpt.c
@@ -0,0 +1,32 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int ret;
+
+	if (argc != 2) {
+		fprintf(stderr, "usage: ckpt PID\n");
+		exit(1);
+	}
+
+	pid = atoi(argv[1]);
+	if (pid <= 0) {
+		fprintf(stderr, "invalid pid\n");
+		exit(1);
+	}
+
+	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+
+	if (ret < 0)
+		perror("checkpoint");
+	else
+		fprintf(stderr, "checkpoint id %d\n", ret);
+
+	return (ret > 0 ? 0 : 1);
+}
+
diff --git a/Documentation/checkpoint/internals.txt b/Documentation/checkpoint/internals.txt
new file mode 100644
index 0000000..b363e83
--- /dev/null
+++ b/Documentation/checkpoint/internals.txt
@@ -0,0 +1,133 @@
+
+	===== Internals of Checkpoint-Restart =====
+
+
+(1) Order of state dump
+
+The order of operations, both save and restore, is as follows:
+
+* Header section: header, container information, etc.
+
+* Global section: [TBD] global resources such as IPC, UTS, etc.
+
+* Process forest: [TBD] tasks and their relationships
+
+* Per task data (for each task):
+  -> task state: elements of task_struct
+  -> thread state: elements of thread_struct and thread_info
+  -> CPU state: registers etc, including FPU
+  -> memory state: memory address space layout and contents
+  -> filesystem state: [TBD] filesystem namespace state, chroot, cwd, etc
+  -> files state: open file descriptors and their state
+  -> signals state: [TBD] pending signals and signal handling state
+  -> credentials state: [TBD] user and group state, statistics
+
+
+(2) Checkpoint image format
+
+The checkpoint image format is composed of records, each consisting of
+a pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by "struct cr_hdr" as follows:
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+'type' identifies the type of the payload, 'len' tells its length in
+bytes, and 'parent' identifies the owner object instance. The meaning
+of 'parent' varies depending on the type. For example, for CR_HDR_MM,
+'parent' identifies the task to which this MM belongs. The payload
+also varies depending on the type, for instance, the data describing a
+task_struct is given by a 'struct cr_hdr_task' (type CR_HDR_TASK) and
+so on.
+
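As an illustration of the record layout just described, here is a
minimal userspace sketch that walks a buffer of back-to-back records.
It is not taken from the patch: cr_walk() and its visit callback are
hypothetical, the <stdint.h> types stand in for the kernel's __s16 and
__u32, and it assumes a stream made up purely of pre-header plus payload
records in native byte order. Data that is not wrapped in such records
would need type-specific handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same fields as struct cr_hdr in the text, with <stdint.h> types. */
struct cr_hdr {
	int16_t type;		/* payload type */
	int16_t len;		/* payload length in bytes */
	uint32_t parent;	/* owner object instance */
};

/*
 * Walk the records in buf[0..size) and call visit() on each one, if
 * visit is not NULL. Returns the number of records seen, or -1 if a
 * record is truncated or carries a negative length.
 */
long cr_walk(const unsigned char *buf, size_t size,
	     void (*visit)(const struct cr_hdr *h,
			   const unsigned char *payload))
{
	size_t off = 0;
	long n = 0;

	assert(buf != NULL || size == 0);

	while (off < size) {
		struct cr_hdr h;

		if (size - off < sizeof(h))
			return -1;	/* truncated pre-header */
		memcpy(&h, buf + off, sizeof(h));	/* no unaligned reads */
		off += sizeof(h);

		if (h.len < 0 || (size_t)h.len > size - off)
			return -1;	/* bad or truncated payload */
		if (visit)
			visit(&h, buf + off);
		off += (size_t)h.len;
		n++;
	}
	return n;
}
```

Copying each pre-header out with memcpy() keeps the walker independent
of how the buffer happens to be aligned.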
+The format of the memory dump is as follows: for each VMA, there is a
+'struct cr_hdr_vma'; if the VMA is file-mapped, it is followed by the file
+name. Following comes the actual contents, in one or more chunks: each
+chunk begins with a header that specifies how many pages it holds,
+then the virtual addresses of all the dumped pages in that chunk,
+followed by the actual contents of all the dumped pages. A header with
+zero number of pages marks the end of the contents for a particular
+VMA. Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+cr_hdr + cr_hdr_head
+cr_hdr + cr_hdr_task
+	cr_hdr + cr_hdr_mm
+		cr_hdr + cr_hdr_vma + cr_hdr + string
+			cr_hdr_pgarr (nr_pages = 2)
+			addr1, addr2
+			page1, page2
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_hdr_vma
+			cr_hdr_pgarr (nr_pages = 3)
+			addr3, addr4, addr5
+			page3, page4, page5
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_mm_context
+	cr_hdr + cr_hdr_thread
+	cr_hdr + cr_hdr_cpu
+cr_hdr + cr_hdr_tail
+
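The chunked page dump of a single VMA can be sketched as follows. The
layout here is an assumption made for illustration, since the text does
not show the actual chunk header: each chunk is taken to start with a
bare 32-bit page count, followed by that many 64-bit virtual addresses
and then that many pages of contents, and a count of zero ends the VMA.
cr_vma_pages() is a hypothetical name, and the page size is passed in
as a parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Count the pages dumped for one VMA in buf[0..size).
 * Assumed chunk layout: u32 nr_pages, nr_pages u64 addresses, then
 * nr_pages pages of page_size bytes. nr_pages == 0 ends the VMA.
 * Returns the total number of pages, or -1 if the data is truncated.
 * On success, *consumed (if not NULL) gets the number of bytes used.
 */
long cr_vma_pages(const unsigned char *buf, size_t size, size_t page_size,
		  size_t *consumed)
{
	size_t per_page = sizeof(uint64_t) + page_size;
	size_t off = 0;
	long total = 0;

	assert(buf != NULL && page_size > 0);

	for (;;) {
		uint32_t nr;

		if (size - off < sizeof(nr))
			return -1;	/* truncated chunk header */
		memcpy(&nr, buf + off, sizeof(nr));
		off += sizeof(nr);

		if (nr == 0)
			break;		/* end of contents for this VMA */

		/* Division avoids overflow in nr * per_page. */
		if (nr > (size - off) / per_page)
			return -1;	/* truncated addresses or pages */
		off += (size_t)nr * per_page;
		total += nr;
	}

	if (consumed)
		*consumed = off;
	return total;
}
```

With this layout, the file-mapped VMA in the illustration above (two
dumped pages) would occupy one chunk with a count of 2 followed by a
terminating chunk with a count of 0.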
+
+(3) Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects and whether they were already saved.  Shared
+objects are stored in a hash table as they appear, indexed by their
+kernel address. (The hash table itself is not saved as part of the
+checkpoint image: it is constructed dynamically during both checkpoint
+and restart, and discarded at the end of the operation).
+
+Each shared object that is found is first looked up in the hash table.
+On the first encounter, the object will not be found, so its state is
+dumped, and the object is assigned a unique identifier and also stored
+in the hash table. Subsequent lookups of that object in the hash table
+will yield that entry, and then only the unique identifier is saved,
+as opposed to the entire state of the object.
+
+During restart, shared objects are seen by their unique identifiers as
+assigned during the checkpoint. Each shared object that is read in is
+first looked up in the hash table. On the first encounter it will not
+be found, meaning that the object needs to be created and its state
+read in and restored. Then the object is added to the hash table, this
+time indexed by its unique identifier. Subsequent lookups of the same
+unique identifier in the hash table will yield that entry, and then
+the existing object instance is reused instead of creating another one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+The interface for the hash table is the following:
+
+cr_obj_get_by_ptr() - find the unique object reference (objref)
+  of the object that is pointed to by ptr [checkpoint]
+
+cr_obj_add_ptr() - add the object pointed to by ptr to the hash table
+  if not already there, and fill its unique object reference (objref)
+
+cr_obj_get_by_ref() - return the pointer to the object whose unique
+  object reference is equal to objref [restart]
+
+cr_obj_add_ref() - add the object with given unique object reference
+  (objref), pointed to by ptr to the hash table. [restart]
+
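The checkpoint-side behaviour of the shared-object table described in
section (3) of internals.txt can be sketched in userspace as below. This
is a simplified illustration under stated assumptions, not the kernel
code: cr_obj_lookup_add(), cr_objhash_free() and both structs are
hypothetical, the table is a small chained hash using malloc(), and it
does not take references on the objects it tracks. It only shows the
"dump full state on first sight, write just the identifier afterwards"
logic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CR_HASH_BUCKETS 64	/* arbitrary size for this sketch */

struct cr_obj {
	const void *ptr;	/* lookup key during checkpoint */
	int objref;		/* unique identifier saved in the image */
	struct cr_obj *next;	/* next entry in the same bucket */
};

struct cr_objhash {
	struct cr_obj *bucket[CR_HASH_BUCKETS];
	int next_objref;	/* last identifier handed out */
};

static unsigned int cr_hash_ptr(const void *ptr)
{
	return (unsigned int)(((uintptr_t)ptr >> 4) % CR_HASH_BUCKETS);
}

/*
 * Look up ptr. On the first encounter, add it with a fresh objref and
 * set *first to 1 (the caller dumps the full state). On later
 * encounters, set *first to 0 (the caller saves only the objref).
 * Returns the objref, or -1 if memory allocation fails.
 */
int cr_obj_lookup_add(struct cr_objhash *h, const void *ptr, int *first)
{
	unsigned int b = cr_hash_ptr(ptr);
	struct cr_obj *o;

	assert(h != NULL && first != NULL);

	for (o = h->bucket[b]; o; o = o->next) {
		if (o->ptr == ptr) {
			*first = 0;
			return o->objref;
		}
	}

	o = malloc(sizeof(*o));
	if (!o)
		return -1;
	o->ptr = ptr;
	o->objref = ++h->next_objref;
	o->next = h->bucket[b];
	h->bucket[b] = o;
	*first = 1;
	return o->objref;
}

/* Entries are never removed one by one, only with the whole table. */
void cr_objhash_free(struct cr_objhash *h)
{
	int i;

	for (i = 0; i < CR_HASH_BUCKETS; i++) {
		struct cr_obj *o = h->bucket[i];

		while (o) {
			struct cr_obj *next = o->next;

			free(o);
			o = next;
		}
		h->bucket[i] = NULL;
	}
}
```

On restart the same idea applies with the roles swapped: the table would
be keyed by objref and would return the pointer of the object that was
already re-created.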
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
new file mode 100644
index 0000000..344a551
--- /dev/null
+++ b/Documentation/checkpoint/readme.txt
@@ -0,0 +1,105 @@
+
+	===== Checkpoint-Restart support in the Linux kernel =====
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl@cs.columbia.edu>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+
+Reviewers:	Serge Hallyn <serue@us.ibm.com>
+		Dave Hansen <dave@linux.vnet.ibm.com>
+
+Application checkpoint/restart [C/R] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. C/R can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint.
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long-running,
+  CPU-intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime.
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relatively opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time.  The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial C/R products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide C/R: sys_checkpoint()
+and sys_restart(). The checkpoint code basically serializes internal
+kernel state and writes it out to a file descriptor, and the resulting
+image is stream-able. More specifically, it consists of 5 steps:
+
+1. Pre-dump
+2. Freeze the container
+3. Dump
+4. Thaw (or kill) the container
+5. Post-dump
+
+Steps 1 and 5 are an optimization to reduce application downtime. In
+particular, "pre-dump" works before freezing the container, e.g. the
+pre-copy for live migration, and "post-dump" works after the container
+resumes execution, e.g. write-back the data to secondary storage.
+
+The restart code basically reads the saved kernel state from a file
+descriptor, and re-creates the tasks and the resources they need to
+resume execution. The restart code is executed by each task that is
+restored in a new container to reconstruct its own state.
+
+
+=== Current Implementation
+
+* How useful is this code as it stands in real-world usage?
+
+Right now, the application must be a single process that does not
+share any resources with other processes. The only file descriptors
+that may be open are simple files and directories; they may not
+include devices, sockets or pipes.
+
+For an "external" checkpoint, the caller must first freeze (or stop)
+the target process. For "self" checkpoint, the application must be
+specifically written to use the new system calls. The restart does not
+yet preserve the pid of the original process, but will use whatever
+pid it was given by the kernel.
+
+What this means in practice is that it is useful for a simple
+application doing computational work and input/output from/to files.
+
+Currently, namespaces are not saved or restored. They will be treated
+as a class of shared object. In particular, it is assumed that the
+task's file system namespace is the "root" for the entire container.
+It is also assumed that the same file system view is available for the
+restart task(s). Otherwise, a file system snapshot is required.
+
+* What additional work needs to be done to it?
+
+We know this design can work.  We have two commercial products and a
+horde of academic projects doing it today using this basic design.
+We're early in this particular implementation because we're trying to
+release early and often.
+
diff --git a/Documentation/checkpoint/rstr.c b/Documentation/checkpoint/rstr.c
new file mode 100644
index 0000000..288209d
--- /dev/null
+++ b/Documentation/checkpoint/rstr.c
@@ -0,0 +1,20 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+	if (ret < 0)
+		perror("restart");
+
+	printf("should not reach here !\n");
+
+	return 1;
+}
+
diff --git a/Documentation/checkpoint/security.txt b/Documentation/checkpoint/security.txt
new file mode 100644
index 0000000..e5b4107
--- /dev/null
+++ b/Documentation/checkpoint/security.txt
@@ -0,0 +1,38 @@
+
+	===== Security considerations for Checkpoint-Restart =====
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it. Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+read mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image.  When restoration of credentials
+becomes supported, the ability of the task that calls sys_restart()
+to setresuid/setresgid to those values must definitely be checked.
+
+Keeping the restart procedure within the limits of the caller's
+credentials means that there are various scenarios that cannot
+be supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+
diff --git a/Documentation/checkpoint/self.c b/Documentation/checkpoint/self.c
new file mode 100644
index 0000000..febb888
--- /dev/null
+++ b/Documentation/checkpoint/self.c
@@ -0,0 +1,57 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	FILE *file;
+	int i, ret;
+	float a;
+
+	close(0);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+
+		if (i == 2) {
+			ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+			if (ret < 0) {
+				fprintf(file, "ckpt: %s\n", strerror(errno));
+				exit(2);
+			}
+			fprintf(file, "checkpoint ret: %d\n", ret);
+			fflush(file);
+		}
+	}
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/test.c b/Documentation/checkpoint/test.c
new file mode 100644
index 0000000..1183655
--- /dev/null
+++ b/Documentation/checkpoint/test.c
@@ -0,0 +1,48 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	FILE *file;
+	float a;
+	int i;
+
+	close(0);
+	close(1);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+	}
+
+	fprintf(file, "world, hello (%.2f) !\n", a);
+	fflush(file);
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..1b42d6b
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,171 @@
+
+	===== How to use Checkpoint-Restart =====
+
+The API consists of two new system calls:
+
+* int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+
+    Checkpoint a container whose init task is identified by pid, to
+    the file designated by fd. 'flags' will have future meaning (must
+    be 0 for now).
+
+    Returns: a positive checkpoint identifier (crid) upon success, 0
+    if it returns from a restart, and -1 if an error occurs.
+
+    'crid' uniquely identifies a checkpoint image. For each checkpoint
+    the kernel allocates a unique 'crid', that remains valid for as
+    long as the checkpoint is kept in the kernel (for instance, when a
+    checkpoint, or a partial checkpoint, may reside in kernel memory).
+
+* int sys_restart(int crid, int fd, unsigned long flags);
+
+    Restart a container from a checkpoint image that is read from the
+    blob stored in the file designated by fd. 'crid' will have future
+    meaning (must be 0 for now). 'flags' will have future meaning
+    (must be 0 for now).
+
+    The role of 'crid' is to identify the checkpoint image in the case
+    that it remains in kernel memory. This will be useful to restart
+    from a checkpoint image that remains in kernel memory.
+
+    Returns: -1 if an error occurs, 0 on success when restarting from
+    a "self" checkpoint, and return value of system call at the time
+    of the checkpoint when restarting from an "external" checkpoint.
+
+    If restarting from an "external" checkpoint, tasks that were
+    executing a system call will observe the return value of that
+    system call (as it was when interrupted for the act of taking the
+    checkpoint), and tasks that were executing in user space will be
+    ready to return there.
+
+    Upon successful "external" restart, the container will end up in a
+    frozen state.
+
+The granularity of a checkpoint usually is a whole container. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 in the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+If the caller passes a pid which does not refer to a container's init
+task, then sys_checkpoint() returns -EINVAL. (This is because
+with nested containers a task may belong to more than one container.)
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases,
+if there are other tasks possibly sharing state with the container,
+they must not modify it during the operation. It is the responsibility
+of the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g. not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+Here is a code snippet that illustrates how a checkpoint is initiated
+by a process in a container - the logic is similar to fork():
+	...
+	crid = checkpoint(1, ...);
+	switch (crid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(crid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+Note that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships
+of the task with other tasks, or any shared resources. It is useful
+for applications that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+To illustrate how the API works, refer to these sample programs:
+
+* ckpt.c: accepts a 'pid' argument and checkpoints that task to stdout
+* rstr.c: restarts a checkpoint image from stdin
+* self.c: a simple test program doing self-checkpoint
+* test.c: a simple test program to checkpoint
+
+"External" checkpoint:
+---------------------
+To do "external" checkpoint, you need to first freeze that other task
+either using the freezer cgroup, or by sending SIGSTOP.
+
+Restart does not preserve the original PID yet (because we haven't
+yet solved the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new namespace, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+	$ ./test &
+	[1] 3493
+
+	$ kill -STOP 3493
+	$ ./ckpt 3493 > ckpt.image
+
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ kill -CONT 3493
+
+	$ ./rstr < ckpt.image
+Now compare the output of the two output files.
+
+"Self" checkpoint:
+-----------------
+To do "self" checkpoint, you can incorporate the code from ckpt.c into
+your application.
+
+Here is how to test the "self" checkpoint:
+	$ ./self > self.image &
+	[1] 3512
+
+	$ sleep 3
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ cat /tmp/cr-test.out
+	hello, world (85.46)!
+	count 0 (85.46)!
+	count 1 (85.56)!
+	count 2 (85.76)!
+	count 3 (86.46)!
+
+	$ sed -i 's/count/xxxx/g' /tmp/cr-test.out
+
+	$ ./rstr < self.image &
+Now compare the output of the two output files.
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "rstr" changed its name
+to "test" or "self", as expected.
+
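The fork()-like return convention of sys_checkpoint() described in usage.txt can be sketched as a small helper. This is only an illustration: 'enum ckpt_outcome', classify_checkpoint_ret() and describe_checkpoint_ret() are hypothetical names that do not appear in the patch. The helper takes the return value as an argument instead of invoking the syscall, so it builds on kernels that have no __NR_checkpoint.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
enum ckpt_outcome {
	CKPT_FAILED,	/* ret < 0: the checkpoint failed, errno is set */
	CKPT_RESTARTED,	/* ret == 0: we are returning from a restart */
	CKPT_TAKEN,	/* ret > 0: checkpoint succeeded, ret is the crid */
};

/* Map the return value of the checkpoint call onto the three cases. */
enum ckpt_outcome classify_checkpoint_ret(long ret)
{
	if (ret < 0)
		return CKPT_FAILED;
	if (ret == 0)
		return CKPT_RESTARTED;
	return CKPT_TAKEN;
}

/* Human-readable form of the outcome, e.g. for a log line. */
const char *describe_checkpoint_ret(long ret)
{
	switch (classify_checkpoint_ret(ret)) {
	case CKPT_FAILED:
		return "checkpoint failed";
	case CKPT_RESTARTED:
		return "returned after restart";
	case CKPT_TAKEN:
		return "checkpoint succeeded";
	}
	assert(!"unreachable");
	return "";
}
```

A caller would pass the result of syscall(__NR_checkpoint, ...) straight into classify_checkpoint_ret() and switch on the outcome, much like the snippet in usage.txt switches on 'crid'.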
-- 
1.5.4.3



* [RFC v11][PATCH 02/13] Checkpoint/restart: initial documentation
@ 2008-12-05 17:31   ` Oren Laadan
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd-r2nGTMty4D4,
	jeremy-TSDbQ3PG+2Y, Oren Laadan

Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.

Changelog[v8]:
  - Split into multiple files in Documentation/checkpoint/...
  - Extend documentation, fix typos and comments from feedback

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 Documentation/checkpoint/ckpt.c        |   32 ++++++
 Documentation/checkpoint/internals.txt |  133 +++++++++++++++++++++++++
 Documentation/checkpoint/readme.txt    |  105 +++++++++++++++++++
 Documentation/checkpoint/rstr.c        |   20 ++++
 Documentation/checkpoint/security.txt  |   38 +++++++
 Documentation/checkpoint/self.c        |   57 +++++++++++
 Documentation/checkpoint/test.c        |   48 +++++++++
 Documentation/checkpoint/usage.txt     |  171 ++++++++++++++++++++++++++++++++
 8 files changed, 604 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint/ckpt.c
 create mode 100644 Documentation/checkpoint/internals.txt
 create mode 100644 Documentation/checkpoint/readme.txt
 create mode 100644 Documentation/checkpoint/rstr.c
 create mode 100644 Documentation/checkpoint/security.txt
 create mode 100644 Documentation/checkpoint/self.c
 create mode 100644 Documentation/checkpoint/test.c
 create mode 100644 Documentation/checkpoint/usage.txt

diff --git a/Documentation/checkpoint/ckpt.c b/Documentation/checkpoint/ckpt.c
new file mode 100644
index 0000000..094408c
--- /dev/null
+++ b/Documentation/checkpoint/ckpt.c
@@ -0,0 +1,32 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int ret;
+
+	if (argc != 2) {
+		fprintf(stderr, "usage: ckpt PID\n");
+		exit(1);
+	}
+
+	pid = atoi(argv[1]);
+	if (pid <= 0) {
+		fprintf(stderr, "invalid pid\n");
+		exit(1);
+	}
+	/* the image is written to stdout, so diagnostics go to stderr */
+	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+
+	if (ret < 0)
+		perror("checkpoint");
+	else
+		fprintf(stderr, "checkpoint id %d\n", ret);
+
+	return (ret > 0 ? 0 : 1);
+}
+
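The image that ckpt.c writes to its standard output is, according to the image-format section of internals.txt in this patch, a stream of records, each made of a 'struct cr_hdr' pre-header followed by a payload. Here is a rough userspace sketch of walking such a stream in memory. It rests on several assumptions that the documentation does not spell out: that 'len' counts payload bytes only, that fields are in host byte order, and that <stdint.h> types can stand in for __s16/__u32. cr_put_record() and cr_walk_records() are hypothetical helpers, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as 'struct cr_hdr' in internals.txt. */
struct cr_hdr {
	int16_t type;
	int16_t len;
	uint32_t parent;
};

/* Called once per record; a nonzero return stops the walk early. */
typedef int (*cr_record_fn)(const struct cr_hdr *h, const void *payload,
			    void *arg);

/*
 * Append one record at offset 'off' of 'buf' and return the offset just
 * past it. The caller guarantees that 'buf' is large enough.
 */
size_t cr_put_record(unsigned char *buf, size_t off, int16_t type,
		     uint32_t parent, const void *payload, int16_t len)
{
	struct cr_hdr h = { .type = type, .len = len, .parent = parent };

	assert(len >= 0);
	memcpy(buf + off, &h, sizeof(h));
	if (len > 0)
		memcpy(buf + off + sizeof(h), payload, (size_t)len);
	return off + sizeof(h) + (size_t)len;
}

/*
 * Walk all records in image[0..size). Returns the number of records
 * visited, or -1 if a pre-header or payload is truncated or malformed.
 */
long cr_walk_records(const void *image, size_t size, cr_record_fn fn,
		     void *arg)
{
	const unsigned char *p = image;
	size_t off = 0;
	long n = 0;

	while (off < size) {
		struct cr_hdr h;

		if (size - off < sizeof(h))
			return -1;		/* truncated pre-header */
		memcpy(&h, p + off, sizeof(h));	/* avoids unaligned reads */
		off += sizeof(h);

		if (h.len < 0 || (size_t)h.len > size - off)
			return -1;		/* bad or truncated payload */

		n++;
		if (fn && fn(&h, p + off, arg))
			break;
		off += (size_t)h.len;
	}
	return n;
}
```

Copying the pre-header out with memcpy() keeps the walker independent of the alignment of the buffer, which matters once payloads of odd length are interleaved.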
diff --git a/Documentation/checkpoint/internals.txt b/Documentation/checkpoint/internals.txt
new file mode 100644
index 0000000..b363e83
--- /dev/null
+++ b/Documentation/checkpoint/internals.txt
@@ -0,0 +1,133 @@
+
+	===== Internals of Checkpoint-Restart =====
+
+
+(1) Order of state dump
+
+The order of operations, both save and restore, is as follows:
+
+* Header section: header, container information, etc.
+
+* Global section: [TBD] global resources such as IPC, UTS, etc.
+
+* Process forest: [TBD] tasks and their relationships
+
+* Per task data (for each task):
+  -> task state: elements of task_struct
+  -> thread state: elements of thread_struct and thread_info
+  -> CPU state: registers etc, including FPU
+  -> memory state: memory address space layout and contents
+  -> filesystem state: [TBD] filesystem namespace state, chroot, cwd, etc
+  -> files state: open file descriptors and their state
+  -> signals state: [TBD] pending signals and signal handling state
+  -> credentials state: [TBD] user and group state, statistics
+
+
+(2) Checkpoint image format
+
+The checkpoint image format is composed of records consisting of a
+pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by "struct cr_hdr" as follows:
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+'type' identifies the type of the payload, 'len' tells its length in
+bytes, and 'parent' identifies the owner object instance. The meaning
+of 'parent' varies depending on the type. For example, for CR_HDR_MM,
+'parent' identifies the task to which this MM belongs. The payload
+also varies depending on the type, for instance, the data describing a
+task_struct is given by a 'struct cr_hdr_task' (type CR_HDR_TASK) and
+so on.
+
+The format of the memory dump is as follows: for each VMA, there is a
+'struct cr_hdr_vma'; if the VMA is file-mapped, it is followed by the file
+name. Following comes the actual contents, in one or more chunks: each
+chunk begins with a header that specifies how many pages it holds,
+then the virtual addresses of all the dumped pages in that chunk,
+followed by the actual contents of all the dumped pages. A header with
+zero number of pages marks the end of the contents for a particular
+VMA. Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+cr_hdr + cr_hdr_head
+cr_hdr + cr_hdr_task
+	cr_hdr + cr_hdr_mm
+		cr_hdr + cr_hdr_vma + cr_hdr + string
+			cr_hdr_pgarr (nr_pages = 2)
+			addr1, addr2
+			page1, page2
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_hdr_vma
+			cr_hdr_pgarr (nr_pages = 3)
+			addr3, addr4, addr5
+			page3, page4, page5
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_mm_context
+	cr_hdr + cr_hdr_thread
+	cr_hdr + cr_hdr_cpu
+cr_hdr + cr_hdr_tail
+
+
+(3) Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects and whether they were already saved.  Shared
+objects are stored in a hash table as they appear, indexed by their
+kernel address. (The hash table itself is not saved as part of the
+checkpoint image: it is constructed dynamically during both checkpoint
+and restart, and discarded at the end of the operation).
+
+Each shared object that is found is first looked up in the hash table.
+On the first encounter, the object will not be found, so its state is
+dumped, and the object is assigned a unique identifier and also stored
+in the hash table. Subsequent lookups of that object in the hash table
+will yield that entry, and then only the unique identifier is saved,
+as opposed to the entire state of the object.
+
+During restart, shared objects are seen by their unique identifiers as
+assigned during the checkpoint. Each shared object that is read in is
+first looked up in the hash table. On the first encounter it will not
+be found, meaning that the object needs to be created and its state
+read in and restored. Then the object is added to the hash table, this
+time indexed by its unique identifier. Subsequent lookups of the same
+unique identifier in the hash table will yield that entry, and then
+the existing object instance is reused instead of creating another one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+The interface for the hash table is the following:
+
+cr_obj_get_by_ptr() - find the unique object reference (objref)
+  of the object that is pointed to by ptr [checkpoint]
+
+cr_obj_add_ptr() - add the object pointed to by ptr to the hash table
+  if not already there, and fill its unique object reference (objref)
+
+cr_obj_get_by_ref() - return the pointer to the object whose unique
+  object reference is equal to objref [restart]
+
+cr_obj_add_ref() - add the object with given unique object reference
+  (objref), pointed to by ptr to the hash table. [restart]
+
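The checkpoint-side behaviour of the shared-object hash described above (first encounter: assign a unique identifier and dump the full state; later encounters: emit only the identifier) can be sketched in userspace as follows. This is not the kernel's cr_obj_*() code: the objhash_*() names, the fixed bucket count and the chained layout are assumptions made for the illustration, and the reference counting mentioned in internals.txt is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define OBJHASH_BUCKETS 64

struct objhash_entry {
	const void *ptr;		/* key: address of the shared object */
	int objref;			/* unique identifier for the image */
	struct objhash_entry *next;
};

/* Zero-initialize before use, e.g. 'struct objhash h = {0};'. */
struct objhash {
	struct objhash_entry *bucket[OBJHASH_BUCKETS];
	int last_objref;	/* identifiers start at 1; 0 means "not found" */
};

static unsigned int objhash_slot(const void *ptr)
{
	return (unsigned int)(((uintptr_t)ptr >> 4) % OBJHASH_BUCKETS);
}

/* Return the objref of 'ptr', or 0 if it has not been added yet. */
int objhash_get_by_ptr(const struct objhash *h, const void *ptr)
{
	const struct objhash_entry *e;

	for (e = h->bucket[objhash_slot(ptr)]; e; e = e->next)
		if (e->ptr == ptr)
			return e->objref;
	return 0;
}

/*
 * Add 'ptr' if absent and store its objref in '*objref'.
 * Returns 1 if newly added (caller dumps the full state), 0 if it was
 * already present (caller writes only the objref), -1 if out of memory.
 */
int objhash_add_ptr(struct objhash *h, const void *ptr, int *objref)
{
	struct objhash_entry *e;
	unsigned int slot = objhash_slot(ptr);

	assert(h != NULL && objref != NULL);
	*objref = objhash_get_by_ptr(h, ptr);
	if (*objref)
		return 0;

	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->ptr = ptr;
	e->objref = ++h->last_objref;
	e->next = h->bucket[slot];
	h->bucket[slot] = e;
	*objref = e->objref;
	return 1;
}

/* Discard the whole table, as happens when the operation terminates. */
void objhash_free(struct objhash *h)
{
	unsigned int i;

	for (i = 0; i < OBJHASH_BUCKETS; i++) {
		while (h->bucket[i]) {
			struct objhash_entry *e = h->bucket[i];

			h->bucket[i] = e->next;
			free(e);
		}
	}
	h->last_objref = 0;
}
```

Entries are only ever removed by objhash_free(), which mirrors the "one-way" property of the hash described in the text. A restart-side table would be keyed by objref instead of by pointer.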
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
new file mode 100644
index 0000000..344a551
--- /dev/null
+++ b/Documentation/checkpoint/readme.txt
@@ -0,0 +1,105 @@
+
+	===== Checkpoint-Restart support in the Linux kernel =====
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+
+Reviewers:	Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
+		Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
+
+Application checkpoint/restart [C/R] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. C/R can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint.
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime.
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relatively opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time.  The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial C/R products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide C/R: sys_checkpoint()
+and sys_restart(). The checkpoint code basically serializes internal
+kernel state and writes it out to a file descriptor, and the resulting
+image is stream-able. More specifically, it consists of 5 steps:
+
+1. Pre-dump
+2. Freeze the container
+3. Dump
+4. Thaw (or kill) the container
+5. Post-dump
+
+Steps 1 and 5 are an optimization to reduce application downtime. In
+particular, "pre-dump" works before freezing the container, e.g. the
+pre-copy for live migration, and "post-dump" works after the container
+resumes execution, e.g. writing the data back to secondary storage.
+
+The restart code basically reads the saved kernel state from a file
+descriptor, and re-creates the tasks and the resources they need to
+resume execution. The restart code is executed by each task that is
+restored in a new container to reconstruct its own state.
+
+
+=== Current Implementation
+
+* How useful is this code as it stands in real-world usage?
+
+Right now, the application must be a single process that does not
+share any resources with other processes. The only file descriptors
+that may be open are simple files and directories; they may not
+include devices, sockets or pipes.
+
+For an "external" checkpoint, the caller must first freeze (or stop)
+the target process. For "self" checkpoint, the application must be
+specifically written to use the new system calls. The restart does not
+yet preserve the pid of the original process, but will use whatever
+pid it was given by the kernel.
+
+What this means in practice is that it is useful for a simple
+application doing computational work and input/output from/to files.
+
+Currently, namespaces are not saved or restored. They will be treated
+as a class of shared object. In particular, it is assumed that the
+task's file system namespace is the "root" for the entire container.
+It is also assumed that the same file system view is available for the
+restart task(s). Otherwise, a file system snapshot is required.
+
+* What additional work needs to be done to it?
+
+We know this design can work.  We have two commercial products and a
+horde of academic projects doing it today using this basic design.
+We're early in this particular implementation because we're trying to
+release early and often.
+
diff --git a/Documentation/checkpoint/rstr.c b/Documentation/checkpoint/rstr.c
new file mode 100644
index 0000000..288209d
--- /dev/null
+++ b/Documentation/checkpoint/rstr.c
@@ -0,0 +1,20 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+	if (ret < 0)
+		perror("restart");
+
+	printf("should not reach here !\n");
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/security.txt b/Documentation/checkpoint/security.txt
new file mode 100644
index 0000000..e5b4107
--- /dev/null
+++ b/Documentation/checkpoint/security.txt
@@ -0,0 +1,38 @@
+
+	===== Security considerations for Checkpoint-Restart =====
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it. Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+read mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image.  When restoration of credentials
+becomes supported, then definitely the ability of the task that calls
+sys_restart() to setresuid/setresgid to those values must be checked.
+
+Keeping the restart procedure within the limits of the
+caller's credentials means that there are various scenarios that cannot
+be supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+
diff --git a/Documentation/checkpoint/self.c b/Documentation/checkpoint/self.c
new file mode 100644
index 0000000..febb888
--- /dev/null
+++ b/Documentation/checkpoint/self.c
@@ -0,0 +1,57 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	FILE *file;
+	int i, ret;
+	float a;
+
+	close(0);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+
+		if (i == 2) {
+			ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+			if (ret < 0) {
+				fprintf(file, "ckpt: %s\n", strerror(errno));
+				exit(2);
+			}
+			fprintf(file, "checkpoint ret: %d\n", ret);
+			fflush(file);
+		}
+	}
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/test.c b/Documentation/checkpoint/test.c
new file mode 100644
index 0000000..1183655
--- /dev/null
+++ b/Documentation/checkpoint/test.c
@@ -0,0 +1,48 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	FILE *file;
+	float a;
+	int i;
+
+	close(0);
+	close(1);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+	}
+
+	fprintf(file, "world, hello (%.2f) !\n", a);
+	fflush(file);
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..1b42d6b
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,171 @@
+
+	===== How to use Checkpoint-Restart =====
+
+The API consists of two new system calls:
+
+* int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+
+    Checkpoint a container whose init task is identified by pid, to
+    the file designated by fd. 'flags' will have future meaning (must
+    be 0 for now).
+
+    Returns: a positive checkpoint identifier (crid) upon success, 0
+    if it returns from a restart, and -1 if an error occurs.
+
+    'crid' uniquely identifies a checkpoint image. For each checkpoint
+    the kernel allocates a unique 'crid', that remains valid for as
+    long as the checkpoint is kept in the kernel (for instance, when a
+    checkpoint, or a partial checkpoint, may reside in kernel memory).
+
+* int sys_restart(int crid, int fd, unsigned long flags);
+
+    Restart a container from a checkpoint image that is read from the
+    blob stored in the file designated by fd. 'crid' will have future
+    meaning (must be 0 for now). 'flags' will have future meaning
+    (must be 0 for now).
+
+    The role of 'crid' is to identify the checkpoint image in the case
+    that it remains in kernel memory. This will be useful to restart
+    from a checkpoint image that remains in kernel memory.
+
+    Returns: -1 if an error occurs, 0 on success when restarting from
+    a "self" checkpoint, and return value of system call at the time
+    of the checkpoint when restarting from an "external" checkpoint.
+
+    If restarting from an "external" checkpoint, tasks that were
+    executing a system call will observe the return value of that
+    system call (as it was when interrupted for the act of taking the
+    checkpoint), and tasks that were executing in user space will be
+    ready to return there.
+
+    Upon successful "external" restart, the container will end up in a
+    frozen state.
+
+The granularity of a checkpoint usually is a whole container. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 in the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+If the caller passes a pid which does not refer to a container's init
+task, then sys_checkpoint() returns -EINVAL. (This is because
+with nested containers a task may belong to more than one container.)
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases,
+if there are other tasks possibly sharing state with the container,
+they must not modify it during the operation. It is the responsibility
+of the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g. not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+Here is a code snippet that illustrates how a checkpoint is initiated
+by a process in a container - the logic is similar to fork():
+	...
+	crid = checkpoint(1, ...);
+	switch (crid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(crid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+Note that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships
+of the task with other tasks, or any shared resources. It is useful
+for applications that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+To illustrate how the API works, refer to these sample programs:
+
+* ckpt.c: accepts a 'pid' argument and checkpoints that task to stdout
+* rstr.c: restarts a checkpoint image from stdin
+* self.c: a simple test program doing self-checkpoint
+* test.c: a simple test program to checkpoint
+
+"External" checkpoint:
+---------------------
+To do "external" checkpoint, you need to first freeze that other task
+either using the freezer cgroup, or by sending SIGSTOP.
+
+Restart does not preserve the original PID yet (because we haven't
+yet solved the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new namespace, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+	$ ./test &
+	[1] 3493
+
+	$ kill -STOP 3493
+	$ ./ckpt 3493 > ckpt.image
+
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ kill -CONT 3493
+
+	$ ./rstr < ckpt.image
+Now compare the output of the two output files.
+
+"Self" checkpoint:
+-----------------
+To do "self" checkpoint, you can incorporate the code from ckpt.c into
+your application.
+
+Here is how to test the "self" checkpoint:
+	$ ./self > self.image &
+	[1] 3512
+
+	$ sleep 3
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ cat /tmp/cr-test.out
+	hello, world (85.46)!
+	count 0 (85.46)!
+	count 1 (85.56)!
+	count 2 (85.76)!
+	count 3 (86.46)!
+
+	$ sed -i 's/count/xxxx/g' /tmp/cr-test.out
+
+	$ ./rstr < self.image &
+Now compare the output of the two output files.
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "rstr" changed its name
+to "test" or "self", as expected.
+
-- 
1.5.4.3


* [RFC v11][PATCH 02/13] Checkpoint/restart: initial documentation
@ 2008-12-05 17:31   ` Oren Laadan
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linus Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy

Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.

Changelog[v8]:
  - Split into multiple files in Documentation/checkpoint/...
  - Extend documentation, fix typos and comments from feedback

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 Documentation/checkpoint/ckpt.c        |   32 ++++++
 Documentation/checkpoint/internals.txt |  133 +++++++++++++++++++++++++
 Documentation/checkpoint/readme.txt    |  105 +++++++++++++++++++
 Documentation/checkpoint/rstr.c        |   20 ++++
 Documentation/checkpoint/security.txt  |   38 +++++++
 Documentation/checkpoint/self.c        |   57 +++++++++++
 Documentation/checkpoint/test.c        |   48 +++++++++
 Documentation/checkpoint/usage.txt     |  171 ++++++++++++++++++++++++++++++++
 8 files changed, 604 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint/ckpt.c
 create mode 100644 Documentation/checkpoint/internals.txt
 create mode 100644 Documentation/checkpoint/readme.txt
 create mode 100644 Documentation/checkpoint/rstr.c
 create mode 100644 Documentation/checkpoint/security.txt
 create mode 100644 Documentation/checkpoint/self.c
 create mode 100644 Documentation/checkpoint/test.c
 create mode 100644 Documentation/checkpoint/usage.txt

diff --git a/Documentation/checkpoint/ckpt.c b/Documentation/checkpoint/ckpt.c
new file mode 100644
index 0000000..094408c
--- /dev/null
+++ b/Documentation/checkpoint/ckpt.c
@@ -0,0 +1,32 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int ret;
+
+	if (argc != 2) {
+		fprintf(stderr, "usage: ckpt PID\n");
+		exit(1);
+	}
+
+	pid = atoi(argv[1]);
+	if (pid <= 0) {
+		fprintf(stderr, "invalid pid\n");
+		exit(1);
+	}
+	/* the image is written to stdout, so diagnostics go to stderr */
+	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+
+	if (ret < 0)
+		perror("checkpoint");
+	else
+		fprintf(stderr, "checkpoint id %d\n", ret);
+
+	return (ret > 0 ? 0 : 1);
+}
+
diff --git a/Documentation/checkpoint/internals.txt b/Documentation/checkpoint/internals.txt
new file mode 100644
index 0000000..b363e83
--- /dev/null
+++ b/Documentation/checkpoint/internals.txt
@@ -0,0 +1,133 @@
+
+	===== Internals of Checkpoint-Restart =====
+
+
+(1) Order of state dump
+
+The order of operations, both save and restore, is as follows:
+
+* Header section: header, container information, etc.
+
+* Global section: [TBD] global resources such as IPC, UTS, etc.
+
+* Process forest: [TBD] tasks and their relationships
+
+* Per task data (for each task):
+  -> task state: elements of task_struct
+  -> thread state: elements of thread_struct and thread_info
+  -> CPU state: registers etc, including FPU
+  -> memory state: memory address space layout and contents
+  -> filesystem state: [TBD] filesystem namespace state, chroot, cwd, etc
+  -> files state: open file descriptors and their state
+  -> signals state: [TBD] pending signals and signal handling state
+  -> credentials state: [TBD] user and group state, statistics
+
+
+(2) Checkpoint image format
+
+The checkpoint image format is composed of records consisting of a
+pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by "struct cr_hdr" as follows:
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+'type' identifies the type of the payload, 'len' tells its length in
+bytes, and 'parent' identifies the owner object instance. The meaning
+of 'parent' varies depending on the type. For example, for CR_HDR_MM,
+'parent' identifies the task to which this MM belongs. The payload
+also varies depending on the type, for instance, the data describing a
+task_struct is given by a 'struct cr_hdr_task' (type CR_HDR_TASK) and
+so on.
+
+The format of the memory dump is as follows: for each VMA, there is a
+'struct cr_hdr_vma'; if the VMA is file-mapped, it is followed by the file
+name. Following comes the actual contents, in one or more chunks: each
+chunk begins with a header that specifies how many pages it holds,
+then the virtual addresses of all the dumped pages in that chunk,
+followed by the actual contents of all the dumped pages. A header with
+zero number of pages marks the end of the contents for a particular
+VMA. Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+cr_hdr + cr_hdr_head
+cr_hdr + cr_hdr_task
+	cr_hdr + cr_hdr_mm
+		cr_hdr + cr_hdr_vma + cr_hdr + string
+			cr_hdr_pgarr (nr_pages = 2)
+			addr1, addr2
+			page1, page2
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_hdr_vma
+			cr_hdr_pgarr (nr_pages = 3)
+			addr3, addr4, addr5
+			page3, page4, page5
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_mm_context
+	cr_hdr + cr_hdr_thread
+	cr_hdr + cr_hdr_cpu
+cr_hdr + cr_hdr_tail
+
+
+(3) Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects and whether they were already saved.  Shared
+objects are stored in a hash table as they appear, indexed by their
+kernel address. (The hash table itself is not saved as part of the
+checkpoint image: it is constructed dynamically during both checkpoint
+and restart, and discarded at the end of the operation).
+
+Each shared object that is found is first looked up in the hash table.
+On the first encounter, the object will not be found, so its state is
+dumped, and the object is assigned a unique identifier and also stored
+in the hash table. Subsequent lookups of that object in the hash table
+will yield that entry, and then only the unique identifier is saved,
+as opposed to the entire state of the object.
+
+During restart, shared objects are seen by their unique identifiers as
+assigned during the checkpoint. Each shared object that is read in is
+first looked up in the hash table. On the first encounter it will not
+be found, meaning that the object needs to be created and its state
+read in and restored. Then the object is added to the hash table, this
+time indexed by its unique identifier. Subsequent lookups of the same
+unique identifier in the hash table will yield that entry, and then
+the existing object instance is reused instead of creating another one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+The interface for the hash table is the following:
+
+cr_obj_get_by_ptr() - find the unique object reference (objref)
+  of the object that is pointed to by ptr [checkpoint]
+
+cr_obj_add_ptr() - add the object pointed to by ptr to the hash table
+  if not already there, and fill its unique object reference (objref)
+
+cr_obj_get_by_ref() - return the pointer to the object whose unique
+  object reference is equal to objref [restart]
+
+cr_obj_add_ref() - add the object with given unique object reference
+  (objref), pointed to by ptr to the hash table. [restart]
+
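The shared-object bookkeeping described in section (3) of internals.txt can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: every sketch_* name is hypothetical, the table size is arbitrary, and the reference counting that the real hash performs on inserted objects is not modelled. It is not the kernel's cr_obj_*() implementation. A table is used in one mode only, keyed by address during checkpoint and by objref during restart.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BUCKETS 64	/* arbitrary table size for the sketch */

struct sketch_obj {
	void *ptr;			/* object address: the key at checkpoint */
	int objref;			/* unique identifier: the key at restart */
	struct sketch_obj *next;	/* next entry in the same bucket */
};

struct sketch_objhash {
	struct sketch_obj *bucket[SKETCH_BUCKETS];
	int next_objref;		/* last identifier handed out */
};

static unsigned int sketch_hash_ptr(const void *ptr)
{
	return (unsigned int)(((uintptr_t)ptr >> 3) % SKETCH_BUCKETS);
}

static unsigned int sketch_hash_ref(int objref)
{
	return (unsigned int)objref % SKETCH_BUCKETS;
}

/* Link a new entry at the head of bucket idx; returns -1 if malloc fails. */
static int sketch_insert(struct sketch_objhash *h, unsigned int idx,
			 void *ptr, int objref)
{
	struct sketch_obj *o = malloc(sizeof(*o));

	if (!o)
		return -1;
	o->ptr = ptr;
	o->objref = objref;
	o->next = h->bucket[idx];
	h->bucket[idx] = o;
	return 0;
}

void sketch_objhash_init(struct sketch_objhash *h)
{
	memset(h, 0, sizeof(*h));
}

/* Entries are only released here, when the whole table is discarded. */
void sketch_objhash_free(struct sketch_objhash *h)
{
	int i;

	for (i = 0; i < SKETCH_BUCKETS; i++) {
		while (h->bucket[i]) {
			struct sketch_obj *o = h->bucket[i];

			h->bucket[i] = o->next;
			free(o);
		}
	}
}

/*
 * Checkpoint side. Returns 1 on the first encounter (the caller dumps the
 * full state), 0 if the object was already seen (the caller saves only
 * *objref), and -1 if allocation fails.
 */
int sketch_add_ptr(struct sketch_objhash *h, void *ptr, int *objref)
{
	unsigned int idx = sketch_hash_ptr(ptr);
	struct sketch_obj *o;

	for (o = h->bucket[idx]; o; o = o->next) {
		if (o->ptr == ptr) {
			*objref = o->objref;
			return 0;
		}
	}
	if (sketch_insert(h, idx, ptr, h->next_objref + 1) < 0)
		return -1;
	*objref = ++h->next_objref;
	return 1;
}

/* Restart side: return the object restored for objref, or NULL if none. */
void *sketch_get_by_ref(struct sketch_objhash *h, int objref)
{
	struct sketch_obj *o;

	for (o = h->bucket[sketch_hash_ref(objref)]; o; o = o->next)
		if (o->objref == objref)
			return o->ptr;
	return NULL;
}

/* Restart side: remember the freshly created object under its objref. */
int sketch_add_ref(struct sketch_objhash *h, int objref, void *ptr)
{
	return sketch_insert(h, sketch_hash_ref(objref), ptr, objref);
}
```

At checkpoint, a return of 1 from sketch_add_ptr() corresponds to "dump the state and remember the object", and a return of 0 to "save only the identifier". At restart, a NULL from sketch_get_by_ref() means the object must be created, restored, and then registered with sketch_add_ref().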
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
new file mode 100644
index 0000000..344a551
--- /dev/null
+++ b/Documentation/checkpoint/readme.txt
@@ -0,0 +1,105 @@
+
+	===== Checkpoint-Restart support in the Linux kernel =====
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl@cs.columbia.edu>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+
+Reviewers:	Serge Hallyn <serue@us.ibm.com>
+		Dave Hansen <dave@linux.vnet.ibm.com>
+
+Application checkpoint/restart [C/R] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. C/R can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relatively opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time.  The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial C/R products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide C/R: sys_checkpoint()
+and sys_restart(). The checkpoint code basically serializes internal
+kernel state and writes it out to a file descriptor, and the resulting
+image is stream-able. More specifically, it consists of 5 steps:
+
+1. Pre-dump
+2. Freeze the container
+3. Dump
+4. Thaw (or kill) the container
+5. Post-dump
+
+Steps 1 and 5 are an optimization to reduce application downtime. In
+particular, "pre-dump" works before freezing the container, e.g. the
+pre-copy for live migration, and "post-dump" works after the container
+resumes execution, e.g. write-back the data to secondary storage.
+
+The restart code basically reads the saved kernel state from a file
+descriptor, and re-creates the tasks and the resources they need to
+resume execution. The restart code is executed by each task that is
+restored in a new container to reconstruct its own state.
+
+
+=== Current Implementation
+
+* How useful is this code as it stands in real-world usage?
+
+Right now, the application must be a single process that does not
+share any resources with other processes. The only file descriptors
+that may be open are simple files and directories; they may not
+include devices, sockets or pipes.
+
+For an "external" checkpoint, the caller must first freeze (or stop)
+the target process. For "self" checkpoint, the application must be
+specifically written to use the new system calls. The restart does not
+yet preserve the pid of the original process, but will use whatever
+pid it was given by the kernel.
+
+What this means in practice is that it is useful for a simple
+application doing computational work and input/output from/to files.
+
+Currently, namespaces are not saved or restored. They will be treated
+as a class of a shared object. In particular, it is assumed that the
+task's file system namespace is the "root" for the entire container.
+It is also assumed that the same file system view is available for the
+restart task(s). Otherwise, a file system snapshot is required.
+
+* What additional work needs to be done to it?
+
+We know this design can work.  We have two commercial products and a
+horde of academic projects doing it today using this basic design.
+We're early in this particular implementation because we're trying to
+release early and often.
+
diff --git a/Documentation/checkpoint/rstr.c b/Documentation/checkpoint/rstr.c
new file mode 100644
index 0000000..288209d
--- /dev/null
+++ b/Documentation/checkpoint/rstr.c
@@ -0,0 +1,20 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+	if (ret < 0)
+		perror("restart");
+
+	printf("should not reach here !\n");
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/security.txt b/Documentation/checkpoint/security.txt
new file mode 100644
index 0000000..e5b4107
--- /dev/null
+++ b/Documentation/checkpoint/security.txt
@@ -0,0 +1,38 @@
+
+	===== Security considerations for Checkpoint-Restart =====
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it. Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+read mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image.  When restoration of credentials
+becomes supported, then definitely the ability of the task that calls
+sys_restart() to setresuid/setresgid to those values must be checked.
+
+Keeping the restart procedure to operate within the limits of the
+caller's credentials means that there are various scenarios that cannot
+be supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+
diff --git a/Documentation/checkpoint/self.c b/Documentation/checkpoint/self.c
new file mode 100644
index 0000000..febb888
--- /dev/null
+++ b/Documentation/checkpoint/self.c
@@ -0,0 +1,57 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	FILE *file;
+	int i, ret;
+	float a;
+
+	close(0);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+
+		if (i == 2) {
+			ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+			if (ret < 0) {
+				fprintf(file, "ckpt: %s\n", strerror(errno));
+				exit(2);
+			}
+			fprintf(file, "checkpoint ret: %d\n", ret);
+			fflush(file);
+		}
+	}
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/test.c b/Documentation/checkpoint/test.c
new file mode 100644
index 0000000..1183655
--- /dev/null
+++ b/Documentation/checkpoint/test.c
@@ -0,0 +1,48 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	FILE *file;
+	float a;
+	int i;
+
+	close(0);
+	close(1);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+	}
+
+	fprintf(file, "world, hello (%.2f) !\n", a);
+	fflush(file);
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..1b42d6b
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,171 @@
+
+	===== How to use Checkpoint-Restart =====
+
+The API consists of two new system calls:
+
+* int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+
+    Checkpoint a container whose init task is identified by pid, to
+    the file designated by fd. 'flags' will have future meaning (must
+    be 0 for now).
+
+    Returns: a positive checkpoint identifier (crid) upon success, 0
+    if it returns from a restart, and -1 if an error occurs.
+
+    'crid' uniquely identifies a checkpoint image. For each checkpoint
+    the kernel allocates a unique 'crid', that remains valid for as
+    long as the checkpoint is kept in the kernel (for instance, when a
+    checkpoint, or a partial checkpoint, may reside in kernel memory).
+
+* int sys_restart(int crid, int fd, unsigned long flags);
+
+    Restart a container from a checkpoint image that is read from the
+    blob stored in the file designated by fd. 'crid' will have future
+    meaning (must be 0 for now). 'flags' will have future meaning
+    (must be 0 for now).
+
+    The role of 'crid' is to identify the checkpoint image in the case
+    that it remains in kernel memory. This will be useful to restart
+    from a checkpoint image that remains in kernel memory.
+
+    Returns: -1 if an error occurs, 0 on success when restarting from
+    a "self" checkpoint, and return value of system call at the time
+    of the checkpoint when restarting from an "external" checkpoint.
+
+    If restarting from an "external" checkpoint, tasks that were
+    executing a system call will observe the return value of that
+    system call (as it was when interrupted for the act of taking the
+    checkpoint), and tasks that were executing in user space will be
+    ready to return there.
+
+    Upon successful "external" restart, the container will end up in a
+    frozen state.
+
+The granularity of a checkpoint usually is a whole container. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 in the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+If the caller passes a pid which does not refer to a container's init
+task, then sys_checkpoint() would return -EINVAL. (This is because
+with nested containers a task may belong to more than one container).
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases,
+if there are other tasks possibly sharing state with the container,
+they must not modify it during the operation. It is the responsibility
+of the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g. not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+Here is a code snippet that illustrates how a checkpoint is initiated
+by a process in a container - the logic is similar to fork():
+	...
+	crid = checkpoint(1, ...);
+	switch (crid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(crid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+Note that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships
+of the task with other tasks, or any shared resources. It is useful
+for applications that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+To illustrate how the API works, refer to these sample programs:
+
+* ckpt.c: accepts a 'pid' argument and checkpoints that task to stdout
+* rstr.c: restarts a checkpoint image from stdin
+* self.c: a simple test program doing self-checkpoint
+* test.c: a simple test program to checkpoint
+
+"External" checkpoint:
+---------------------
+To do "external" checkpoint, you need to first freeze that other task
+either using the freezer cgroup, or by sending SIGSTOP.
+
+Restart does not preserve the original PID yet (because we haven't
+yet solved the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new namespace, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+	$ ./test &
+	[1] 3493
+
+	$ kill -STOP 3493
+	$ ./ckpt 3493 > ckpt.image
+
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ kill -CONT 3493
+
+	$ ./rstr < ckpt.image
+Now compare the output of the two output files.
+
+"Self" checkpoint:
+------------------
+To do "self" checkpoint, you can incorporate the code from ckpt.c into
+your application.
+
+Here is how to test the "self" checkpoint:
+	$ ./self > self.image &
+	[1] 3512
+
+	$ sleep 3
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ cat /tmp/cr-test.out
+	hello, world (85.46)!
+	count 0 (85.46)!
+	count 1 (85.56)!
+	count 2 (85.76)!
+	count 3 (86.46)!
+
+	$ sed -i 's/count/xxxx/g' /tmp/cr-test.out
+
+	$ ./rstr < self.image &
+Now compare the output of the two output files.
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "rstr" changed its name
+to "test" or "self", as expected.
+
-- 
1.5.4.3
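
The fork()-like return convention of sys_checkpoint() described in usage.txt can be wrapped along the following lines. This is a minimal sketch, not part of the patch: the sketch_* names are hypothetical, and the wrapper assumes that __NR_checkpoint is only defined when building against headers from a kernel tree carrying this series. Without those headers it simply fails with ENOSYS.

```c
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

/* The three outcomes documented for sys_checkpoint(). */
enum sketch_ckpt_outcome {
	SKETCH_CKPT_FAILED,	/* -1: an error occurred, errno is set */
	SKETCH_CKPT_RESTARTED,	/*  0: execution resumed after a restart */
	SKETCH_CKPT_TAKEN	/* >0: checkpoint taken, value is the crid */
};

/* Hypothetical wrapper; 'flags' must be 0 for now according to usage.txt. */
long sketch_checkpoint(pid_t pid, int fd, unsigned long flags)
{
#ifdef __NR_checkpoint
	return syscall(__NR_checkpoint, pid, fd, flags);
#else
	/* headers without this patch series: the syscall number is unknown */
	(void)pid;
	(void)fd;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/* Map a raw return value onto the documented three-way convention. */
enum sketch_ckpt_outcome sketch_classify(long ret)
{
	if (ret < 0)
		return SKETCH_CKPT_FAILED;
	if (ret == 0)
		return SKETCH_CKPT_RESTARTED;
	return SKETCH_CKPT_TAKEN;
}
```

A caller would switch on sketch_classify(sketch_checkpoint(pid, fd, 0)) in the same way that the snippet in usage.txt switches on crid.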


* [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
@ 2008-12-05 17:31   ` Oren Laadan
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linus Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
  CR context (a per-checkpoint data structure for housekeeping)
checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
checkpoint/restart.c - input wrappers and basic restart handling

For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Changelog[v10]:
  - add cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
  - force end-of-string in cr_read_string() (fix possible DoS)

Changelog[v9]:
  - cr_kwrite/cr_kread() use file->f_op->write() directly
  - Drop cr_uwrite/cr_uread() since they aren't used anywhere

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (although it's not really needed)

Changelog[v5]:
  - Rename header files s/ckpt/checkpoint/

Changelog[v2]:
  - Added utsname->{release,version,machine} to checkpoint header
  - Pad header structures to 64 bits to ensure compatibility

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 Makefile                       |    2 +-
 checkpoint/Makefile            |    2 +-
 checkpoint/checkpoint.c        |  188 +++++++++++++++++++++++++++++++
 checkpoint/restart.c           |  239 ++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c               |  194 +++++++++++++++++++++++++++++++-
 include/linux/checkpoint.h     |   58 ++++++++++
 include/linux/checkpoint_hdr.h |   76 +++++++++++++
 include/linux/magic.h          |    3 +
 8 files changed, 756 insertions(+), 6 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h

diff --git a/Makefile b/Makefile
index 9a49960..00e402a 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 07d018b..d2df68c 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..fccf723
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,188 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?) */
+static atomic_t cr_ctx_count = ATOMIC_INIT(0);
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+	int ret;
+
+	ret = cr_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_buffer - write a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer pointer
+ * @len: buffer size
+ */
+int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_BUFFER;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, buf);
+}
+
+/**
+ * cr_write_string - write a string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int cr_write_string(struct cr_ctx *ctx, char *str, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_STRING;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h.type = CR_HDR_HEAD;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	do_gettimeofday(&ktv);
+
+	hh->magic = CHECKPOINT_MAGIC_HEAD;
+	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	hh->rev = CR_VERSION;
+
+	hh->flags = ctx->flags;
+	hh->time = ktv.tv_sec;
+
+	uts = utsname();
+	memcpy(hh->release, uts->release, __NEW_UTS_LEN);
+	memcpy(hh->version, uts->version, __NEW_UTS_LEN);
+	memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TAIL;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TASK;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->state = t->state;
+	hh->exit_state = t->exit_state;
+	hh->exit_code = t->exit_code;
+	hh->exit_signal = t->exit_signal;
+
+	hh->task_comm_len = TASK_COMM_LEN;
+
+	/* FIXME: save remaining relevant task_struct fields */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
+{
+	int ret;
+
+	ret = cr_write_task_struct(ctx, t);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_write_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	ctx->crid = atomic_inc_return(&cr_ctx_count);
+
+	/* on success, return (unique) checkpoint identifier */
+	ret = ctx->crid;
+ out:
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..a95d2e8
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,239 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**
+ * cr_read_obj - read a whole record (cr_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ * @len: available buffer size
+ */
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int len)
+{
+	int ret;
+
+	ret = cr_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+
+	cr_debug("type %d len %d parent %d\n", h->type, h->len, h->parent);
+
+	if (h->len < 0 || h->len > len)
+		return -EINVAL;
+
+	return cr_kread(ctx, buf, h->len);
+}
+
+/**
+ * cr_read_obj_type - read a whole record of expected type and size
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: expected record size
+ * @type: expected record type
+ *
+ * Returns objref of the parent object
+ */
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, len);
+	if (ret < 0)
+		return ret;
+
+	if (h.len != len || h.type != type)
+		return -EINVAL;
+
+	return h.parent;
+}
+
+/**
+ * cr_read_buf_type - read a whole record of expected type (unknown size)
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: available buffer size (output: actual record size)
+ * @type: expected record type
+ *
+ * Returns objref of the parent object
+ */
+int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, *len);
+	if (ret < 0)
+		return ret;
+
+	if (h.type != type)
+		return -EINVAL;
+
+	*len = h.len;
+	return h.parent;
+}
+
+/**
+ * cr_read_buffer - read a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer
+ * @len: buffer size (output actual record size)
+ */
+int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len)
+{
+	return cr_read_buf_type(ctx, buf, len, CR_HDR_BUFFER);
+}
+
+/**
+ * cr_read_string - read a string
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @len: string length
+ */
+int cr_read_string(struct cr_ctx *ctx, char *str, int len)
+{
+	int ret;
+
+	ret = cr_read_buf_type(ctx, str, &len, CR_HDR_STRING);
+	if (ret < 0)
+		return ret;
+
+	if (len > 0)
+		str[len - 1] = '\0';	/* always play it safe */
+
+	return ret;
+}
+
+/* read the checkpoint header */
+static int cr_read_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
+	    hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    hh->patch != ((LINUX_VERSION_CODE) & 0xff))
+		goto out;
+
+	if (hh->flags & ~CR_CTX_CKPT)
+		goto out;
+
+	ctx->oflags = hh->flags;
+
+	/* FIX: verify compatibility of release, version and machine */
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the checkpoint trailer */
+static int cr_read_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->magic != CHECKPOINT_MAGIC_TAIL)
+		goto out;
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the task_struct into the current task */
+static int cr_read_task_struct(struct cr_ctx *ctx)
+{
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	char *buf;
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	/* upper limit for task_comm_len to prevent DoS */
+	if (hh->task_comm_len < 0 || hh->task_comm_len > PAGE_SIZE)
+		goto out;
+
+	buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
+	if (!buf)
+		goto out;
+	ret = cr_read_string(ctx, buf, hh->task_comm_len);
+	if (!ret) {
+		/* if t->comm is too long, silently truncate */
+		memset(t->comm, 0, TASK_COMM_LEN);
+		memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
+	}
+	kfree(buf);
+
+	/* FIXME: restore remaining relevant task_struct fields */
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the entire state of the current task */
+static int cr_read_task(struct cr_ctx *ctx)
+{
+	int ret;
+
+	ret = cr_read_task_struct(ctx);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, adjust the return value if needed [TODO] */
+ out:
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 375129c..bd14ef9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -10,6 +10,158 @@
 
 #include <linux/sched.h>
 #include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   cr_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+int cr_kwrite(struct cr_ctx *ctx, void *addr, int count)
+{
+	struct file *file = ctx->file;
+	mm_segment_t fs;
+	ssize_t nwrite;
+	int nleft;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	for (nleft = count; nleft; nleft -= nwrite) {
+		nwrite = file->f_op->write(file, addr, nleft, &file->f_pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		addr += nwrite;
+	}
+	set_fs(fs);
+	ctx->total += count;
+	return 0;
+}
+
+int cr_kread(struct cr_ctx *ctx, void *addr, int count)
+{
+	struct file *file = ctx->file;
+	mm_segment_t fs;
+	ssize_t nread;
+	int nleft;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	for (nleft = count; nleft; nleft -= nread) {
+		nread = file->f_op->read(file, addr, nleft, &file->f_pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE;		/* unexpected EOF */
+			return nread;
+		}
+		addr += nread;
+	}
+	set_fs(fs);
+	ctx->total += count;
+	return 0;
+}
+
+/*
+ * During checkpoint and restart the code writes out/reads in data
+ * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
+ * Because operations can be nested, use cr_hbuf_get() to reserve space
+ * in the buffer, then cr_hbuf_put() when you no longer need that space.
+ */
+
+/*
+ * ctx->hbuf is used to hold headers and data of known (or bound),
+ * static sizes. In some cases, multiple headers may be allocated in
+ * a nested manner. The size should accommodate all headers, nested
+ * or not, on all archs.
+ */
+#define CR_HBUF_TOTAL  (8 * 4096)
+
+/**
+ * cr_hbuf_get - reserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ *
+ * Returns pointer to reserved space
+ */
+void *cr_hbuf_get(struct cr_ctx *ctx, int n)
+{
+	void *ptr;
+
+	/*
+	 * Since requests depend on logic and static header sizes (not on
+	 * user data), space should always suffice, unless someone either
+	 * made a structure bigger or call path deeper than expected.
+	 */
+	BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
+	ptr = ctx->hbuf + ctx->hpos;
+	ctx->hpos += n;
+	return ptr;
+}
+
+/**
+ * cr_hbuf_put - unreserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to unreserve
+ */
+void cr_hbuf_put(struct cr_ctx *ctx, int n)
+{
+	BUG_ON(ctx->hpos < n);
+	ctx->hpos -= n;
+}
+
+/*
+ * helpers to manage C/R contexts: allocated for each checkpoint and/or
+ * restart operation, and persists until the operation is completed.
+ */
+
+static void cr_ctx_free(struct cr_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+	kfree(ctx->hbuf);
+	kfree(ctx);
+}
+
+static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
+{
+	struct cr_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->flags = flags;
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+
+	err = -ENOMEM;
+	ctx->hbuf = kmalloc(CR_HBUF_TOTAL, GFP_KERNEL);
+	if (!ctx->hbuf)
+		goto err;
+
+	return ctx;
+
+ err:
+	cr_ctx_free(ctx);
+	return ERR_PTR(err);
+}
 
 /**
  * sys_checkpoint - checkpoint a container
@@ -22,9 +174,26 @@
  */
 asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 {
-	pr_debug("sys_checkpoint not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_CKPT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx, pid);
+
+	if (!ret)
+		ret = ctx->crid;
+
+	cr_ctx_free(ctx);
+	return ret;
 }
+
 /**
  * sys_restart - restart a container
  * @crid: checkpoint image identifier
@@ -36,6 +205,23 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	pr_debug("sys_restart not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	pid_t pid;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	/* FIXME: for now, we use 'crid' as a pid */
+	pid = (pid_t) crid;
+
+	ret = do_restart(ctx, pid);
+
+	cr_ctx_free(ctx);
+	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..63f298f
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,58 @@
+#ifndef _CHECKPOINT_CKPT_H_
+#define _CHECKPOINT_CKPT_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define CR_VERSION  1
+
+struct cr_ctx {
+	int crid;		/* unique checkpoint id */
+
+	pid_t root_pid;		/* container identifier */
+
+	unsigned long flags;
+	unsigned long oflags;	/* restart: old flags */
+
+	struct file *file;
+	int total;		/* total read/written */
+
+	void *hbuf;		/* temporary buffer for headers */
+	int hpos;		/* position in headers buffer */
+};
+
+/* cr_ctx: flags */
+#define CR_CTX_CKPT	0x1
+#define CR_CTX_RSTR	0x2
+
+extern int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
+
+extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
+extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
+
+struct cr_hdr;
+
+extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
+extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
+extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
+extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
+extern int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type);
+extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
+extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+
+#define cr_debug(fmt, args...)  \
+	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
+
+#endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..257f87f
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,76 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ */
+
+/* records: generic header */
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+/* header types */
+enum {
+	CR_HDR_HEAD = 1,
+	CR_HDR_BUFFER,
+	CR_HDR_STRING,
+
+	CR_HDR_TASK = 101,
+	CR_HDR_THREAD,
+	CR_HDR_CPU,
+
+	CR_HDR_MM = 201,
+	CR_HDR_VMA,
+	CR_HDR_MM_CONTEXT,
+
+	CR_HDR_TAIL = 5001
+};
+
+struct cr_hdr_head {
+	__u64 magic;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 flags;	/* checkpoint options */
+
+	char release[__NEW_UTS_LEN];
+	char version[__NEW_UTS_LEN];
+	char machine[__NEW_UTS_LEN];
+} __attribute__((aligned(8)));
+
+struct cr_hdr_tail {
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_task {
+	__u32 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+
+	__s32 task_comm_len;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index f7f3fdd..5939bbe 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -46,4 +46,7 @@
 #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
 #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
 
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
-- 
1.5.4.3


* [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-12-05 17:31   ` [RFC v11][PATCH 01/13] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart Oren Laadan
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
  CR context (a per-checkpoint data structure for housekeeping)
checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
checkpoint/restart.c - input wrappers and basic restart handling
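
As a rough illustration of the file format those helpers manage: each
record is a small fixed header followed by a payload whose length the
header declares. The userspace sketch below mirrors that framing. The
names 'struct image', rec_write() and rec_read() are hypothetical
stand-ins (an in-memory buffer replaces the checkpoint file) and are not
part of the patch; only the header layout is taken from struct cr_hdr.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Same field layout as struct cr_hdr in the patch. */
struct rec_hdr {
	int16_t  type;
	int16_t  len;
	uint32_t parent;
};

/* Hypothetical in-memory stand-in for the checkpoint image file. */
struct image {
	unsigned char data[256];
	size_t wpos;	/* next byte to write */
	size_t rpos;	/* next byte to read */
};

/* Append one record: the header first, then h->len bytes of payload. */
static int rec_write(struct image *img, const struct rec_hdr *h,
		     const void *buf)
{
	if (h->len < 0 ||
	    img->wpos + sizeof(*h) + (size_t)h->len > sizeof(img->data))
		return -1;
	memcpy(img->data + img->wpos, h, sizeof(*h));
	img->wpos += sizeof(*h);
	memcpy(img->data + img->wpos, buf, (size_t)h->len);
	img->wpos += (size_t)h->len;
	return 0;
}

/*
 * Read one record into buf, which can hold 'cap' bytes. A declared
 * length that is negative or larger than cap is rejected before buf
 * is touched, which is the check cr_read_obj() performs.
 */
static int rec_read(struct image *img, struct rec_hdr *h, void *buf, int cap)
{
	if (img->wpos - img->rpos < sizeof(*h))
		return -1;	/* truncated header */
	memcpy(h, img->data + img->rpos, sizeof(*h));
	if (h->len < 0 || h->len > cap)
		return -1;	/* payload would not fit */
	if (img->wpos - img->rpos - sizeof(*h) < (size_t)h->len)
		return -1;	/* truncated payload */
	memcpy(buf, img->data + img->rpos + sizeof(*h), (size_t)h->len);
	img->rpos += sizeof(*h) + (size_t)h->len;
	return 0;
}
```

Unlike the kernel reader, this sketch leaves the read position unchanged
when a record is rejected, so a caller can retry with a larger buffer.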

For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Changelog[v10]:
  - add cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
  - force end-of-string in cr_read_string() (fix possible DoS)

Changelog[v9]:
  - cr_kwrite/cr_kread() use file->f_op->write() directly
  - Drop cr_uwrite/cr_uread() since they aren't used anywhere

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (although it's not really needed)
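
    The get/put pairing can be pictured as a stack-like reservation
    inside one preallocated buffer. The sketch below is a userspace
    approximation under stated assumptions: 'struct hbuf', hbuf_get()
    and hbuf_put() are hypothetical names, the buffer is deliberately
    tiny, and error returns replace the BUG_ON() checks in the patch.

```c
#include <stddef.h>

#define HBUF_TOTAL 64	/* deliberately tiny; the patch uses 8 * 4096 */

/* Hypothetical stand-in for the hbuf/hpos members of struct cr_ctx. */
struct hbuf {
	unsigned char buf[HBUF_TOTAL];
	int pos;	/* bytes currently reserved */
};

/* Reserve n bytes at the top; NULL where the patch would BUG_ON(). */
static void *hbuf_get(struct hbuf *hb, int n)
{
	void *ptr;

	if (n < 0 || hb->pos + n > HBUF_TOTAL)
		return NULL;
	ptr = hb->buf + hb->pos;
	hb->pos += n;
	return ptr;
}

/* Release the most recent n bytes; -1 where the patch would BUG_ON(). */
static int hbuf_put(struct hbuf *hb, int n)
{
	if (n < 0 || hb->pos < n)
		return -1;
	hb->pos -= n;
	return 0;
}
```

    Nested callers release in the reverse order of reservation, so the
    position returns to zero once the outermost caller is done.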

Changelog[v5]:
  - Rename headers files s/ckpt/checkpoint/

Changelog[v2]:
  - Added utsname->{release,version,machine} to checkpoint header
  - Pad header structures to 64 bits to ensure compatibility

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 Makefile                       |    2 +-
 checkpoint/Makefile            |    2 +-
 checkpoint/checkpoint.c        |  188 +++++++++++++++++++++++++++++++
 checkpoint/restart.c           |  239 ++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c               |  194 +++++++++++++++++++++++++++++++-
 include/linux/checkpoint.h     |   58 ++++++++++
 include/linux/checkpoint_hdr.h |   76 +++++++++++++
 include/linux/magic.h          |    3 +
 8 files changed, 756 insertions(+), 6 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h

diff --git a/Makefile b/Makefile
index 9a49960..00e402a 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 07d018b..d2df68c 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..fccf723
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,188 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?) */
+static atomic_t cr_ctx_count = ATOMIC_INIT(0);
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+	int ret;
+
+	ret = cr_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_buffer - write a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer pointer
+ * @len: buffer size
+ */
+int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_BUFFER;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, buf);
+}
+
+/**
+ * cr_write_string - write a string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int cr_write_string(struct cr_ctx *ctx, char *str, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_STRING;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h.type = CR_HDR_HEAD;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	do_gettimeofday(&ktv);
+
+	hh->magic = CHECKPOINT_MAGIC_HEAD;
+	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	hh->rev = CR_VERSION;
+
+	hh->flags = ctx->flags;
+	hh->time = ktv.tv_sec;
+
+	uts = utsname();
+	memcpy(hh->release, uts->release, __NEW_UTS_LEN);
+	memcpy(hh->version, uts->version, __NEW_UTS_LEN);
+	memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TAIL;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TASK;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->state = t->state;
+	hh->exit_state = t->exit_state;
+	hh->exit_code = t->exit_code;
+	hh->exit_signal = t->exit_signal;
+
+	hh->task_comm_len = TASK_COMM_LEN;
+
+	/* FIXME: save remaining relevant task_struct fields */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
+{
+	int ret;
+
+	ret = cr_write_task_struct(ctx, t);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_write_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	ctx->crid = atomic_inc_return(&cr_ctx_count);
+
+	/* on success, return (unique) checkpoint identifier */
+	ret = ctx->crid;
+ out:
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..a95d2e8
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,239 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**
+ * cr_read_obj - read a whole record (cr_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ * @len: available buffer size
+ */
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int len)
+{
+	int ret;
+
+	ret = cr_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+
+	cr_debug("type %d len %d parent %d\n", h->type, h->len, h->parent);
+
+	if (h->len < 0 || h->len > len)
+		return -EINVAL;
+
+	return cr_kread(ctx, buf, h->len);
+}
+
+/**
+ * cr_read_obj_type - read a whole record of expected type and size
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: expected record size
+ * @type: expected record type
+ *
+ * Returns objref of the parent object
+ */
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, len);
+	if (ret < 0)
+		return ret;
+
+	if (h.len != len || h.type != type)
+		return -EINVAL;
+
+	return h.parent;
+}
+
+/**
+ * cr_read_buf_type - read a whole record of expected type (unknown size)
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: available buffer size (output: actual record size)
+ * @type: expected record type
+ *
+ * Returns objref of the parent object
+ */
+int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, *len);
+	if (ret < 0)
+		return ret;
+
+	if (h.type != type)
+		return -EINVAL;
+
+	*len = h.len;
+	return h.parent;
+}
+
+/**
+ * cr_read_buffer - read a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer
+ * @len: buffer size (output actual record size)
+ */
+int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len)
+{
+	return cr_read_buf_type(ctx, buf, len, CR_HDR_BUFFER);
+}
+
+/**
+ * cr_read_string - read a string
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @len: string length
+ */
+int cr_read_string(struct cr_ctx *ctx, char *str, int len)
+{
+	int ret;
+
+	ret = cr_read_buf_type(ctx, str, &len, CR_HDR_STRING);
+	if (ret < 0)
+		return ret;
+
+	if (len > 0)
+		str[len - 1] = '\0';	/* always play it safe */
+
+	return ret;
+}
+
+/* read the checkpoint header */
+static int cr_read_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
+	    hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    hh->patch != ((LINUX_VERSION_CODE) & 0xff))
+		goto out;
+
+	if (hh->flags & ~CR_CTX_CKPT)
+		goto out;
+
+	ctx->oflags = hh->flags;
+
+	/* FIX: verify compatibility of release, version and machine */
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the checkpoint trailer */
+static int cr_read_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->magic != CHECKPOINT_MAGIC_TAIL)
+		goto out;
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the task_struct into the current task */
+static int cr_read_task_struct(struct cr_ctx *ctx)
+{
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	char *buf;
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	/* upper limit for task_comm_len to prevent DoS */
+	if (hh->task_comm_len < 0 || hh->task_comm_len > PAGE_SIZE)
+		goto out;
+
+	buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
+	if (!buf)
+		goto out;
+	ret = cr_read_string(ctx, buf, hh->task_comm_len);
+	if (!ret) {
+		/* if t->comm is too long, silently truncate */
+		memset(t->comm, 0, TASK_COMM_LEN);
+		memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
+	}
+	kfree(buf);
+
+	/* FIXME: restore remaining relevant task_struct fields */
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the entire state of the current task */
+static int cr_read_task(struct cr_ctx *ctx)
+{
+	int ret;
+
+	ret = cr_read_task_struct(ctx);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, adjust the return value if needed [TODO] */
+ out:
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 375129c..bd14ef9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -10,6 +10,158 @@
 
 #include <linux/sched.h>
 #include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   cr_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+int cr_kwrite(struct cr_ctx *ctx, void *addr, int count)
+{
+	struct file *file = ctx->file;
+	mm_segment_t fs;
+	ssize_t nwrite;
+	int nleft;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	for (nleft = count; nleft; nleft -= nwrite) {
+		nwrite = file->f_op->write(file, addr, nleft, &file->f_pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		addr += nwrite;
+	}
+	set_fs(fs);
+	ctx->total += count;
+	return 0;
+}
+
+int cr_kread(struct cr_ctx *ctx, void *addr, int count)
+{
+	struct file *file = ctx->file;
+	mm_segment_t fs;
+	ssize_t nread;
+	int nleft;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	for (nleft = count; nleft; nleft -= nread) {
+		nread = file->f_op->read(file, addr, nleft, &file->f_pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE;		/* unexpected EOF */
+			return nread;
+		}
+		addr += nread;
+	}
+	set_fs(fs);
+	ctx->total += count;
+	return 0;
+}
+
+/*
+ * During checkpoint and restart the code writes out/reads in data
+ * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
+ * Because operations can be nested, use cr_hbuf_get() to reserve space
+ * in the buffer, then cr_hbuf_put() when you no longer need that space.
+ */
+
+/*
+ * ctx->hbuf is used to hold headers and data of known (or bound),
+ * static sizes. In some cases, multiple headers may be allocated in
+ * a nested manner. The size should accommodate all headers, nested
+ * or not, on all archs.
+ */
+#define CR_HBUF_TOTAL  (8 * 4096)
+
+/**
+ * cr_hbuf_get - reserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ *
+ * Returns pointer to reserved space
+ */
+void *cr_hbuf_get(struct cr_ctx *ctx, int n)
+{
+	void *ptr;
+
+	/*
+	 * Since requests depend on logic and static header sizes (not on
+	 * user data), space should always suffice, unless someone either
+	 * made a structure bigger or call path deeper than expected.
+	 */
+	BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
+	ptr = ctx->hbuf + ctx->hpos;
+	ctx->hpos += n;
+	return ptr;
+}
+
+/**
+ * cr_hbuf_put - unreserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to unreserve
+ */
+void cr_hbuf_put(struct cr_ctx *ctx, int n)
+{
+	BUG_ON(ctx->hpos < n);
+	ctx->hpos -= n;
+}
+
+/*
+ * helpers to manage C/R contexts: allocated for each checkpoint and/or
+ * restart operation, and persists until the operation is completed.
+ */
+
+static void cr_ctx_free(struct cr_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+	kfree(ctx->hbuf);
+	kfree(ctx);
+}
+
+static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
+{
+	struct cr_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->flags = flags;
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+
+	err = -ENOMEM;
+	ctx->hbuf = kmalloc(CR_HBUF_TOTAL, GFP_KERNEL);
+	if (!ctx->hbuf)
+		goto err;
+
+	return ctx;
+
+ err:
+	cr_ctx_free(ctx);
+	return ERR_PTR(err);
+}
 
 /**
  * sys_checkpoint - checkpoint a container
@@ -22,9 +174,26 @@
  */
 asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 {
-	pr_debug("sys_checkpoint not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_CKPT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx, pid);
+
+	if (!ret)
+		ret = ctx->crid;
+
+	cr_ctx_free(ctx);
+	return ret;
 }
+
 /**
  * sys_restart - restart a container
  * @crid: checkpoint image identifier
@@ -36,6 +205,23 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	pr_debug("sys_restart not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	pid_t pid;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	/* FIXME: for now, we use 'crid' as a pid */
+	pid = (pid_t) crid;
+
+	ret = do_restart(ctx, pid);
+
+	cr_ctx_free(ctx);
+	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..63f298f
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,58 @@
+#ifndef _CHECKPOINT_CKPT_H_
+#define _CHECKPOINT_CKPT_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define CR_VERSION  1
+
+struct cr_ctx {
+	int crid;		/* unique checkpoint id */
+
+	pid_t root_pid;		/* container identifier */
+
+	unsigned long flags;
+	unsigned long oflags;	/* restart: old flags */
+
+	struct file *file;
+	int total;		/* total read/written */
+
+	void *hbuf;		/* temporary buffer for headers */
+	int hpos;		/* position in headers buffer */
+};
+
+/* cr_ctx: flags */
+#define CR_CTX_CKPT	0x1
+#define CR_CTX_RSTR	0x2
+
+extern int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
+
+extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
+extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
+
+struct cr_hdr;
+
+extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
+extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
+extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
+extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
+extern int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type);
+extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
+extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+
+#define cr_debug(fmt, args...)  \
+	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
+
+#endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..257f87f
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,76 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ */
+
+/* records: generic header */
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+/* header types */
+enum {
+	CR_HDR_HEAD = 1,
+	CR_HDR_BUFFER,
+	CR_HDR_STRING,
+
+	CR_HDR_TASK = 101,
+	CR_HDR_THREAD,
+	CR_HDR_CPU,
+
+	CR_HDR_MM = 201,
+	CR_HDR_VMA,
+	CR_HDR_MM_CONTEXT,
+
+	CR_HDR_TAIL = 5001
+};
+
+struct cr_hdr_head {
+	__u64 magic;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 flags;	/* checkpoint options */
+
+	char release[__NEW_UTS_LEN];
+	char version[__NEW_UTS_LEN];
+	char machine[__NEW_UTS_LEN];
+} __attribute__((aligned(8)));
+
+struct cr_hdr_tail {
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_task {
+	__u32 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+
+	__s32 task_comm_len;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index f7f3fdd..5939bbe 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -46,4 +46,7 @@
 #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
 #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
 
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
-- 
1.5.4.3
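
The cr_hbuf_get()/cr_hbuf_put() pair in checkpoint/sys.c treats ctx->hbuf
as a small stack: each get advances hpos, and each put must release the
same number of bytes in reverse order. A minimal userspace sketch of that
discipline follows. The names hbuf_model, hbuf_model_init() and friends
are hypothetical and exist only for illustration; assert() stands in for
BUG_ON(), and malloc() for kmalloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define HBUF_TOTAL (8 * 4096)	/* mirrors CR_HBUF_TOTAL in the patch */

/* Hypothetical userspace stand-in for the hbuf fields of struct cr_ctx. */
struct hbuf_model {
	unsigned char *hbuf;	/* backing storage */
	int hpos;		/* current top of the stack */
};

static int hbuf_model_init(struct hbuf_model *m)
{
	m->hbuf = malloc(HBUF_TOTAL);
	m->hpos = 0;
	return m->hbuf ? 0 : -1;
}

static void hbuf_model_destroy(struct hbuf_model *m)
{
	free(m->hbuf);
	m->hbuf = NULL;
	m->hpos = 0;
}

/* Reserve n bytes at the top of the stack; assert() replaces BUG_ON(). */
static void *hbuf_model_get(struct hbuf_model *m, int n)
{
	void *ptr;

	assert(n >= 0 && m->hpos + n <= HBUF_TOTAL);
	ptr = m->hbuf + m->hpos;
	m->hpos += n;
	return ptr;
}

/* Release the n bytes that were reserved most recently. */
static void hbuf_model_put(struct hbuf_model *m, int n)
{
	assert(n >= 0 && m->hpos >= n);
	m->hpos -= n;
}
```

A nested caller would get an outer header, get an inner header, then put
them back in the opposite order, leaving hpos where it started. The sketch
additionally rejects a negative n, which the patch does not check.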



* [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd-r2nGTMty4D4,
	jeremy-TSDbQ3PG+2Y, Oren Laadan

Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
  CR context (a per-checkpoint data structure for housekeeping)
checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
checkpoint/restart.c - input wrappers and basic restart handling

For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Changelog[v10]:
  - add cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
  - force end-of-string in cr_read_string() (fix possible DoS)

Changelog[v9]:
  - cr_kwrite/cr_kread() use file->f_op->write() directly
  - Drop cr_uwrite/cr_uread() since they aren't used anywhere

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (although it's not really needed)

Changelog[v5]:
  - Rename header files s/ckpt/checkpoint/

Changelog[v2]:
  - Added utsname->{release,version,machine} to checkpoint header
  - Pad header structures to 64 bits to ensure compatibility

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 Makefile                       |    2 +-
 checkpoint/Makefile            |    2 +-
 checkpoint/checkpoint.c        |  188 +++++++++++++++++++++++++++++++
 checkpoint/restart.c           |  239 ++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c               |  194 +++++++++++++++++++++++++++++++-
 include/linux/checkpoint.h     |   58 ++++++++++
 include/linux/checkpoint_hdr.h |   76 +++++++++++++
 include/linux/magic.h          |    3 +
 8 files changed, 756 insertions(+), 6 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h

diff --git a/Makefile b/Makefile
index 9a49960..00e402a 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 07d018b..d2df68c 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..fccf723
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,188 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?) */
+static atomic_t cr_ctx_count = ATOMIC_INIT(0);
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+	int ret;
+
+	ret = cr_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_buffer - write a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer pointer
+ * @len: buffer size
+ */
+int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_BUFFER;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, buf);
+}
+
+/**
+ * cr_write_string - write a string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int cr_write_string(struct cr_ctx *ctx, char *str, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_STRING;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h.type = CR_HDR_HEAD;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	do_gettimeofday(&ktv);
+
+	hh->magic = CHECKPOINT_MAGIC_HEAD;
+	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	hh->rev = CR_VERSION;
+
+	hh->flags = ctx->flags;
+	hh->time = ktv.tv_sec;
+
+	uts = utsname();
+	memcpy(hh->release, uts->release, __NEW_UTS_LEN);
+	memcpy(hh->version, uts->version, __NEW_UTS_LEN);
+	memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TAIL;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TASK;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->state = t->state;
+	hh->exit_state = t->exit_state;
+	hh->exit_code = t->exit_code;
+	hh->exit_signal = t->exit_signal;
+
+	hh->task_comm_len = TASK_COMM_LEN;
+
+	/* FIXME: save remaining relevant task_struct fields */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
+{
+	int ret;
+
+	ret = cr_write_task_struct(ctx, t);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_write_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	ctx->crid = atomic_inc_return(&cr_ctx_count);
+
+	/* on success, return (unique) checkpoint identifier */
+	ret = ctx->crid;
+ out:
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..a95d2e8
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,239 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**
+ * cr_read_obj - read a whole record (cr_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ * @len: available buffer size
+ */
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int len)
+{
+	int ret;
+
+	ret = cr_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+
+	cr_debug("type %d len %d parent %d\n", h->type, h->len, h->parent);
+
+	if (h->len < 0 || h->len > len)
+		return -EINVAL;
+
+	return cr_kread(ctx, buf, h->len);
+}
+
+/**
+ * cr_read_obj_type - read a whole record of expected type and size
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: expected record size
+ * @type: expected record type
+ *
+ * Returns objref of the parent object
+ */
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, len);
+	if (ret < 0)
+		return ret;
+
+	if (h.len != len || h.type != type)
+		return -EINVAL;
+
+	return h.parent;
+}
+
+/**
+ * cr_read_buf_type - read a whole record of expected type (unknown size)
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: available buffer size (output: actual record size)
+ * @type: expected record type
+ *
+ * Returns objref of the parent object
+ */
+int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, *len);
+	if (ret < 0)
+		return ret;
+
+	if (h.type != type)
+		return -EINVAL;
+
+	*len = h.len;
+	return h.parent;
+}
+
+/**
+ * cr_read_buffer - read a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer
+ * @len: buffer size (output: actual record size)
+ */
+int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len)
+{
+	return cr_read_buf_type(ctx, buf, len, CR_HDR_BUFFER);
+}
+
+/**
+ * cr_read_string - read a string
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @len: string length
+ */
+int cr_read_string(struct cr_ctx *ctx, char *str, int len)
+{
+	int ret;
+
+	ret = cr_read_buf_type(ctx, str, &len, CR_HDR_STRING);
+	if (ret < 0)
+		return ret;
+
+	if (len > 0)
+		str[len - 1] = '\0';	/* always play it safe */
+
+	return ret;
+}
+
+/* read the checkpoint header */
+static int cr_read_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
+	    hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    hh->patch != ((LINUX_VERSION_CODE) & 0xff))
+		goto out;
+
+	if (hh->flags & ~CR_CTX_CKPT)
+		goto out;
+
+	ctx->oflags = hh->flags;
+
+	/* FIX: verify compatibility of release, version and machine */
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the checkpoint trailer */
+static int cr_read_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->magic != CHECKPOINT_MAGIC_TAIL)
+		goto out;
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the task_struct into the current task */
+static int cr_read_task_struct(struct cr_ctx *ctx)
+{
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	char *buf;
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	/* upper limit for task_comm_len to prevent DoS */
+	if (hh->task_comm_len < 0 || hh->task_comm_len > PAGE_SIZE)
+		goto out;
+
+	buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
+	if (!buf)
+		goto out;
+	ret = cr_read_string(ctx, buf, hh->task_comm_len);
+	if (!ret) {
+		/* if t->comm is too long, silently truncate */
+		memset(t->comm, 0, TASK_COMM_LEN);
+		memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
+	}
+	kfree(buf);
+
+	/* FIXME: restore remaining relevant task_struct fields */
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the entire state of the current task */
+static int cr_read_task(struct cr_ctx *ctx)
+{
+	int ret;
+
+	ret = cr_read_task_struct(ctx);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, adjust the return value if needed [TODO] */
+ out:
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 375129c..bd14ef9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -10,6 +10,158 @@
 
 #include <linux/sched.h>
 #include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   cr_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+int cr_kwrite(struct cr_ctx *ctx, void *addr, int count)
+{
+	struct file *file = ctx->file;
+	mm_segment_t fs;
+	ssize_t nwrite;
+	int nleft;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	for (nleft = count; nleft; nleft -= nwrite) {
+		nwrite = file->f_op->write(file, addr, nleft, &file->f_pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		addr += nwrite;
+	}
+	set_fs(fs);
+	ctx->total += count;
+	return 0;
+}
+
+int cr_kread(struct cr_ctx *ctx, void *addr, int count)
+{
+	struct file *file = ctx->file;
+	mm_segment_t fs;
+	ssize_t nread;
+	int nleft;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	for (nleft = count; nleft; nleft -= nread) {
+		nread = file->f_op->read(file, addr, nleft, &file->f_pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE;		/* unexpected EOF */
+			return nread;
+		}
+		addr += nread;
+	}
+	set_fs(fs);
+	ctx->total += count;
+	return 0;
+}
+
+/*
+ * During checkpoint and restart the code writes out/reads in data
+ * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
+ * Because operations can be nested, use cr_hbuf_get() to reserve space
+ * in the buffer, then cr_hbuf_put() when you no longer need that space.
+ */
+
+/*
+ * ctx->hbuf is used to hold headers and data of known (or bounded),
+ * static sizes. In some cases, multiple headers may be allocated in
+ * a nested manner. The size should accommodate all headers, nested
+ * or not, on all archs.
+ */
+#define CR_HBUF_TOTAL  (8 * 4096)
+
+/**
+ * cr_hbuf_get - reserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ *
+ * Returns pointer to reserved space
+ */
+void *cr_hbuf_get(struct cr_ctx *ctx, int n)
+{
+	void *ptr;
+
+	/*
+	 * Since requests depend on logic and static header sizes (not on
+	 * user data), space should always suffice, unless someone either
+	 * made a structure bigger or call path deeper than expected.
+	 */
+	BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
+	ptr = ctx->hbuf + ctx->hpos;
+	ctx->hpos += n;
+	return ptr;
+}
+
+/**
+ * cr_hbuf_put - unreserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to unreserve
+ */
+void cr_hbuf_put(struct cr_ctx *ctx, int n)
+{
+	BUG_ON(ctx->hpos < n);
+	ctx->hpos -= n;
+}
+
+/*
+ * Helpers to manage C/R contexts: a context is allocated for each
+ * checkpoint or restart operation and persists until it completes.
+ */
+
+static void cr_ctx_free(struct cr_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+	kfree(ctx->hbuf);
+	kfree(ctx);
+}
+
+static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
+{
+	struct cr_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->flags = flags;
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+
+	err = -ENOMEM;
+	ctx->hbuf = kmalloc(CR_HBUF_TOTAL, GFP_KERNEL);
+	if (!ctx->hbuf)
+		goto err;
+
+	return ctx;
+
+ err:
+	cr_ctx_free(ctx);
+	return ERR_PTR(err);
+}
 
 /**
  * sys_checkpoint - checkpoint a container
@@ -22,9 +174,26 @@
  */
 asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 {
-	pr_debug("sys_checkpoint not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_CKPT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx, pid);
+
+	if (!ret)
+		ret = ctx->crid;
+
+	cr_ctx_free(ctx);
+	return ret;
 }
+
 /**
  * sys_restart - restart a container
  * @crid: checkpoint image identifier
@@ -36,6 +205,23 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	pr_debug("sys_restart not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	pid_t pid;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	/* FIXME: for now, we use 'crid' as a pid */
+	pid = (pid_t) crid;
+
+	ret = do_restart(ctx, pid);
+
+	cr_ctx_free(ctx);
+	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..63f298f
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,58 @@
+#ifndef _CHECKPOINT_CKPT_H_
+#define _CHECKPOINT_CKPT_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define CR_VERSION  1
+
+struct cr_ctx {
+	int crid;		/* unique checkpoint id */
+
+	pid_t root_pid;		/* container identifier */
+
+	unsigned long flags;
+	unsigned long oflags;	/* restart: old flags */
+
+	struct file *file;
+	int total;		/* total read/written */
+
+	void *hbuf;		/* temporary buffer for headers */
+	int hpos;		/* position in headers buffer */
+};
+
+/* cr_ctx: flags */
+#define CR_CTX_CKPT	0x1
+#define CR_CTX_RSTR	0x2
+
+extern int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
+
+extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
+extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
+
+struct cr_hdr;
+
+extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
+extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
+extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
+extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
+extern int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type);
+extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
+extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+
+#define cr_debug(fmt, args...)  \
+	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
+
+#endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..257f87f
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,76 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ */
+
+/* records: generic header */
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+/* header types */
+enum {
+	CR_HDR_HEAD = 1,
+	CR_HDR_BUFFER,
+	CR_HDR_STRING,
+
+	CR_HDR_TASK = 101,
+	CR_HDR_THREAD,
+	CR_HDR_CPU,
+
+	CR_HDR_MM = 201,
+	CR_HDR_VMA,
+	CR_HDR_MM_CONTEXT,
+
+	CR_HDR_TAIL = 5001
+};
+
+struct cr_hdr_head {
+	__u64 magic;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 flags;	/* checkpoint options */
+
+	char release[__NEW_UTS_LEN];
+	char version[__NEW_UTS_LEN];
+	char machine[__NEW_UTS_LEN];
+} __attribute__((aligned(8)));
+
+struct cr_hdr_tail {
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_task {
+	__u32 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+
+	__s32 task_comm_len;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index f7f3fdd..5939bbe 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -46,4 +46,7 @@
 #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
 #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
 
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
-- 
1.5.4.3
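
The image format used by cr_write_obj() and cr_read_obj() is a sequence of
records, each a fixed struct cr_hdr followed by h->len bytes of payload.
The reader validates the length against the space it has before touching
the payload. A rough userspace sketch of that framing is below, assuming
an in-memory array in place of the checkpoint file. struct image and the
rec_* and image_* helpers are hypothetical names for illustration, not
part of the patch, and errors are collapsed to -1 rather than errno codes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as struct cr_hdr, using <stdint.h> types. */
struct rec_hdr {
	int16_t type;
	int16_t len;
	uint32_t parent;
};

/* Hypothetical in-memory image standing in for the checkpoint file. */
struct image {
	unsigned char data[256];
	size_t wpos;	/* write offset */
	size_t rpos;	/* read offset */
};

static int image_write(struct image *img, const void *buf, size_t n)
{
	if (n > sizeof(img->data) - img->wpos)
		return -1;	/* image full */
	memcpy(img->data + img->wpos, buf, n);
	img->wpos += n;
	return 0;
}

static int image_read(struct image *img, void *buf, size_t n)
{
	if (n > img->wpos - img->rpos)
		return -1;	/* unexpected end of image */
	memcpy(buf, img->data + img->rpos, n);
	img->rpos += n;
	return 0;
}

/* Header first, then h->len bytes of payload, as cr_write_obj() does. */
static int rec_write(struct image *img, const struct rec_hdr *h,
		     const void *buf)
{
	assert(h->len >= 0);
	if (image_write(img, h, sizeof(*h)) < 0)
		return -1;
	return image_write(img, buf, (size_t)h->len);
}

/* Read the header, reject a length that does not fit, then the payload. */
static int rec_read(struct image *img, struct rec_hdr *h, void *buf,
		    int avail)
{
	if (image_read(img, h, sizeof(*h)) < 0)
		return -1;
	if (h->len < 0 || h->len > avail)
		return -1;
	return image_read(img, buf, (size_t)h->len);
}
```

Checking h->len before the second read is what keeps a malformed image
from overrunning the caller's buffer; the typed wrappers in restart.c
then add the type and exact-size checks on top.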



* [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy

Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
  CR context (a per-checkpoint data structure for housekeeping)
checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
checkpoint/restart.c - input wrappers and basic restart handling

For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Changelog[v10]:
  - add cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
  - force end-of-string in cr_read_string() (fix possible DoS)

Changelog[v9]:
  - cr_kwrite/cr_kread() use file->f_op->write() directly
  - Drop cr_uwrite/cr_uread() since they aren't used anywhere

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (although it's not really needed)

Changelog[v5]:
  - Rename header files s/ckpt/checkpoint/

Changelog[v2]:
  - Added utsname->{release,version,machine} to checkpoint header
  - Pad header structures to 64 bits to ensure compatibility

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 Makefile                       |    2 +-
 checkpoint/Makefile            |    2 +-
 checkpoint/checkpoint.c        |  188 +++++++++++++++++++++++++++++++
 checkpoint/restart.c           |  239 ++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c               |  194 +++++++++++++++++++++++++++++++-
 include/linux/checkpoint.h     |   58 ++++++++++
 include/linux/checkpoint_hdr.h |   76 +++++++++++++
 include/linux/magic.h          |    3 +
 8 files changed, 756 insertions(+), 6 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h

diff --git a/Makefile b/Makefile
index 9a49960..00e402a 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 07d018b..d2df68c 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..fccf723
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,188 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?) */
+static atomic_t cr_ctx_count = ATOMIC_INIT(0);
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+	int ret;
+
+	ret = cr_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_buffer - write a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer pointer
+ * @len: buffer size
+ */
+int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_BUFFER;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, buf);
+}
+
+/**
+ * cr_write_string - write a string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int cr_write_string(struct cr_ctx *ctx, char *str, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_STRING;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h.type = CR_HDR_HEAD;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	do_gettimeofday(&ktv);
+
+	hh->magic = CHECKPOINT_MAGIC_HEAD;
+	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	hh->rev = CR_VERSION;
+
+	hh->flags = ctx->flags;
+	hh->time = ktv.tv_sec;
+
+	uts = utsname();
+	memcpy(hh->release, uts->release, __NEW_UTS_LEN);
+	memcpy(hh->version, uts->version, __NEW_UTS_LEN);
+	memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TAIL;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TASK;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->state = t->state;
+	hh->exit_state = t->exit_state;
+	hh->exit_code = t->exit_code;
+	hh->exit_signal = t->exit_signal;
+
+	hh->task_comm_len = TASK_COMM_LEN;
+
+	/* FIXME: save remaining relevant task_struct fields */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
+{
+	int ret;
+
+	ret = cr_write_task_struct(ctx, t);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_write_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	ctx->crid = atomic_inc_return(&cr_ctx_count);
+
+	/* on success, return (unique) checkpoint identifier */
+	ret = ctx->crid;
+ out:
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..a95d2e8
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,239 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**
+ * cr_read_obj - read a whole record (cr_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ * @len: available buffer size
+ */
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int len)
+{
+	int ret;
+
+	ret = cr_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+
+	cr_debug("type %d len %d parent %d\n", h->type, h->len, h->parent);
+
+	if (h->len < 0 || h->len > len)
+		return -EINVAL;
+
+	return cr_kread(ctx, buf, h->len);
+}
+
+/**
+ * cr_read_obj_type - read a whole record of expected type and size
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: expected record size
+ * @type: expected record type
+ *
+ * Returns objref of the parent object
+ */
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, len);
+	if (ret < 0)
+		return ret;
+
+	if (h.len != len || h.type != type)
+		return -EINVAL;
+
+	return h.parent;
+}
+
+/**
+ * cr_read_buf_type - read a whole record of expected type (unknown size)
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: available buffer size (output: actual record size)
+ * @type: expected record type
+ *
+ * Returns objref of the parent object
+ */
+int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, *len);
+	if (ret < 0)
+		return ret;
+
+	if (h.type != type)
+		return -EINVAL;
+
+	*len = h.len;
+	return h.parent;
+}
+
+/**
+ * cr_read_buffer - read a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer
+ * @len: buffer size (output: actual record size)
+ */
+int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len)
+{
+	return cr_read_buf_type(ctx, buf, len, CR_HDR_BUFFER);
+}
+
+/**
+ * cr_read_string - read a string
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @len: string length
+ */
+int cr_read_string(struct cr_ctx *ctx, char *str, int len)
+{
+	int ret;
+
+	ret = cr_read_buf_type(ctx, str, &len, CR_HDR_STRING);
+	if (ret < 0)
+		return ret;
+
+	if (len > 0)
+		str[len - 1] = '\0';	/* always play it safe */
+
+	return ret;
+}
+
+/* read the checkpoint header */
+static int cr_read_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
+	    hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    hh->patch != ((LINUX_VERSION_CODE) & 0xff))
+		goto out;
+
+	if (hh->flags & ~CR_CTX_CKPT)
+		goto out;
+
+	ctx->oflags = hh->flags;
+
+	/* FIX: verify compatibility of release, version and machine */
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the checkpoint trailer */
+static int cr_read_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->magic != CHECKPOINT_MAGIC_TAIL)
+		goto out;
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the task_struct into the current task */
+static int cr_read_task_struct(struct cr_ctx *ctx)
+{
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	char *buf;
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	/* upper limit for task_comm_len to prevent DoS */
+	if (hh->task_comm_len < 0 || hh->task_comm_len > PAGE_SIZE)
+		goto out;
+
+	buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
+	if (!buf)
+		goto out;
+	ret = cr_read_string(ctx, buf, hh->task_comm_len);
+	if (!ret) {
+		/* if t->comm is too long, silently truncate */
+		memset(t->comm, 0, TASK_COMM_LEN);
+		memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
+	}
+	kfree(buf);
+
+	/* FIXME: restore remaining relevant task_struct fields */
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the entire state of the current task */
+static int cr_read_task(struct cr_ctx *ctx)
+{
+	int ret;
+
+	ret = cr_read_task_struct(ctx);
+	cr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, adjust the return value if needed [TODO] */
+ out:
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 375129c..bd14ef9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -10,6 +10,158 @@
 
 #include <linux/sched.h>
 #include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   cr_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+int cr_kwrite(struct cr_ctx *ctx, void *addr, int count)
+{
+	struct file *file = ctx->file;
+	mm_segment_t fs;
+	ssize_t nwrite;
+	int nleft;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	for (nleft = count; nleft; nleft -= nwrite) {
+		nwrite = file->f_op->write(file, addr, nleft, &file->f_pos);
+		if (nwrite == -EAGAIN)
+			nwrite = 0;
+		if (nwrite < 0)
+			break;
+		addr += nwrite;
+	}
+	set_fs(fs);	/* always restore the address limit */
+	if (nleft)
+		return nwrite;
+	ctx->total += count;
+	return 0;
+}
+
+int cr_kread(struct cr_ctx *ctx, void *addr, int count)
+{
+	struct file *file = ctx->file;
+	mm_segment_t fs;
+	ssize_t nread;
+	int nleft;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	for (nleft = count; nleft; nleft -= nread) {
+		nread = file->f_op->read(file, addr, nleft, &file->f_pos);
+		if (nread == -EAGAIN) {
+			nread = 0;
+			continue;
+		}
+		if (nread <= 0)
+			break;
+		addr += nread;
+	}
+	set_fs(fs);	/* always restore the address limit */
+	if (nleft)
+		return nread ? nread : -EPIPE;	/* 0 means unexpected EOF */
+	ctx->total += count;
+	return 0;
+}
+
+/*
+ * During checkpoint and restart the code writes out/reads in data
+ * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
+ * Because operations can be nested, use cr_hbuf_get() to reserve space
+ * in the buffer, then cr_hbuf_put() when you no longer need that space.
+ */
+
+/*
+ * ctx->hbuf is used to hold headers and data of known (or bound),
+ * static sizes. In some cases, multiple headers may be allocated in
+ * a nested manner. The size should accommodate all headers, nested
+ * or not, on all archs.
+ */
+#define CR_HBUF_TOTAL  (8 * 4096)
+
+/**
+ * cr_hbuf_get - reserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ *
+ * Returns pointer to reserved space
+ */
+void *cr_hbuf_get(struct cr_ctx *ctx, int n)
+{
+	void *ptr;
+
+	/*
+	 * Since requests depend on logic and static header sizes (not on
+	 * user data), space should always suffice, unless someone either
+	 * made a structure bigger or call path deeper than expected.
+	 */
+	BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
+	ptr = ctx->hbuf + ctx->hpos;
+	ctx->hpos += n;
+	return ptr;
+}
+
+/**
+ * cr_hbuf_put - unreserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to unreserve
+ */
+void cr_hbuf_put(struct cr_ctx *ctx, int n)
+{
+	BUG_ON(ctx->hpos < n);
+	ctx->hpos -= n;
+}
+
+/*
+ * helpers to manage C/R contexts: a context is allocated for each checkpoint
+ * and/or restart operation, and persists until the operation is completed.
+ */
+
+static void cr_ctx_free(struct cr_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+	kfree(ctx->hbuf);
+	kfree(ctx);
+}
+
+static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
+{
+	struct cr_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->flags = flags;
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+
+	err = -ENOMEM;
+	ctx->hbuf = kmalloc(CR_HBUF_TOTAL, GFP_KERNEL);
+	if (!ctx->hbuf)
+		goto err;
+
+	return ctx;
+
+ err:
+	cr_ctx_free(ctx);
+	return ERR_PTR(err);
+}
 
 /**
  * sys_checkpoint - checkpoint a container
@@ -22,9 +174,26 @@
  */
 asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 {
-	pr_debug("sys_checkpoint not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_CKPT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx, pid);
+
+	if (!ret)
+		ret = ctx->crid;
+
+	cr_ctx_free(ctx);
+	return ret;
 }
+
 /**
  * sys_restart - restart a container
  * @crid: checkpoint image identifier
@@ -36,6 +205,23 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	pr_debug("sys_restart not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	pid_t pid;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	/* FIXME: for now, we use 'crid' as a pid */
+	pid = (pid_t) crid;
+
+	ret = do_restart(ctx, pid);
+
+	cr_ctx_free(ctx);
+	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..63f298f
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,58 @@
+#ifndef _CHECKPOINT_CKPT_H_
+#define _CHECKPOINT_CKPT_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define CR_VERSION  1
+
+struct cr_ctx {
+	int crid;		/* unique checkpoint id */
+
+	pid_t root_pid;		/* container identifier */
+
+	unsigned long flags;
+	unsigned long oflags;	/* restart: old flags */
+
+	struct file *file;
+	int total;		/* total read/written */
+
+	void *hbuf;		/* temporary buffer for headers */
+	int hpos;		/* position in headers buffer */
+};
+
+/* cr_ctx: flags */
+#define CR_CTX_CKPT	0x1
+#define CR_CTX_RSTR	0x2
+
+extern int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
+
+extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
+extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
+
+struct cr_hdr;
+
+extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
+extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
+extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
+extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
+extern int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type);
+extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
+extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+
+#define cr_debug(fmt, args...)  \
+	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
+
+#endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..257f87f
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,76 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ */
+
+/* records: generic header */
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+/* header types */
+enum {
+	CR_HDR_HEAD = 1,
+	CR_HDR_BUFFER,
+	CR_HDR_STRING,
+
+	CR_HDR_TASK = 101,
+	CR_HDR_THREAD,
+	CR_HDR_CPU,
+
+	CR_HDR_MM = 201,
+	CR_HDR_VMA,
+	CR_HDR_MM_CONTEXT,
+
+	CR_HDR_TAIL = 5001
+};
+
+struct cr_hdr_head {
+	__u64 magic;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 flags;	/* checkpoint options */
+
+	char release[__NEW_UTS_LEN];
+	char version[__NEW_UTS_LEN];
+	char machine[__NEW_UTS_LEN];
+} __attribute__((aligned(8)));
+
+struct cr_hdr_tail {
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_task {
+	__u32 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+
+	__s32 task_comm_len;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index f7f3fdd..5939bbe 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -46,4 +46,7 @@
 #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
 #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
 
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
-- 
1.5.4.3
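
For reviewers who want to poke at the image format outside the kernel:
below is a minimal user-space sketch of the record framing that
cr_write_obj() and cr_read_obj() implement (a struct cr_hdr followed by
h->len payload bytes, with the length validated against the caller's
buffer on read). It is not part of the patch. struct rec_hdr mirrors
struct cr_hdr; struct stream, stream_write(), stream_read(), rec_write()
and rec_read() are made-up names, and the in-memory buffer only stands in
for the image file.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* mirrors struct cr_hdr from the patch */
struct rec_hdr {
	int16_t type;
	int16_t len;
	uint32_t parent;
};

/* hypothetical in-memory stand-in for the checkpoint image file */
struct stream {
	unsigned char data[256];
	size_t wpos;	/* next byte to write */
	size_t rpos;	/* next byte to read */
};

static int stream_write(struct stream *s, const void *buf, size_t n)
{
	if (n > sizeof(s->data) - s->wpos)
		return -ENOSPC;
	memcpy(s->data + s->wpos, buf, n);
	s->wpos += n;
	return 0;
}

static int stream_read(struct stream *s, void *buf, size_t n)
{
	if (n > s->wpos - s->rpos)
		return -EPIPE;	/* unexpected end of image */
	memcpy(buf, s->data + s->rpos, n);
	s->rpos += n;
	return 0;
}

/* write the header, then the payload (same order as cr_write_obj()) */
int rec_write(struct stream *s, int16_t type, const void *buf, int16_t len)
{
	struct rec_hdr h = { .type = type, .len = len, .parent = 0 };
	int ret;

	assert(len >= 0);
	ret = stream_write(s, &h, sizeof(h));
	if (ret < 0)
		return ret;
	return stream_write(s, buf, (size_t)len);
}

/* read the header, check h->len against the buffer, then read the payload */
int rec_read(struct stream *s, struct rec_hdr *h, void *buf, int len)
{
	int ret = stream_read(s, h, sizeof(*h));

	if (ret < 0)
		return ret;
	if (h->len < 0 || h->len > len)
		return -EINVAL;
	return stream_read(s, buf, (size_t)h->len);
}
```

Writing a record with rec_write() and reading it back with rec_read()
into a large enough buffer returns 0 and fills in the header; a buffer
smaller than h->len gets -EINVAL before any payload is consumed, and a
truncated image gets -EPIPE.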


* [RFC v11][PATCH 04/13] x86 support for checkpoint/restart
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2008-12-05 17:31   ` [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 05/13] Dump memory address space Oren Laadan
                     ` (13 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

In addition, architecture capabilities are saved in an architecture
specific extension of the header (cr_hdr_head_arch); currently this
includes only FPU capabilities.

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.
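
To illustrate what the arch header makes possible at restart time, here
is a small user-space sketch of a capability check. It is an
illustration only, not code from this series: struct head_arch mirrors
struct cr_hdr_head_arch, head_arch_check() is a hypothetical name, and
the policy (require identical FPU capabilities, since the image carries
a raw copy of thread.xstate) is an assumption rather than the behavior
of the restart code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* mirrors struct cr_hdr_head_arch from the patch */
struct head_arch {
	uint16_t has_fxsr;
	uint16_t has_xsave;
	uint16_t xstate_size;
	uint16_t pad;
};

/*
 * Assumed policy: the saved FPU state is copied verbatim, so the host
 * must use the same save format and the same state size as the machine
 * that took the checkpoint. Returns 0 if compatible, -EINVAL otherwise.
 */
int head_arch_check(const struct head_arch *saved,
		    const struct head_arch *host)
{
	assert(saved && host);

	if (!!saved->has_fxsr != !!host->has_fxsr)
		return -EINVAL;
	if (!!saved->has_xsave != !!host->has_xsave)
		return -EINVAL;
	if (saved->xstate_size != host->xstate_size)
		return -EINVAL;
	return 0;
}
```

A looser policy (for example, allowing restart on a host with a superset
of features) would need a format conversion step for the saved state,
which this sketch does not attempt.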

Changelog[v9]:
  - Add arch-specific header that details architecture capabilities;
    split FPU restore to send capabilities only once.
  - Test for zero TLS entries in cr_write_thread()
  - Fix asm/checkpoint_hdr.h so it can be included from user-space

Changelog[v7]:
  - Fix save/restore state of FPU

Changelog[v5]:
  - Remove preempt_disable() when restoring debug registers

Changelog[v4]:
  - Fix header structure alignment

Changelog[v2]:
  - Pad header structures to 64 bits to ensure compatibility
  - Follow Dave Hansen's refactoring of the original post

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/checkpoint_hdr.h |   85 ++++++++++++
 arch/x86/mm/Makefile                  |    2 +
 arch/x86/mm/checkpoint.c              |  223 +++++++++++++++++++++++++++++++
 arch/x86/mm/restart.c                 |  232 +++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c               |   19 +++-
 checkpoint/checkpoint_arch.h          |    9 ++
 checkpoint/restart.c                  |   17 ++-
 include/linux/checkpoint_hdr.h        |    2 +
 8 files changed, 583 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
 create mode 100644 arch/x86/mm/checkpoint.c
 create mode 100644 arch/x86/mm/restart.c
 create mode 100644 checkpoint/checkpoint_arch.h

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..6325062
--- /dev/null
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -0,0 +1,85 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+
+/* i387 structure seen from kernel/userspace */
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+struct cr_hdr_head_arch {
+	/* FIXME: add HAVE_HWFP */
+
+	__u16 has_fxsr;
+	__u16 has_xsave;
+	__u16 xstate_size;
+	__u16 _pading;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_thread {
+	/* FIXME: restart blocks */
+
+	__s16 gdt_entry_tls_entries;
+	__s16 sizeof_tls_array;
+	__s16 ntls;	/* number of TLS entries to follow */
+} __attribute__((aligned(8)));
+
+struct cr_hdr_cpu {
+	/* see struct pt_regs (x86-64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 cs;
+	__u64 flags;
+	__u64 sp;
+	__u64 ss;
+
+	/* segment registers */
+	__u64 ds;
+	__u64 es;
+	__u64 fs;
+	__u64 gs;
+
+	/* debug registers */
+	__u64 debugreg0;
+	__u64 debugreg1;
+	__u64 debugreg2;
+	__u64 debugreg3;
+	__u64 debugreg4;
+	__u64 debugreg5;
+	__u64 debugreg6;
+	__u64 debugreg7;
+
+	__u32 uses_debug;
+	__u32 used_math;
+
+	/* thread_xstate contents follow (if used_math) */
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_X86_CKPT_HDR_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index fea4565..6527ea2 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -18,3 +18,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..8dd6d2d
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,223 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct thread_struct *thread;
+	struct desc_struct *desc;
+	int ntls = 0;
+	int n, ret;
+
+	h.type = CR_HDR_THREAD;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	thread = &t->thread;
+
+	/* calculate no. of TLS entries that follow */
+	desc = thread->tls_array;
+	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+		if (desc->a || desc->b)
+			ntls++;
+	}
+
+	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	hh->sizeof_tls_array = sizeof(thread->tls_array);
+	hh->ntls = ntls;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	cr_debug("ntls %d\n", ntls);
+	if (ntls == 0)
+		return 0;
+
+	/* for simplicity dump the entire array, cherry-pick upon restart */
+	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+	/* IGNORE RESTART BLOCKS FOR NOW ... */
+
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static void cr_save_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	hh->bp = regs->bp;
+	hh->bx = regs->bx;
+	hh->ax = regs->ax;
+	hh->cx = regs->cx;
+	hh->dx = regs->dx;
+	hh->si = regs->si;
+	hh->di = regs->di;
+	hh->orig_ax = regs->orig_ax;
+	hh->ip = regs->ip;
+	hh->cs = regs->cs;
+	hh->flags = regs->flags;
+	hh->sp = regs->sp;
+	hh->ss = regs->ss;
+
+	hh->ds = regs->ds;
+	hh->es = regs->es;
+
+	/*
+	 * for checkpoint in process context (from within a container)
+	 * the GS and FS registers should be saved from the hardware;
+	 * otherwise they are already saved on the thread structure
+	 */
+	if (t == current) {
+		savesegment(gs, hh->gs);
+		savesegment(fs, hh->fs);
+	} else {
+		hh->gs = thread->gs;
+		hh->fs = thread->fs;
+	}
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into the orig_eax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON(hh->orig_ax < 0);
+		hh->ax = 0;
+	}
+}
+
+static void cr_save_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* debug regs */
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * get the actual registers; otherwise get the saved values.
+	 */
+
+	if (t == current) {
+		get_debugreg(hh->debugreg0, 0);
+		get_debugreg(hh->debugreg1, 1);
+		get_debugreg(hh->debugreg2, 2);
+		get_debugreg(hh->debugreg3, 3);
+		get_debugreg(hh->debugreg6, 6);
+		get_debugreg(hh->debugreg7, 7);
+	} else {
+		hh->debugreg0 = thread->debugreg0;
+		hh->debugreg1 = thread->debugreg1;
+		hh->debugreg2 = thread->debugreg2;
+		hh->debugreg3 = thread->debugreg3;
+		hh->debugreg6 = thread->debugreg6;
+		hh->debugreg7 = thread->debugreg7;
+	}
+
+	hh->debugreg4 = 0;
+	hh->debugreg5 = 0;
+
+	hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
+}
+
+static void cr_save_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	hh->used_math = tsk_used_math(t) ? 1 : 0;
+}
+
+static int cr_write_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
+
+	/* i387 + MMX + SSE logic */
+	preempt_disable();	/* needed if (t == current) */
+
+	/*
+	 * normally, no need to unlazy_fpu(), since the TS_USEDFPU flag
+	 * has been cleared when the task was context-switched out...
+	 * except if we are in process context, in which case we do
+	 */
+	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
+		unlazy_fpu(current);
+
+	memcpy(xstate_buf, t->thread.xstate, xstate_size);
+	preempt_enable();	/* needed if (t == current) */
+
+	return cr_kwrite(ctx, xstate_buf, xstate_size);
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_CPU;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	cr_save_cpu_regs(hh, t);
+	cr_save_cpu_debug(hh, t);
+	cr_save_cpu_fpu(hh, t);
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		goto out;
+
+	if (hh->used_math)
+		ret = cr_write_cpu_fpu(ctx, t);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_write_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_HEAD_ARCH;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	/* FPU capabilities */
+	hh->has_fxsr = cpu_has_fxsr;
+	hh->has_xsave = cpu_has_xsave;
+	hh->xstate_size = xstate_size;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
new file mode 100644
index 0000000..45ad790
--- /dev/null
+++ b/arch/x86/mm/restart.c
@@ -0,0 +1,232 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* read the thread_struct into the current task */
+int cr_read_thread(struct cr_ctx *ctx)
+{
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	struct thread_struct *thread = &t->thread;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		goto out;
+#endif
+	cr_debug("ntls %d\n", hh->ntls);
+
+	if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
+	    hh->sizeof_tls_array != sizeof(thread->tls_array) ||
+	    hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
+		goto out;
+
+	ret = 0;
+	if (hh->ntls > 0) {
+		struct desc_struct *desc;
+		int size, cpu;
+
+		/*
+		 * restore TLS by hand: why convert to struct user_desc if
+		 * sys_set_thread_area() will convert it back ?
+		 */
+
+		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
+		desc = kmalloc(size, GFP_KERNEL);
+		ret = -ENOMEM;
+		if (!desc)
+			goto out;
+		ret = cr_kread(ctx, desc, size);
+		if (ret >= 0) {
+			/*
+			 * FIX: add sanity checks (eg. that values make
+			 * sense, that we don't overwrite old values, etc)
+			 */
+			cpu = get_cpu();
+			memcpy(thread->tls_array, desc, size);
+			load_TLS(thread, cpu);
+			put_cpu();
+		}
+		kfree(desc);
+	}
+
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static int cr_load_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	regs->bx = hh->bx;
+	regs->cx = hh->cx;
+	regs->dx = hh->dx;
+	regs->si = hh->si;
+	regs->di = hh->di;
+	regs->bp = hh->bp;
+	regs->ax = hh->ax;
+	regs->ds = hh->ds;
+	regs->es = hh->es;
+	regs->orig_ax = hh->orig_ax;
+	regs->ip = hh->ip;
+	regs->cs = hh->cs;
+	regs->flags = hh->flags;
+	regs->sp = hh->sp;
+	regs->ss = hh->ss;
+
+	thread->gs = hh->gs;
+	thread->fs = hh->fs;
+	loadsegment(gs, hh->gs);
+	loadsegment(fs, hh->fs);
+
+	return 0;
+}
+
+static int cr_load_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	/* debug regs */
+
+	if (hh->uses_debug) {
+		set_debugreg(hh->debugreg0, 0);
+		set_debugreg(hh->debugreg1, 1);
+		/* ignore 4, 5 */
+		set_debugreg(hh->debugreg2, 2);
+		set_debugreg(hh->debugreg3, 3);
+		set_debugreg(hh->debugreg6, 6);
+		set_debugreg(hh->debugreg7, 7);
+	}
+
+	return 0;
+}
+
+static int cr_load_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	preempt_disable();
+
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!hh->used_math)
+		clear_used_math();
+
+	preempt_enable();
+	return 0;
+}
+
+static int cr_read_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
+	int ret;
+
+	ret = cr_kread(ctx, xstate_buf, xstate_size);
+	if (ret < 0)
+		goto out;
+
+	/* i387 + MMX + SSE */
+	preempt_disable();
+	/* init_fpu() also calls set_used_math() */
+	ret = init_fpu(current);
+	if (ret < 0)
+		goto out_enable;
+
+	memcpy(t->thread.xstate, xstate_buf, xstate_size);
+ out_enable:
+	preempt_enable();
+ out:
+	cr_hbuf_put(ctx, xstate_size);
+	return ret;
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* read the cpu state and registers for the current task */
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		goto out;
+#endif
+	/* FIX: sanity check for sensitive registers (eg. eflags) */
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	ret = cr_load_cpu_regs(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_load_cpu_debug(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_load_cpu_fpu(hh, t);
+	if (ret < 0)
+		goto out;
+
+	if (hh->used_math)
+		ret = cr_read_cpu_fpu(ctx, t);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_read_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = 0;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD_ARCH);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	/* FIX: verify compatibility of architecture features */
+
+	/* verify FPU capabilities */
+	if (hh->has_fxsr != cpu_has_fxsr ||
+	    hh->has_xsave != cpu_has_xsave ||
+	    hh->xstate_size != xstate_size)
+		ret = -EINVAL;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index fccf723..17cc8d2 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -20,6 +20,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /* unique checkpoint identifier (FIXME: should be per-container ?) */
 static atomic_t cr_ctx_count = ATOMIC_INIT(0);
 
@@ -105,7 +107,10 @@ static int cr_write_head(struct cr_ctx *ctx)
 
 	ret = cr_write_obj(ctx, &h, hh);
 	cr_hbuf_put(ctx, sizeof(*hh));
-	return ret;
+	if (ret < 0)
+		return ret;
+
+	return cr_write_head_arch(ctx);
 }
 
 /* write the checkpoint trailer */
@@ -160,8 +165,16 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	int ret;
 
 	ret = cr_write_task_struct(ctx, t);
-	cr_debug("ret %d\n", ret);
-
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_thread(ctx, t);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_cpu(ctx, t);
+	cr_debug("cpu: ret %d\n", ret);
+ out:
 	return ret;
 }
 
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
new file mode 100644
index 0000000..ada1369
--- /dev/null
+++ b/checkpoint/checkpoint_arch.h
@@ -0,0 +1,9 @@
+#include <linux/checkpoint.h>
+
+extern int cr_write_head_arch(struct cr_ctx *ctx);
+extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+
+extern int cr_read_head_arch(struct cr_ctx *ctx);
+extern int cr_read_thread(struct cr_ctx *ctx);
+extern int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index a95d2e8..d74d755 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -15,6 +15,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /**
  * cr_read_obj - read a whole record (cr_hdr followed by payload)
  * @ctx: checkpoint context
@@ -142,9 +144,9 @@ static int cr_read_head(struct cr_ctx *ctx)
 
 	ctx->oflags = hh->flags;
 
-	/* FIX: verify compatibility of release, version and machine */
+	/* FIX: verify compatibility of release, version */
 
-	ret = 0;
+	ret = cr_read_head_arch(ctx);
  out:
 	cr_hbuf_put(ctx, sizeof(*hh));
 	return ret;
@@ -214,8 +216,17 @@ static int cr_read_task(struct cr_ctx *ctx)
 	int ret;
 
 	ret = cr_read_task_struct(ctx);
-	cr_debug("ret %d\n", ret);
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_thread(ctx);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_cpu(ctx);
+	cr_debug("cpu: ret %d\n", ret);
 
+ out:
 	return ret;
 }
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 257f87f..b74b5f9 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -12,6 +12,7 @@
 
 #include <linux/types.h>
 #include <linux/utsname.h>
+#include <asm/checkpoint_hdr.h>
 
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
@@ -30,6 +31,7 @@ struct cr_hdr {
 /* header types */
 enum {
 	CR_HDR_HEAD = 1,
+	CR_HDR_HEAD_ARCH,
 	CR_HDR_BUFFER,
 	CR_HDR_STRING,
 
-- 
1.5.4.3
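
For readers following the thread-state path: cr_write_thread() counts the
TLS descriptors in use and records the array geometry, and cr_read_thread()
refuses an image whose geometry does not match the running kernel. Below is
a minimal userspace sketch of that pair, not the kernel code itself.
TLS_ENTRIES, struct desc, struct hdr_thread, count_tls(), fill_hdr() and
check_hdr() are illustrative stand-ins (the value 3 for TLS_ENTRIES is an
assumption); they only mirror the fields used in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_ENTRIES	3	/* assumed stand-in for GDT_ENTRY_TLS_ENTRIES */

struct desc {			/* stand-in for struct desc_struct */
	uint32_t a, b;
};

#define TLS_ARRAY_SIZE	((int)(sizeof(struct desc) * TLS_ENTRIES))

struct hdr_thread {		/* mirrors the fields of struct cr_hdr_thread */
	int16_t gdt_entry_tls_entries;
	int16_t sizeof_tls_array;
	int16_t ntls;		/* number of TLS entries in use */
};

/* checkpoint side: a descriptor counts as "in use" if either word is set */
static int count_tls(const struct desc *tls)
{
	int n, ntls = 0;

	assert(tls != NULL);
	for (n = 0; n < TLS_ENTRIES; n++)
		if (tls[n].a || tls[n].b)
			ntls++;
	return ntls;
}

/* checkpoint side: record the array geometry next to the count */
static void fill_hdr(struct hdr_thread *hh, const struct desc *tls)
{
	hh->gdt_entry_tls_entries = TLS_ENTRIES;
	hh->sizeof_tls_array = TLS_ARRAY_SIZE;
	hh->ntls = (int16_t)count_tls(tls);
}

/* restart side: reject images whose TLS layout differs from this build */
static int check_hdr(const struct hdr_thread *hh)
{
	if (hh->gdt_entry_tls_entries != TLS_ENTRIES ||
	    hh->sizeof_tls_array != TLS_ARRAY_SIZE ||
	    hh->ntls < 0 || hh->ntls > TLS_ENTRIES)
		return -EINVAL;
	return 0;
}
```

Filling a header from an array with one non-zero descriptor and passing it
to check_hdr() should return 0; setting ntls past TLS_ENTRIES afterwards
should return -EINVAL. Checking the geometry before reading the payload is
what lets the restart side copy the whole array with a fixed size.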

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 04/13] x86 support for checkpoint/restart
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

In addition, architecture capabilities are saved in an architecture
specific extension of the header (cr_hdr_head_arch). Currently this
includes only FPU capabilities.

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.

Changelog[v9]:
  - Add arch-specific header that details architecture capabilities;
    split FPU restore to send capabilities only once.
  - Test for zero TLS entries in cr_write_thread()
  - Fix asm/checkpoint_hdr.h so it can be included from user-space

Changelog[v7]:
  - Fix save/restore state of FPU

Changelog[v5]:
  - Remove preempt_disable() when restoring debug registers

Changelog[v4]:
  - Fix header structure alignment

Changelog[v2]:
  - Pad header structures to 64 bits to ensure compatibility
  - Follow Dave Hansen's refactoring of the original post

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/checkpoint_hdr.h |   85 ++++++++++++
 arch/x86/mm/Makefile                  |    2 +
 arch/x86/mm/checkpoint.c              |  223 +++++++++++++++++++++++++++++++
 arch/x86/mm/restart.c                 |  232 +++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c               |   19 +++-
 checkpoint/checkpoint_arch.h          |    9 ++
 checkpoint/restart.c                  |   17 ++-
 include/linux/checkpoint_hdr.h        |    2 +
 8 files changed, 583 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
 create mode 100644 arch/x86/mm/checkpoint.c
 create mode 100644 arch/x86/mm/restart.c
 create mode 100644 checkpoint/checkpoint_arch.h
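
The changelog mentions padding header structures to 64 bits, and the arch
header carries FPU capabilities that restart compares against the host. A
small userspace sketch of both ideas follows, assuming GCC or Clang for
__attribute__((aligned)) and C11 for _Static_assert. The struct layouts and
field names mirror the patch; head_arch_compatible() is a hypothetical
helper name, and the kernel's cpu_has_fxsr, cpu_has_xsave and xstate_size
are replaced here by a plain "host" struct.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* userspace mirror of struct cr_hdr_head_arch: 4 x 16 bits, exactly 8 bytes */
struct head_arch {
	uint16_t has_fxsr;
	uint16_t has_xsave;
	uint16_t xstate_size;
	uint16_t _pading;	/* spelled as in the patch */
} __attribute__((aligned(8)));

/* userspace mirror of struct cr_hdr_thread: 6 bytes of fields */
struct hdr_thread {
	int16_t gdt_entry_tls_entries;
	int16_t sizeof_tls_array;
	int16_t ntls;
} __attribute__((aligned(8)));

/* aligned(8) rounds the size up, so records stay 64-bit multiples */
_Static_assert(sizeof(struct head_arch) == 8, "head_arch is 8 bytes");
_Static_assert(sizeof(struct hdr_thread) == 8, "hdr_thread is padded to 8");

/*
 * Restart-side check: the saved FPU capabilities must equal the host's,
 * because the raw xstate blob that follows is copied without conversion.
 */
static int head_arch_compatible(const struct head_arch *saved,
				const struct head_arch *host)
{
	assert(saved != NULL && host != NULL);
	if (saved->has_fxsr != host->has_fxsr ||
	    saved->has_xsave != host->has_xsave ||
	    saved->xstate_size != host->xstate_size)
		return -EINVAL;
	return 0;
}
```

Two identical capability sets should compare as compatible (0); changing
any single field, for instance xstate_size, should give -EINVAL. The
compiler-inserted tail padding in hdr_thread is unnamed, so a writer that
wants deterministic image bytes would still need to zero the record first.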

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..6325062
--- /dev/null
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -0,0 +1,85 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+
+/* i387 structure seen from kernel/userspace */
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+struct cr_hdr_head_arch {
+	/* FIXME: add HAVE_HWFP */
+
+	__u16 has_fxsr;
+	__u16 has_xsave;
+	__u16 xstate_size;
+	__u16 _pading;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_thread {
+	/* FIXME: restart blocks */
+
+	__s16 gdt_entry_tls_entries;
+	__s16 sizeof_tls_array;
+	__s16 ntls;	/* number of TLS entries to follow */
+} __attribute__((aligned(8)));
+
+struct cr_hdr_cpu {
+	/* see struct pt_regs (x86-64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 cs;
+	__u64 flags;
+	__u64 sp;
+	__u64 ss;
+
+	/* segment registers */
+	__u64 ds;
+	__u64 es;
+	__u64 fs;
+	__u64 gs;
+
+	/* debug registers */
+	__u64 debugreg0;
+	__u64 debugreg1;
+	__u64 debugreg2;
+	__u64 debugreg3;
+	__u64 debugreg4;
+	__u64 debugreg5;
+	__u64 debugreg6;
+	__u64 debugreg7;
+
+	__u32 uses_debug;
+	__u32 used_math;
+
+	/* thread_xstate contents follow (if used_math) */
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_X86_CKPT_HDR_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index fea4565..6527ea2 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -18,3 +18,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..8dd6d2d
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,223 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct thread_struct *thread;
+	struct desc_struct *desc;
+	int ntls = 0;
+	int n, ret;
+
+	h.type = CR_HDR_THREAD;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	thread = &t->thread;
+
+	/* calculate no. of TLS entries that follow */
+	desc = thread->tls_array;
+	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+		if (desc->a || desc->b)
+			ntls++;
+	}
+
+	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	hh->sizeof_tls_array = sizeof(thread->tls_array);
+	hh->ntls = ntls;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	cr_debug("ntls %d\n", ntls);
+	if (ntls == 0)
+		return 0;
+
+	/* for simplicity dump the entire array, cherry-pick upon restart */
+	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+	/* IGNORE RESTART BLOCKS FOR NOW ... */
+
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static void cr_save_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	hh->bp = regs->bp;
+	hh->bx = regs->bx;
+	hh->ax = regs->ax;
+	hh->cx = regs->cx;
+	hh->dx = regs->dx;
+	hh->si = regs->si;
+	hh->di = regs->di;
+	hh->orig_ax = regs->orig_ax;
+	hh->ip = regs->ip;
+	hh->cs = regs->cs;
+	hh->flags = regs->flags;
+	hh->sp = regs->sp;
+	hh->ss = regs->ss;
+
+	hh->ds = regs->ds;
+	hh->es = regs->es;
+
+	/*
+	 * for checkpoint in process context (from within a container)
+	 * the GS and FS registers should be saved from the hardware;
+	 * otherwise they are already saved on the thread structure
+	 */
+	if (t == current) {
+		savesegment(gs, hh->gs);
+		savesegment(fs, hh->fs);
+	} else {
+		hh->gs = thread->gs;
+		hh->fs = thread->fs;
+	}
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) subtitute the future return value (0) of
+	 * this syscall into the orig_eax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON((long)regs->orig_ax < 0);
+		hh->ax = 0;
+	}
+}
+
+static void cr_save_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* debug regs */
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * get the actual registers; otherwise get the saved values.
+	 */
+
+	if (t == current) {
+		get_debugreg(hh->debugreg0, 0);
+		get_debugreg(hh->debugreg1, 1);
+		get_debugreg(hh->debugreg2, 2);
+		get_debugreg(hh->debugreg3, 3);
+		get_debugreg(hh->debugreg6, 6);
+		get_debugreg(hh->debugreg7, 7);
+	} else {
+		hh->debugreg0 = thread->debugreg0;
+		hh->debugreg1 = thread->debugreg1;
+		hh->debugreg2 = thread->debugreg2;
+		hh->debugreg3 = thread->debugreg3;
+		hh->debugreg6 = thread->debugreg6;
+		hh->debugreg7 = thread->debugreg7;
+	}
+
+	hh->debugreg4 = 0;
+	hh->debugreg5 = 0;
+
+	hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
+}
+
+static void cr_save_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	hh->used_math = tsk_used_math(t) ? 1 : 0;
+}
+
+static int cr_write_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
+
+	/* i387 + MMX + SSE logic */
+	preempt_disable();	/* needed if (t == current) */
+
+	/*
+	 * normally, no need to unlazy_fpu(), since the TS_USEDFPU flag
+	 * would have been cleared when the task was context-switched
+	 * out... except if we are in process context, in which case we do
+	 */
+	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
+		unlazy_fpu(current);
+
+	memcpy(xstate_buf, t->thread.xstate, xstate_size);
+	preempt_enable();	/* needed if (t == current) */
+
+	return cr_kwrite(ctx, xstate_buf, xstate_size);
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_CPU;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	cr_save_cpu_regs(hh, t);
+	cr_save_cpu_debug(hh, t);
+	cr_save_cpu_fpu(hh, t);
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		goto out;
+
+	if (hh->used_math)
+		ret = cr_write_cpu_fpu(ctx, t);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_write_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_HEAD_ARCH;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	/* FPU capabilities */
+	hh->has_fxsr = cpu_has_fxsr;
+	hh->has_xsave = cpu_has_xsave;
+	hh->xstate_size = xstate_size;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
new file mode 100644
index 0000000..45ad790
--- /dev/null
+++ b/arch/x86/mm/restart.c
@@ -0,0 +1,232 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* read the thread_struct into the current task */
+int cr_read_thread(struct cr_ctx *ctx)
+{
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	struct thread_struct *thread = &t->thread;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		goto out;
+#endif
+	cr_debug("ntls %d\n", hh->ntls);
+
+	if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
+	    hh->sizeof_tls_array != sizeof(thread->tls_array) ||
+	    hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
+		goto out;
+
+	if (hh->ntls > 0) {
+		struct desc_struct *desc;
+		int size, cpu;
+
+		/*
+		 * restore TLS by hand: why convert to struct user_desc if
+		 * sys_set_thread_area() will convert it back ?
+		 */
+
+		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
+		desc = kmalloc(size, GFP_KERNEL);
+		if (!desc)
+			return -ENOMEM;
+
+		ret = cr_kread(ctx, desc, size);
+		if (ret >= 0) {
+			/*
+			 * FIX: add sanity checks (e.g. that values make
+			 * sense, that we don't overwrite old values, etc.)
+			 */
+			cpu = get_cpu();
+			memcpy(thread->tls_array, desc, size);
+			load_TLS(thread, cpu);
+			put_cpu();
+		}
+		kfree(desc);
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static int cr_load_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	regs->bx = hh->bx;
+	regs->cx = hh->cx;
+	regs->dx = hh->dx;
+	regs->si = hh->si;
+	regs->di = hh->di;
+	regs->bp = hh->bp;
+	regs->ax = hh->ax;
+	regs->ds = hh->ds;
+	regs->es = hh->es;
+	regs->orig_ax = hh->orig_ax;
+	regs->ip = hh->ip;
+	regs->cs = hh->cs;
+	regs->flags = hh->flags;
+	regs->sp = hh->sp;
+	regs->ss = hh->ss;
+
+	thread->gs = hh->gs;
+	thread->fs = hh->fs;
+	loadsegment(gs, hh->gs);
+	loadsegment(fs, hh->fs);
+
+	return 0;
+}
+
+static int cr_load_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	/* debug regs */
+
+	if (hh->uses_debug) {
+		set_debugreg(hh->debugreg0, 0);
+		set_debugreg(hh->debugreg1, 1);
+		/* ignore 4, 5 */
+		set_debugreg(hh->debugreg2, 2);
+		set_debugreg(hh->debugreg3, 3);
+		set_debugreg(hh->debugreg6, 6);
+		set_debugreg(hh->debugreg7, 7);
+	}
+
+	return 0;
+}
+
+static int cr_load_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	preempt_disable();
+
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!hh->used_math)
+		clear_used_math();
+
+	preempt_enable();
+	return 0;
+}
+
+static int cr_read_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
+	int ret;
+
+	ret = cr_kread(ctx, xstate_buf, xstate_size);
+	if (ret < 0)
+		goto out;
+
+	/* i387 + MMX + SSE */
+	preempt_disable();
+
+	/* init_fpu() also calls set_used_math() */
+	ret = init_fpu(current);
+	if (ret == 0)
+		memcpy(t->thread.xstate, xstate_buf, xstate_size);
+
+	/* re-enable preemption and release the buffer on every path */
+	preempt_enable();
+ out:
+	cr_hbuf_put(ctx, xstate_size);
+	return ret;
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* read the cpu state and registers for the current task */
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		goto out;
+#endif
+	/* FIX: sanity check for sensitive registers (eg. eflags) */
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	ret = cr_load_cpu_regs(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_load_cpu_debug(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_load_cpu_fpu(hh, t);
+	if (ret < 0)
+		goto out;
+
+	if (hh->used_math)
+		ret = cr_read_cpu_fpu(ctx, t);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_read_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = 0;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD_ARCH);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	/* FIX: verify compatibility of architecture features */
+
+	/* verify FPU capabilities */
+	if (hh->has_fxsr != cpu_has_fxsr ||
+	    hh->has_xsave != cpu_has_xsave ||
+	    hh->xstate_size != xstate_size)
+		ret = -EINVAL;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index fccf723..17cc8d2 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -20,6 +20,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /* unique checkpoint identifier (FIXME: should be per-container ?) */
 static atomic_t cr_ctx_count = ATOMIC_INIT(0);
 
@@ -105,7 +107,10 @@ static int cr_write_head(struct cr_ctx *ctx)
 
 	ret = cr_write_obj(ctx, &h, hh);
 	cr_hbuf_put(ctx, sizeof(*hh));
-	return ret;
+	if (ret < 0)
+		return ret;
+
+	return cr_write_head_arch(ctx);
 }
 
 /* write the checkpoint trailer */
@@ -160,8 +165,16 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	int ret;
 
 	ret = cr_write_task_struct(ctx, t);
-	cr_debug("ret %d\n", ret);
-
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_thread(ctx, t);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_cpu(ctx, t);
+	cr_debug("cpu: ret %d\n", ret);
+ out:
 	return ret;
 }
 
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
new file mode 100644
index 0000000..ada1369
--- /dev/null
+++ b/checkpoint/checkpoint_arch.h
@@ -0,0 +1,9 @@
+#include <linux/checkpoint.h>
+
+extern int cr_write_head_arch(struct cr_ctx *ctx);
+extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+
+extern int cr_read_head_arch(struct cr_ctx *ctx);
+extern int cr_read_thread(struct cr_ctx *ctx);
+extern int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index a95d2e8..d74d755 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -15,6 +15,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /**
  * cr_read_obj - read a whole record (cr_hdr followed by payload)
  * @ctx: checkpoint context
@@ -142,9 +144,9 @@ static int cr_read_head(struct cr_ctx *ctx)
 
 	ctx->oflags = hh->flags;
 
-	/* FIX: verify compatibility of release, version and machine */
+	/* FIX: verify compatibility of release, version */
 
-	ret = 0;
+	ret = cr_read_head_arch(ctx);
  out:
 	cr_hbuf_put(ctx, sizeof(*hh));
 	return ret;
@@ -214,8 +216,17 @@ static int cr_read_task(struct cr_ctx *ctx)
 	int ret;
 
 	ret = cr_read_task_struct(ctx);
-	cr_debug("ret %d\n", ret);
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_thread(ctx);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_cpu(ctx);
+	cr_debug("cpu: ret %d\n", ret);
 
+ out:
 	return ret;
 }
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 257f87f..b74b5f9 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -12,6 +12,7 @@
 
 #include <linux/types.h>
 #include <linux/utsname.h>
+#include <asm/checkpoint_hdr.h>
 
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
@@ -30,6 +31,7 @@ struct cr_hdr {
 /* header types */
 enum {
 	CR_HDR_HEAD = 1,
+	CR_HDR_HEAD_ARCH,
 	CR_HDR_BUFFER,
 	CR_HDR_STRING,
 
-- 
1.5.4.3

* [RFC v11][PATCH 04/13] x86 support for checkpoint/restart
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

In addition, architecture capabilities are saved in an
architecture-specific extension of the header (cr_hdr_head_arch);
currently this includes only FPU capabilities.
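
As a rough illustration of the capability check, here is a minimal
user-space sketch, not part of the patch. It assumes a tool that has
already read the arch header out of an image; 'struct head_arch' mirrors
cr_hdr_head_arch with stdint types, and head_arch_compatible() is a
hypothetical helper that applies the same all-fields-equal test that
cr_read_head_arch() performs in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative user-space mirror of cr_hdr_head_arch (field names kept). */
struct head_arch {
	uint16_t has_fxsr;
	uint16_t has_xsave;
	uint16_t xstate_size;
	uint16_t _pading;
} __attribute__((aligned(8)));

/*
 * Hypothetical helper: return 0 if an image whose header is 'saved' may be
 * restarted on a machine described by 'local', or -1 otherwise. Every FPU
 * capability field must match exactly; the padding field is ignored.
 */
int head_arch_compatible(const struct head_arch *saved,
			 const struct head_arch *local)
{
	if (saved->has_fxsr != local->has_fxsr ||
	    saved->has_xsave != local->has_xsave ||
	    saved->xstate_size != local->xstate_size)
		return -1;
	return 0;
}
```

A strict equality test is the simplest safe policy: the raw xstate
buffer is copied verbatim on restart, so any difference in layout or
size could not be restored as is.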

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.

Changelog[v9]:
  - Add arch-specific header that details architecture capabilities;
    split FPU restore to send capabilities only once.
  - Test for zero TLS entries in cr_write_thread()
  - Fix asm/checkpoint_hdr.h so it can be included from user-space

Changelog[v7]:
  - Fix save/restore state of FPU

Changelog[v5]:
  - Remove preempt_disable() when restoring debug registers

Changelog[v4]:
  - Fix header structure alignment

Changelog[v2]:
  - Pad header structures to 64 bits to ensure compatibility
  - Follow Dave Hansen's refactoring of the original post
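
To illustrate the 64-bit padding mentioned above, here is a small
user-space sketch (an assumption-laden illustration, not code from the
patch). Both structs mirror the three 16-bit fields of cr_hdr_thread;
only the second carries the aligned(8) attribute that the patch uses.
is_64bit_multiple() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Three 16-bit fields with no alignment attribute: 6 bytes. */
struct thread_hdr_plain {
	int16_t gdt_entry_tls_entries;
	int16_t sizeof_tls_array;
	int16_t ntls;
};

/* Same fields with the attribute used in the patch (GCC/Clang syntax). */
struct thread_hdr_aligned {
	int16_t gdt_entry_tls_entries;
	int16_t sizeof_tls_array;
	int16_t ntls;
} __attribute__((aligned(8)));

/* Hypothetical helper: is a record size a whole number of 64-bit words? */
int is_64bit_multiple(size_t size)
{
	return size % 8 == 0;
}
```

With the attribute, sizeof() is rounded up to a multiple of 8, so a
record has the same length whether it is produced by a 32-bit or a
64-bit build.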

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/checkpoint_hdr.h |   85 ++++++++++++
 arch/x86/mm/Makefile                  |    2 +
 arch/x86/mm/checkpoint.c              |  223 +++++++++++++++++++++++++++++++
 arch/x86/mm/restart.c                 |  232 +++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c               |   19 +++-
 checkpoint/checkpoint_arch.h          |    9 ++
 checkpoint/restart.c                  |   17 ++-
 include/linux/checkpoint_hdr.h        |    2 +
 8 files changed, 583 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
 create mode 100644 arch/x86/mm/checkpoint.c
 create mode 100644 arch/x86/mm/restart.c
 create mode 100644 checkpoint/checkpoint_arch.h

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..6325062
--- /dev/null
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -0,0 +1,85 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+
+/* i387 structure seen from kernel/userspace */
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+struct cr_hdr_head_arch {
+	/* FIXME: add HAVE_HWFP */
+
+	__u16 has_fxsr;
+	__u16 has_xsave;
+	__u16 xstate_size;
+	__u16 _pading;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_thread {
+	/* FIXME: restart blocks */
+
+	__s16 gdt_entry_tls_entries;
+	__s16 sizeof_tls_array;
+	__s16 ntls;	/* number of TLS entries to follow */
+} __attribute__((aligned(8)));
+
+struct cr_hdr_cpu {
+	/* see struct pt_regs (x86-64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 cs;
+	__u64 flags;
+	__u64 sp;
+	__u64 ss;
+
+	/* segment registers */
+	__u64 ds;
+	__u64 es;
+	__u64 fs;
+	__u64 gs;
+
+	/* debug registers */
+	__u64 debugreg0;
+	__u64 debugreg1;
+	__u64 debugreg2;
+	__u64 debugreg3;
+	__u64 debugreg4;
+	__u64 debugreg5;
+	__u64 debugreg6;
+	__u64 debugreg7;
+
+	__u32 uses_debug;
+	__u32 used_math;
+
+	/* thread_xstate contents follow (if used_math) */
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_X86_CKPT_HDR_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index fea4565..6527ea2 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -18,3 +18,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..8dd6d2d
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,223 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct thread_struct *thread;
+	struct desc_struct *desc;
+	int ntls = 0;
+	int n, ret;
+
+	h.type = CR_HDR_THREAD;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	thread = &t->thread;
+
+	/* calculate no. of TLS entries that follow */
+	desc = thread->tls_array;
+	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+		if (desc->a || desc->b)
+			ntls++;
+	}
+
+	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	hh->sizeof_tls_array = sizeof(thread->tls_array);
+	hh->ntls = ntls;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	cr_debug("ntls %d\n", ntls);
+	if (ntls == 0)
+		return 0;
+
+	/* for simplicity dump the entire array, cherry-pick upon restart */
+	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+	/* IGNORE RESTART BLOCKS FOR NOW ... */
+
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static void cr_save_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	hh->bp = regs->bp;
+	hh->bx = regs->bx;
+	hh->ax = regs->ax;
+	hh->cx = regs->cx;
+	hh->dx = regs->dx;
+	hh->si = regs->si;
+	hh->di = regs->di;
+	hh->orig_ax = regs->orig_ax;
+	hh->ip = regs->ip;
+	hh->cs = regs->cs;
+	hh->flags = regs->flags;
+	hh->sp = regs->sp;
+	hh->ss = regs->ss;
+
+	hh->ds = regs->ds;
+	hh->es = regs->es;
+
+	/*
+	 * for checkpoint in process context (from within a container)
+	 * the GS and FS registers should be saved from the hardware;
+	 * otherwise they are already saved in the thread structure
+	 */
+	if (t == current) {
+		savesegment(gs, hh->gs);
+		savesegment(fs, hh->fs);
+	} else {
+		hh->gs = thread->gs;
+		hh->fs = thread->fs;
+	}
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into ax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON((long)regs->orig_ax < 0);
+		hh->ax = 0;
+	}
+}
+
+static void cr_save_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* debug regs */
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * get the actual registers; otherwise get the saved values.
+	 */
+
+	if (t == current) {
+		get_debugreg(hh->debugreg0, 0);
+		get_debugreg(hh->debugreg1, 1);
+		get_debugreg(hh->debugreg2, 2);
+		get_debugreg(hh->debugreg3, 3);
+		get_debugreg(hh->debugreg6, 6);
+		get_debugreg(hh->debugreg7, 7);
+	} else {
+		hh->debugreg0 = thread->debugreg0;
+		hh->debugreg1 = thread->debugreg1;
+		hh->debugreg2 = thread->debugreg2;
+		hh->debugreg3 = thread->debugreg3;
+		hh->debugreg6 = thread->debugreg6;
+		hh->debugreg7 = thread->debugreg7;
+	}
+
+	hh->debugreg4 = 0;
+	hh->debugreg5 = 0;
+
+	hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
+}
+
+static void cr_save_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	hh->used_math = tsk_used_math(t) ? 1 : 0;
+}
+
+static int cr_write_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
+
+	/* i387 + MMX + SSE logic */
+	preempt_disable();	/* needed if (t == current) */
+
+	/*
+	 * normally, no need to unlazy_fpu(), since the TS_USEDFPU flag
+	 * would have been cleared when the task was context-switched
+	 * out... except if we are in process context, in which case we do
+	 */
+	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
+		unlazy_fpu(current);
+
+	memcpy(xstate_buf, t->thread.xstate, xstate_size);
+	preempt_enable();	/* needed if (t == current) */
+
+	return cr_kwrite(ctx, xstate_buf, xstate_size);
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_CPU;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	cr_save_cpu_regs(hh, t);
+	cr_save_cpu_debug(hh, t);
+	cr_save_cpu_fpu(hh, t);
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		goto out;
+
+	if (hh->used_math)
+		ret = cr_write_cpu_fpu(ctx, t);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_write_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_HEAD_ARCH;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	/* FPU capabilities */
+	hh->has_fxsr = cpu_has_fxsr;
+	hh->has_xsave = cpu_has_xsave;
+	hh->xstate_size = xstate_size;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
new file mode 100644
index 0000000..45ad790
--- /dev/null
+++ b/arch/x86/mm/restart.c
@@ -0,0 +1,232 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* read the thread_struct into the current task */
+int cr_read_thread(struct cr_ctx *ctx)
+{
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	struct thread_struct *thread = &t->thread;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		goto out;
+#endif
+	cr_debug("ntls %d\n", hh->ntls);
+
+	if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
+	    hh->sizeof_tls_array != sizeof(thread->tls_array) ||
+	    hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
+		goto out;
+
+	if (hh->ntls > 0) {
+		struct desc_struct *desc;
+		int size, cpu;
+
+		/*
+		 * restore TLS by hand: why convert to struct user_desc if
+		 * sys_set_thread_area() will convert it back ?
+		 */
+
+		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
+		desc = kmalloc(size, GFP_KERNEL);
+		if (!desc)
+			return -ENOMEM;
+
+		ret = cr_kread(ctx, desc, size);
+		if (ret >= 0) {
+			/*
+			 * FIX: add sanity checks (e.g. that values make
+			 * sense, that we don't overwrite old values, etc.)
+			 */
+			cpu = get_cpu();
+			memcpy(thread->tls_array, desc, size);
+			load_TLS(thread, cpu);
+			put_cpu();
+		}
+		kfree(desc);
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static int cr_load_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	regs->bx = hh->bx;
+	regs->cx = hh->cx;
+	regs->dx = hh->dx;
+	regs->si = hh->si;
+	regs->di = hh->di;
+	regs->bp = hh->bp;
+	regs->ax = hh->ax;
+	regs->ds = hh->ds;
+	regs->es = hh->es;
+	regs->orig_ax = hh->orig_ax;
+	regs->ip = hh->ip;
+	regs->cs = hh->cs;
+	regs->flags = hh->flags;
+	regs->sp = hh->sp;
+	regs->ss = hh->ss;
+
+	thread->gs = hh->gs;
+	thread->fs = hh->fs;
+	loadsegment(gs, hh->gs);
+	loadsegment(fs, hh->fs);
+
+	return 0;
+}
+
+static int cr_load_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	/* debug regs */
+
+	if (hh->uses_debug) {
+		set_debugreg(hh->debugreg0, 0);
+		set_debugreg(hh->debugreg1, 1);
+		/* ignore 4, 5 */
+		set_debugreg(hh->debugreg2, 2);
+		set_debugreg(hh->debugreg3, 3);
+		set_debugreg(hh->debugreg6, 6);
+		set_debugreg(hh->debugreg7, 7);
+	}
+
+	return 0;
+}
+
+static int cr_load_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	preempt_disable();
+
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!hh->used_math)
+		clear_used_math();
+
+	preempt_enable();
+	return 0;
+}
+
+static int cr_read_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
+	int ret;
+
+	ret = cr_kread(ctx, xstate_buf, xstate_size);
+	if (ret < 0)
+		goto out;
+
+	/* i387 + MMX + SSE */
+	preempt_disable();
+
+	/* init_fpu() also calls set_used_math() */
+	ret = init_fpu(current);
+	if (ret == 0)
+		memcpy(t->thread.xstate, xstate_buf, xstate_size);
+
+	/* re-enable preemption and release the buffer on every path */
+	preempt_enable();
+ out:
+	cr_hbuf_put(ctx, xstate_size);
+	return ret;
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* read the cpu state and registers for the current task */
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		goto out;
+#endif
+	/* FIX: sanity check for sensitive registers (eg. eflags) */
+
+	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	ret = cr_load_cpu_regs(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_load_cpu_debug(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_load_cpu_fpu(hh, t);
+	if (ret < 0)
+		goto out;
+
+	if (hh->used_math)
+		ret = cr_read_cpu_fpu(ctx, t);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_read_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = 0;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD_ARCH);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	/* FIX: verify compatibility of architecture features */
+
+	/* verify FPU capabilities */
+	if (hh->has_fxsr != cpu_has_fxsr ||
+	    hh->has_xsave != cpu_has_xsave ||
+	    hh->xstate_size != xstate_size)
+		ret = -EINVAL;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index fccf723..17cc8d2 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -20,6 +20,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /* unique checkpoint identifier (FIXME: should be per-container ?) */
 static atomic_t cr_ctx_count = ATOMIC_INIT(0);
 
@@ -105,7 +107,10 @@ static int cr_write_head(struct cr_ctx *ctx)
 
 	ret = cr_write_obj(ctx, &h, hh);
 	cr_hbuf_put(ctx, sizeof(*hh));
-	return ret;
+	if (ret < 0)
+		return ret;
+
+	return cr_write_head_arch(ctx);
 }
 
 /* write the checkpoint trailer */
@@ -160,8 +165,16 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	int ret;
 
 	ret = cr_write_task_struct(ctx, t);
-	cr_debug("ret %d\n", ret);
-
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_thread(ctx, t);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_cpu(ctx, t);
+	cr_debug("cpu: ret %d\n", ret);
+ out:
 	return ret;
 }
 
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
new file mode 100644
index 0000000..ada1369
--- /dev/null
+++ b/checkpoint/checkpoint_arch.h
@@ -0,0 +1,9 @@
+#include <linux/checkpoint.h>
+
+extern int cr_write_head_arch(struct cr_ctx *ctx);
+extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+
+extern int cr_read_head_arch(struct cr_ctx *ctx);
+extern int cr_read_thread(struct cr_ctx *ctx);
+extern int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index a95d2e8..d74d755 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -15,6 +15,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /**
  * cr_read_obj - read a whole record (cr_hdr followed by payload)
  * @ctx: checkpoint context
@@ -142,9 +144,9 @@ static int cr_read_head(struct cr_ctx *ctx)
 
 	ctx->oflags = hh->flags;
 
-	/* FIX: verify compatibility of release, version and machine */
+	/* FIX: verify compatibility of release, version */
 
-	ret = 0;
+	ret = cr_read_head_arch(ctx);
  out:
 	cr_hbuf_put(ctx, sizeof(*hh));
 	return ret;
@@ -214,8 +216,17 @@ static int cr_read_task(struct cr_ctx *ctx)
 	int ret;
 
 	ret = cr_read_task_struct(ctx);
-	cr_debug("ret %d\n", ret);
+	cr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_thread(ctx);
+	cr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_cpu(ctx);
+	cr_debug("cpu: ret %d\n", ret);
 
+ out:
 	return ret;
 }
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 257f87f..b74b5f9 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -12,6 +12,7 @@
 
 #include <linux/types.h>
 #include <linux/utsname.h>
+#include <asm/checkpoint_hdr.h>
 
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
@@ -30,6 +31,7 @@ struct cr_hdr {
 /* header types */
 enum {
 	CR_HDR_HEAD = 1,
+	CR_HDR_HEAD_ARCH,
 	CR_HDR_BUFFER,
 	CR_HDR_STRING,
 
-- 
1.5.4.3

* [RFC v11][PATCH 05/13] Dump memory address space
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2008-12-05 17:31   ` [RFC v11][PATCH 04/13] x86 support for checkpoint/restart Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 06/13] Restore " Oren Laadan
                     ` (12 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linux Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
it will be followed by the file name. Then come the actual contents,
in one or more chunks: each chunk begins with a header that specifies
how many pages it holds, then the virtual addresses of all the dumped
pages in that chunk, followed by the actual contents of all dumped
pages. A header with a page count of zero marks the end of the contents.
Then comes the next VMA and so on.
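
As a rough illustration of the chunk layout described above, here is a
minimal user-space sketch of a reader that walks the page data of one
VMA and counts the dumped pages. It is not part of the patch. It assumes
4 KiB pages, and it treats each chunk header as a bare 64-bit page count,
leaving out the generic 'struct cr_hdr' that the kernel writes in front
of every object. The names sketch_count_vma_pages() and SKETCH_PAGE_SIZE
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)	/* assumption: 4 KiB pages */

/*
 * Hypothetical reader for the page data of a single VMA.
 * Layout assumed per chunk: u64 nr_pages, then nr_pages u64 virtual
 * addresses, then nr_pages pages of raw contents. A chunk header with
 * nr_pages == 0 terminates the VMA.
 *
 * Returns the total number of pages seen, or -1 if the buffer is
 * truncated.
 */
static long sketch_count_vma_pages(const unsigned char *buf, size_t len)
{
	const size_t per_page = sizeof(uint64_t) + SKETCH_PAGE_SIZE;
	size_t pos = 0;
	long total = 0;

	assert(buf != NULL);

	for (;;) {
		uint64_t nr_pages;

		/* chunk header */
		if (len - pos < sizeof(nr_pages))
			return -1;
		memcpy(&nr_pages, buf + pos, sizeof(nr_pages));
		pos += sizeof(nr_pages);

		/* a zero page count marks the end of the contents */
		if (nr_pages == 0)
			return total;

		/* the addresses and the contents must both fit */
		if (nr_pages > (len - pos) / per_page)
			return -1;

		/* skip all vaddrs, then all page contents */
		pos += (size_t)nr_pages * per_page;
		total += (long)nr_pages;
	}
}
```

A restart-side reader would copy the addresses and contents out of the
stream rather than skip over them; the walk over the chunks would be the
same.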

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory

Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in cr_fill_fname()

Changelog[v9]:
  - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem
    namespace boundary). For now, cr_fill_fname() fails the checkpoint.

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory dump code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages

Changelog[v4]:
  - Use standard list_... for cr_pgarr

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/checkpoint_hdr.h |    5 +
 arch/x86/mm/checkpoint.c              |   31 ++
 checkpoint/Makefile                   |    3 +-
 checkpoint/checkpoint.c               |   88 ++++++
 checkpoint/checkpoint_arch.h          |    2 +
 checkpoint/checkpoint_mem.h           |   41 +++
 checkpoint/ckpt_mem.c                 |  503 +++++++++++++++++++++++++++++++++
 checkpoint/sys.c                      |   10 +
 include/linux/checkpoint.h            |   12 +
 include/linux/checkpoint_hdr.h        |   32 ++
 10 files changed, 726 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/checkpoint_mem.h
 create mode 100644 checkpoint/ckpt_mem.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 6325062..33f4c70 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -82,4 +82,9 @@ struct cr_hdr_cpu {
 	/* thread_xstate contents follow (if used_math) */
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm_context {
+	__s16 ldt_entry_size;
+	__s16 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 8dd6d2d..757936e 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -221,3 +221,34 @@ int cr_write_head_arch(struct cr_ctx *ctx)
 
 	return ret;
 }
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_MM_CONTEXT;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	mutex_lock(&mm->context.lock);
+
+	hh->ldt_entry_size = LDT_ENTRY_SIZE;
+	hh->nldt = mm->context.size;
+
+	cr_debug("nldt %d\n", hh->nldt);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	ret = cr_kwrite(ctx, mm->context.ldt,
+			mm->context.size * LDT_ENTRY_SIZE);
+
+ out:
+	mutex_unlock(&mm->context.lock);
+	return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index d2df68c..3a0df6d 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+		ckpt_mem.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 17cc8d2..56d0ec2 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -13,6 +13,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -75,6 +76,66 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
 	return cr_write_obj(ctx, &h, str);
 }
 
+/**
+ * cr_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ */
+static char *
+cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *n);
+	spin_unlock(&dcache_lock);
+	if (!IS_ERR(fname))
+		*n = (buf + (*n) - fname);
+	/*
+	 * FIXME: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * cr_write_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
+{
+	struct cr_hdr h;
+	char *buf, *fname;
+	int ret, flen;
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = cr_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		h.type = CR_HDR_FNAME;
+		h.len = flen;
+		h.parent = 0;
+		ret = cr_write_obj(ctx, &h, fname);
+	} else
+		ret = PTR_ERR(fname);
+
+	kfree(buf);
+	return ret;
+}
+
 /* write the checkpoint header */
 static int cr_write_head(struct cr_ctx *ctx)
 {
@@ -168,6 +229,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_mm(ctx, t);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
@@ -178,10 +243,33 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	struct fs_struct *fs;
+
+	ctx->root_pid = pid;
+
+	/*
+	 * assume checkpointer is in container's root vfs
+	 * FIXME: this works for now, but will change with real containers
+	 */
+
+	fs = current->fs;
+	read_lock(&fs->lock);
+	ctx->fs_mnt = fs->root;
+	path_get(&ctx->fs_mnt);
+	read_unlock(&fs->lock);
+
+	return 0;
+}
+
 int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_ctx_checkpoint(ctx, pid);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index ada1369..f06c7eb 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -3,6 +3,8 @@
 extern int cr_write_head_arch(struct cr_ctx *ctx);
 extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
 extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_mm_context(struct cr_ctx *ctx,
+			       struct mm_struct *mm, int parent);
 
 extern int cr_read_head_arch(struct cr_ctx *ctx);
 extern int cr_read_thread(struct cr_ctx *ctx);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
new file mode 100644
index 0000000..85546f4
--- /dev/null
+++ b/checkpoint/checkpoint_mem.h
@@ -0,0 +1,41 @@
+#ifndef _CHECKPOINT_CKPT_MEM_H_
+#define _CHECKPOINT_CKPT_MEM_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/mm_types.h>
+
+/*
+ * page-array chains: each cr_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct cr_pgarr {
+	unsigned long *vaddrs;
+	struct page **pages;
+	unsigned int nr_used;
+	struct list_head list;
+};
+
+#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
+#define CR_PGARR_CHUNK  (4 * CR_PGARR_TOTAL)
+
+extern void cr_pgarr_free(struct cr_ctx *ctx);
+extern struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx);
+extern void cr_pgarr_reset_all(struct cr_ctx *ctx);
+
+static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
+{
+	return (pgarr->nr_used == CR_PGARR_TOTAL);
+}
+
+#endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
new file mode 100644
index 0000000..a2fcdbf
--- /dev/null
+++ b/checkpoint/ckpt_mem.c
@@ -0,0 +1,503 @@
+/*
+ *  Checkpoint memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has one member for page-arrays:
+ *   ctx->pgarr_list: list head of the page-array chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * This "current" page-array advances as necessary, and new page-array
+ * descriptors are allocated on-demand. Before the next chunk of pages,
+ * the chain is reset but not freed (that is, dereference page pointers).
+ */
+
+/* return first page-array in the chain */
+static inline struct cr_pgarr *cr_pgarr_first(struct cr_ctx *ctx)
+{
+	if (list_empty(&ctx->pgarr_list))
+		return NULL;
+	return list_first_entry(&ctx->pgarr_list, struct cr_pgarr, list);
+}
+
+/* release pages referenced by a page-array */
+static void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
+{
+	int i;
+
+	cr_debug("nr_used %d\n", pgarr->nr_used);
+	/*
+	 * although both checkpoint and restart use 'nr_used', we only
+	 * collect pages during checkpoint; in restart we simply return
+	 */
+	if (!pgarr->pages)
+		return;
+	for (i = pgarr->nr_used; i--; /**/)
+		page_cache_release(pgarr->pages[i]);
+}
+
+/* free a single page-array object */
+static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
+{
+	cr_pgarr_release_pages(pgarr);
+	kfree(pgarr->pages);
+	kfree(pgarr->vaddrs);
+	kfree(pgarr);
+}
+
+/* free a chain of page-arrays */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr, *tmp;
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+		list_del(&pgarr->list);
+		cr_pgarr_free_one(pgarr);
+	}
+}
+
+/* allocate a single page-array object */
+static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+
+	pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
+				GFP_KERNEL);
+	if (!pgarr->vaddrs)
+		goto nomem;
+
+	/* pgarr->pages is needed only for checkpoint */
+	if (flags & CR_CTX_CKPT) {
+		pgarr->pages = kmalloc(CR_PGARR_TOTAL * sizeof(struct page *),
+				       GFP_KERNEL);
+		if (!pgarr->pages)
+			goto nomem;
+	}
+
+	return pgarr;
+
+ nomem:
+	cr_pgarr_free_one(pgarr);
+	return NULL;
+}
+
+/* cr_pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the first page-array in the list that has space. Extends the
+ * list if none has space.
+ */
+struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = cr_pgarr_first(ctx);
+	if (pgarr && !cr_pgarr_is_full(pgarr))
+		goto out;
+	pgarr = cr_pgarr_alloc_one(ctx->flags);
+	if (!pgarr)
+		goto out;
+	list_add(&pgarr->list, &ctx->pgarr_list);
+ out:
+	return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+void cr_pgarr_reset_all(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
+		cr_pgarr_release_pages(pgarr);
+		pgarr->nr_used = 0;
+	}
+}
+
+/*
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * cr_private_follow_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that corresponds to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ *
+ * This function should _only_ be called for private vma's.
+ */
+static struct page *
+cr_private_follow_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct page *page;
+
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	/*
+	 * simplified version of get_user_pages(): already have vma,
+	 * only need FOLL_ANON, and (for now) ignore fault stats.
+	 *
+	 * follow_page() will return NULL if the page is not present
+	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+	 * the actual page pointer otherwise.
+	 *
+	 * FIXME: consolidate with get_user_pages()
+	 */
+
+	cond_resched();
+	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+		int ret;
+
+		/* the page is swapped out - bring it in (optimize ?) */
+		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+		if (ret & VM_FAULT_ERROR) {
+			if (ret & VM_FAULT_OOM)
+				return ERR_PTR(-ENOMEM);
+			else if (ret & VM_FAULT_SIGBUS)
+				return ERR_PTR(-EFAULT);
+			else
+				BUG();
+			break;
+		}
+		cond_resched();
+	}
+
+	if (IS_ERR(page))
+		return page;
+
+	/*
+	 * We only care about dirty pages: either a non-zero page, or a
+	 * file-backed (copy-on-write) page that was touched. For the
+	 * latter, page_mapping() will return NULL because the page is
+	 * no longer mapped to the original file after having been
+	 * modified.
+	if (page == ZERO_PAGE(0)) {
+		/* this is the zero page: ignore */
+		page_cache_release(page);
+		page = NULL;
+	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
+		/* file backed clean cow: ignore */
+		page_cache_release(page);
+		page = NULL;
+	}
+
+	return page;
+}
+
+/**
+ * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @pgarr - page-array to fill
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected
+ */
+static int
+cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
+			  struct vm_area_struct *vma, unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	int orig_used = pgarr->nr_used;
+
+	/* this function is only for private memory (anon or file-mapped) */
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	while (addr < end) {
+		struct page *page;
+
+		page = cr_private_follow_page(vma, addr);
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+
+		if (page) {
+			pgarr->pages[pgarr->nr_used] = page;
+			pgarr->vaddrs[pgarr->nr_used] = addr;
+			pgarr->nr_used++;
+		}
+
+		addr += PAGE_SIZE;
+
+		if (cr_pgarr_is_full(pgarr))
+			break;
+	}
+
+	*start = addr;
+	return pgarr->nr_used - orig_used;
+}
+
+/* dump the contents of a page: use kmap_atomic() to avoid TLB flush */
+static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return cr_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
+{
+	struct cr_pgarr *pgarr;
+	char *buf;
+	int i, ret = 0;
+
+	if (!total)
+		return 0;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		ret = cr_kwrite(ctx, pgarr->vaddrs,
+				pgarr->nr_used * sizeof(*pgarr->vaddrs));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		for (i = 0; i < pgarr->nr_used; i++) {
+			ret = cr_page_write(ctx, pgarr->pages[i], buf);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * cr_write_private_vma_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect lists of pages that need to be dumped, and corresponding
+ * virtual addresses into ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int
+cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_pgarr *hh;
+	unsigned long addr = vma->vm_start;
+	struct cr_pgarr *pgarr;
+	unsigned long cnt = 0;
+	int ret;
+
+	/*
+	 * Work iteratively, collecting and dumping at most CR_PGARR_CHUNK
+	 * in each round. Each iteration is divided into two steps:
+	 *
+	 * (1) scan: scan through the PTEs of the vma to collect the pages
+	 * to dump (later we'll also make them COW), while keeping a list
+	 * of pages and their corresponding addresses on ctx->pgarr_list.
+	 *
+	 * (2) dump: write out a header specifying how many pages, followed
+	 * by the addresses of all pages in ctx->pgarr_list, followed by
+	 * the actual contents of all pages. (Then, release the references
+	 * to the pages and reset the page-array chain).
+	 *
+	 * (This split makes the logic simpler by first counting the pages
+	 * that need saving. More importantly, it allows for a future
+	 * optimization that will reduce application downtime by deferring
+	 * the actual write-out of the data to after the application is
+	 * allowed to resume execution).
+	 *
+	 * After dumping the entire contents, conclude with a header that
+	 * specifies 0 pages to mark the end of the contents.
+	 */
+
+	h.type = CR_HDR_PGARR;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	while (addr < vma->vm_end) {
+		pgarr = cr_pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		ret = cr_private_vma_fill_pgarr(ctx, pgarr, vma, &addr);
+		if (ret < 0)
+			return ret;
+		cnt += ret;
+
+		/* did we complete a chunk, or is this the last chunk ? */
+		if (cnt >= CR_PGARR_CHUNK || (cnt && addr == vma->vm_end)) {
+			hh = cr_hbuf_get(ctx, sizeof(*hh));
+			hh->nr_pages = cnt;
+			ret = cr_write_obj(ctx, &h, hh);
+			cr_hbuf_put(ctx, sizeof(*hh));
+			if (ret < 0)
+				return ret;
+
+			ret = cr_vma_dump_pages(ctx, cnt);
+			if (ret < 0)
+				return ret;
+
+			cr_pgarr_reset_all(ctx);
+		}
+	}
+
+	/* mark end of contents with header saying "0" pages */
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	hh->nr_pages = 0;
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
+
+static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int vma_type, ret;
+
+	h.type = CR_HDR_VMA;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->vm_start = vma->vm_start;
+	hh->vm_end = vma->vm_end;
+	hh->vm_page_prot = vma->vm_page_prot.pgprot;
+	hh->vm_flags = vma->vm_flags;
+	hh->vm_pgoff = vma->vm_pgoff;
+
+#define CR_BAD_VM_FLAGS  \
+	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
+
+	if (vma->vm_flags & CR_BAD_VM_FLAGS) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		cr_hbuf_put(ctx, sizeof(*hh));
+		return -ENOSYS;
+	}
+
+	/* by default assume anon memory */
+	vma_type = CR_VMA_ANON;
+
+	/*
+	 * if there is a backing file, assume private-mapped
+	 * (FIXME: check if the file is unlinked)
+	 */
+	if (vma->vm_file)
+		vma_type = CR_VMA_FILE;
+
+	hh->vma_type = vma_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	/* save the file name, if relevant */
+	if (vma->vm_file) {
+		ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);
+		if (ret < 0)
+			return ret;
+	}
+
+	return cr_write_private_vma_contents(ctx, vma);
+}
+
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int objref, ret;
+
+	h.type = CR_HDR_MM;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	mm = get_task_mm(t);
+
+	objref = 0;	/* will be meaningful with multiple processes */
+	hh->objref = objref;
+
+	down_read(&mm->mmap_sem);
+
+	hh->start_code = mm->start_code;
+	hh->end_code = mm->end_code;
+	hh->start_data = mm->start_data;
+	hh->end_data = mm->end_data;
+	hh->start_brk = mm->start_brk;
+	hh->brk = mm->brk;
+	hh->start_stack = mm->start_stack;
+	hh->arg_start = mm->arg_start;
+	hh->arg_end = mm->arg_end;
+	hh->env_start = mm->env_start;
+	hh->env_end = mm->env_end;
+
+	hh->map_count = mm->map_count;
+
+	/* FIX: need also mm->flags */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ret = cr_write_vma(ctx, vma);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_write_mm_context(ctx, mm, objref);
+
+ out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index bd14ef9..c547a1c 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -16,6 +16,8 @@
 #include <linux/capability.h>
 #include <linux/checkpoint.h>
 
+#include "checkpoint_mem.h"
+
 /*
  * Helpers to write(read) from(to) kernel space to(from) the checkpoint
  * image file descriptor (similar to how a core-dump is performed).
@@ -131,7 +133,13 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
+
 	kfree(ctx->hbuf);
+
+	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
+
+	cr_pgarr_free(ctx);
+
 	kfree(ctx);
 }
 
@@ -146,6 +154,8 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	INIT_LIST_HEAD(&ctx->pgarr_list);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 63f298f..4e97f9f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,6 +10,9 @@
  *  distribution for more details.
  */
 
+#include <linux/path.h>
+#include <linux/fs.h>
+
 #define CR_VERSION  1
 
 struct cr_ctx {
@@ -25,6 +28,10 @@ struct cr_ctx {
 
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
+
+	struct list_head pgarr_list;	/* page array to dump VMA contents */
+
+	struct path fs_mnt;	/* container root (FIXME) */
 };
 
 /* cr_ctx: flags */
@@ -42,6 +49,8 @@ struct cr_hdr;
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
 extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
 extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+extern int cr_write_fname(struct cr_ctx *ctx,
+			  struct path *path, struct path *root);
 
 extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
 extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
@@ -50,7 +59,10 @@ extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
 extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+extern int cr_read_mm(struct cr_ctx *ctx);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b74b5f9..d78f0f1 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -34,6 +34,7 @@ enum {
 	CR_HDR_HEAD_ARCH,
 	CR_HDR_BUFFER,
 	CR_HDR_STRING,
+	CR_HDR_FNAME,
 
 	CR_HDR_TASK = 101,
 	CR_HDR_THREAD,
@@ -41,6 +42,7 @@ enum {
 
 	CR_HDR_MM = 201,
 	CR_HDR_VMA,
+	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
 	CR_HDR_TAIL = 5001
@@ -75,4 +77,34 @@ struct cr_hdr_task {
 	__s32 task_comm_len;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 map_count;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vm_type {
+	CR_VMA_ANON = 1,
+	CR_VMA_FILE
+};
+
+struct cr_hdr_vma {
+	__u32 vma_type;
+	__u32 _padding;
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pgarr {
+	__u64 nr_pages;		/* number of pages saved */
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3
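
The page-array chain that this patch adds in checkpoint/checkpoint_mem.h
and checkpoint/ckpt_mem.c can be modelled in user space as follows. This
is only a sketch under stated assumptions, not the kernel code: it keeps
the virtual addresses only, uses a plain singly linked list instead of
'struct list_head', and uses a tiny capacity so the behaviour is easy to
follow. All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* assumption: tiny capacity; the patch uses PAGE_SIZE / sizeof(void *) */
#define SKETCH_PGARR_TOTAL 4

struct sketch_pgarr {
	unsigned long vaddrs[SKETCH_PGARR_TOTAL];
	unsigned int nr_used;
	struct sketch_pgarr *next;
};

/* return the head array if it has room, else prepend a fresh one */
static struct sketch_pgarr *sketch_pgarr_current(struct sketch_pgarr **head)
{
	struct sketch_pgarr *p;

	assert(head != NULL);
	p = *head;
	if (p && p->nr_used < SKETCH_PGARR_TOTAL)
		return p;

	p = calloc(1, sizeof(*p));
	if (!p)
		return NULL;
	p->next = *head;	/* new arrays go to the head of the chain */
	*head = p;
	return p;
}

/* record one virtual address; returns 0 on success, -1 on no memory */
static int sketch_pgarr_add(struct sketch_pgarr **head, unsigned long vaddr)
{
	struct sketch_pgarr *p = sketch_pgarr_current(head);

	if (!p)
		return -1;
	p->vaddrs[p->nr_used++] = vaddr;
	return 0;
}

/* number of arrays currently in the chain */
static unsigned int sketch_pgarr_count(const struct sketch_pgarr *head)
{
	unsigned int n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}

/* free the whole chain */
static void sketch_pgarr_free(struct sketch_pgarr *head)
{
	while (head) {
		struct sketch_pgarr *next = head->next;

		free(head);
		head = next;
	}
}
```

Because new arrays are prepended, the oldest entries sit at the tail of
the chain, which is why the dump code in the patch walks the list in
reverse.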


* [RFC v11][PATCH 05/13] Dump memory address space
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-12-05 17:31   ` [RFC v11][PATCH 01/13] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart Oren Laadan
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linus Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

For each VMA, there is a 'struct cr_hdr_vma'; if the VMA is file-mapped,
it is followed by the file name. Then come the actual contents, in one
or more chunks: each chunk begins with a header that specifies how many
pages it holds, then the virtual addresses of all the dumped pages in
that chunk, followed by the actual contents of all the dumped pages. A
header with a page count of zero marks the end of the contents. Then
comes the next VMA, and so on.
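
To make the write side of this layout concrete, here is a small
user-space sketch that serialises the dirty pages of one VMA into a
buffer, splitting them into chunks of at most chunk_max pages and ending
with a zero-count header. It is an illustration only, under the
following assumptions: 4 KiB pages, a bare 64-bit page count as the chunk
header (the generic 'struct cr_hdr' is left out), and page contents that
are already available as one contiguous array. sketch_emit_vma_pages()
and SKETCH_PAGE_SIZE are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)	/* assumption: 4 KiB pages */

/*
 * Hypothetical writer: emit n pages as chunks of at most chunk_max
 * pages each. Every chunk is: u64 count, count u64 vaddrs, count pages.
 * A final header with count 0 marks the end.
 *
 * Returns the number of bytes written, or 0 if 'out' is too small.
 */
static size_t sketch_emit_vma_pages(unsigned char *out, size_t cap,
				    const uint64_t *vaddrs,
				    const unsigned char *pages,
				    size_t n, size_t chunk_max)
{
	const uint64_t zero = 0;
	size_t pos = 0, done = 0;

	assert(out != NULL && chunk_max > 0);

	while (done < n) {
		size_t cnt = (n - done < chunk_max) ? n - done : chunk_max;
		uint64_t hdr = cnt;
		size_t need = sizeof(hdr) +
			      cnt * (sizeof(uint64_t) + SKETCH_PAGE_SIZE);

		if (cap - pos < need)
			return 0;

		/* header: how many pages this chunk holds */
		memcpy(out + pos, &hdr, sizeof(hdr));
		pos += sizeof(hdr);

		/* first all virtual addresses of the chunk ... */
		memcpy(out + pos, vaddrs + done, cnt * sizeof(uint64_t));
		pos += cnt * sizeof(uint64_t);

		/* ... then the contents of all pages of the chunk */
		memcpy(out + pos, pages + done * SKETCH_PAGE_SIZE,
		       cnt * SKETCH_PAGE_SIZE);
		pos += cnt * SKETCH_PAGE_SIZE;

		done += cnt;
	}

	/* end of contents: header with zero pages */
	if (cap - pos < sizeof(zero))
		return 0;
	memcpy(out + pos, &zero, sizeof(zero));
	pos += sizeof(zero);

	return pos;
}
```

Writing the addresses of a chunk before its contents lets a reader learn
where every page belongs with a single read before it touches the page
data.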

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory

Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in cr_fill_fname()

Changelog[v9]:
  - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem
    namespace boundary). For now, cr_fill_fname() fails the checkpoint.

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory dump code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages

Changelog[v4]:
  - Use standard list_... for cr_pgarr

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/checkpoint_hdr.h |    5 +
 arch/x86/mm/checkpoint.c              |   31 ++
 checkpoint/Makefile                   |    3 +-
 checkpoint/checkpoint.c               |   88 ++++++
 checkpoint/checkpoint_arch.h          |    2 +
 checkpoint/checkpoint_mem.h           |   41 +++
 checkpoint/ckpt_mem.c                 |  503 +++++++++++++++++++++++++++++++++
 checkpoint/sys.c                      |   10 +
 include/linux/checkpoint.h            |   12 +
 include/linux/checkpoint_hdr.h        |   32 ++
 10 files changed, 726 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/checkpoint_mem.h
 create mode 100644 checkpoint/ckpt_mem.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 6325062..33f4c70 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -82,4 +82,9 @@ struct cr_hdr_cpu {
 	/* thread_xstate contents follow (if used_math) */
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm_context {
+	__s16 ldt_entry_size;
+	__s16 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 8dd6d2d..757936e 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -221,3 +221,34 @@ int cr_write_head_arch(struct cr_ctx *ctx)
 
 	return ret;
 }
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_MM_CONTEXT;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	mutex_lock(&mm->context.lock);
+
+	hh->ldt_entry_size = LDT_ENTRY_SIZE;
+	hh->nldt = mm->context.size;
+
+	cr_debug("nldt %d\n", hh->nldt);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	ret = cr_kwrite(ctx, mm->context.ldt,
+			mm->context.size * LDT_ENTRY_SIZE);
+
+ out:
+	mutex_unlock(&mm->context.lock);
+	return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index d2df68c..3a0df6d 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+		ckpt_mem.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 17cc8d2..56d0ec2 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -13,6 +13,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -75,6 +76,66 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
 	return cr_write_obj(ctx, &h, str);
 }
 
+/**
+ * cr_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ */
+static char *
+cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *n);
+	spin_unlock(&dcache_lock);
+	if (!IS_ERR(fname))
+		*n = (buf + (*n) - fname);
+	/*
+	 * FIXME: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * cr_write_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
+{
+	struct cr_hdr h;
+	char *buf, *fname;
+	int ret, flen;
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = cr_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		h.type = CR_HDR_FNAME;
+		h.len = flen;
+		h.parent = 0;
+		ret = cr_write_obj(ctx, &h, fname);
+	} else
+		ret = PTR_ERR(fname);
+
+	kfree(buf);
+	return ret;
+}
+
 /* write the checkpoint header */
 static int cr_write_head(struct cr_ctx *ctx)
 {
@@ -168,6 +229,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_mm(ctx, t);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
@@ -178,10 +243,33 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	struct fs_struct *fs;
+
+	ctx->root_pid = pid;
+
+	/*
+	 * assume checkpointer is in container's root vfs
+	 * FIXME: this works for now, but will change with real containers
+	 */
+
+	fs = current->fs;
+	read_lock(&fs->lock);
+	ctx->fs_mnt = fs->root;
+	path_get(&ctx->fs_mnt);
+	read_unlock(&fs->lock);
+
+	return 0;
+}
+
 int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_ctx_checkpoint(ctx, pid);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index ada1369..f06c7eb 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -3,6 +3,8 @@
 extern int cr_write_head_arch(struct cr_ctx *ctx);
 extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
 extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_mm_context(struct cr_ctx *ctx,
+			       struct mm_struct *mm, int parent);
 
 extern int cr_read_head_arch(struct cr_ctx *ctx);
 extern int cr_read_thread(struct cr_ctx *ctx);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
new file mode 100644
index 0000000..85546f4
--- /dev/null
+++ b/checkpoint/checkpoint_mem.h
@@ -0,0 +1,41 @@
+#ifndef _CHECKPOINT_CKPT_MEM_H_
+#define _CHECKPOINT_CKPT_MEM_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/mm_types.h>
+
+/*
+ * page-array chains: each cr_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct cr_pgarr {
+	unsigned long *vaddrs;
+	struct page **pages;
+	unsigned int nr_used;
+	struct list_head list;
+};
+
+#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
+#define CR_PGARR_CHUNK  (4 * CR_PGARR_TOTAL)
+
+extern void cr_pgarr_free(struct cr_ctx *ctx);
+extern struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx);
+extern void cr_pgarr_reset_all(struct cr_ctx *ctx);
+
+static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
+{
+	return (pgarr->nr_used == CR_PGARR_TOTAL);
+}
+
+#endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
new file mode 100644
index 0000000..a2fcdbf
--- /dev/null
+++ b/checkpoint/ckpt_mem.c
@@ -0,0 +1,503 @@
+/*
+ *  Checkpoint memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has one member for page-arrays:
+ *   ctx->pgarr_list: list head of the page-array chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * This "current" page-array advances as necessary, and new page-array
+ * descriptors are allocated on-demand. Before the next chunk of pages,
+ * the chain is reset but not freed (that is, page references are dropped).
+ */
+
+/* return first page-array in the chain */
+static inline struct cr_pgarr *cr_pgarr_first(struct cr_ctx *ctx)
+{
+	if (list_empty(&ctx->pgarr_list))
+		return NULL;
+	return list_first_entry(&ctx->pgarr_list, struct cr_pgarr, list);
+}
+
+/* release pages referenced by a page-array */
+static void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
+{
+	int i;
+
+	cr_debug("nr_used %d\n", pgarr->nr_used);
+	/*
+	 * although both checkpoint and restart use 'nr_used', we only
+	 * collect pages during checkpoint; in restart we simply return
+	 */
+	if (!pgarr->pages)
+		return;
+	for (i = pgarr->nr_used; i--; /**/)
+		page_cache_release(pgarr->pages[i]);
+}
+
+/* free a single page-array object */
+static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
+{
+	cr_pgarr_release_pages(pgarr);
+	kfree(pgarr->pages);
+	kfree(pgarr->vaddrs);
+	kfree(pgarr);
+}
+
+/* free a chain of page-arrays */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr, *tmp;
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+		list_del(&pgarr->list);
+		cr_pgarr_free_one(pgarr);
+	}
+}
+
+/* allocate a single page-array object */
+static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+
+	pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
+				GFP_KERNEL);
+	if (!pgarr->vaddrs)
+		goto nomem;
+
+	/* pgarr->pages is needed only for checkpoint */
+	if (flags & CR_CTX_CKPT) {
+		pgarr->pages = kmalloc(CR_PGARR_TOTAL * sizeof(struct page *),
+				       GFP_KERNEL);
+		if (!pgarr->pages)
+			goto nomem;
+	}
+
+	return pgarr;
+
+ nomem:
+	cr_pgarr_free_one(pgarr);
+	return NULL;
+}
+
+/* cr_pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the first page-array in the list that has space. Extends the
+ * list if none has space.
+ */
+struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = cr_pgarr_first(ctx);
+	if (pgarr && !cr_pgarr_is_full(pgarr))
+		goto out;
+	pgarr = cr_pgarr_alloc_one(ctx->flags);
+	if (!pgarr)
+		goto out;
+	list_add(&pgarr->list, &ctx->pgarr_list);
+ out:
+	return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+void cr_pgarr_reset_all(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
+		cr_pgarr_release_pages(pgarr);
+		pgarr->nr_used = 0;
+	}
+}
+
+/*
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * cr_private_follow_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that corresponds to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ *
+ * This function should _only_ be called for private vma's.
+ */
+static struct page *
+cr_private_follow_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct page *page;
+
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	/*
+	 * simplified version of get_user_pages(): already have vma,
+	 * only need FOLL_ANON, and (for now) ignore fault stats.
+	 *
+	 * follow_page() will return NULL if the page is not present
+	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+	 * the actual page pointer otherwise.
+	 *
+	 * FIXME: consolidate with get_user_pages()
+	 */
+
+	cond_resched();
+	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+		int ret;
+
+		/* the page is swapped out - bring it in (optimize ?) */
+		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+		if (ret & VM_FAULT_ERROR) {
+			if (ret & VM_FAULT_OOM)
+				return ERR_PTR(-ENOMEM);
+			else if (ret & VM_FAULT_SIGBUS)
+				return ERR_PTR(-EFAULT);
+			else
+				BUG();
+			break;
+		}
+		cond_resched();
+	}
+
+	if (IS_ERR(page))
+		return page;
+
+	/*
+	 * We only care about dirty pages: either a non-zero page, or a
+	 * file-backed (copy-on-write) page that was touched. For the latter,
+	 * page_mapping() will be unset because the page is no longer
+	 * mapped to the original file after having been modified.
+	 */
+	if (page == ZERO_PAGE(0)) {
+		/* this is the zero page: ignore */
+		page_cache_release(page);
+		page = NULL;
+	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
+		/* file backed clean cow: ignore */
+		page_cache_release(page);
+		page = NULL;
+	}
+
+	return page;
+}
+
+/**
+ * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @pgarr - page-array to fill
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected
+ */
+static int
+cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
+			  struct vm_area_struct *vma, unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	int orig_used = pgarr->nr_used;
+
+	/* this function is only for private memory (anon or file-mapped) */
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	while (addr < end) {
+		struct page *page;
+
+		page = cr_private_follow_page(vma, addr);
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+
+		if (page) {
+			pgarr->pages[pgarr->nr_used] = page;
+			pgarr->vaddrs[pgarr->nr_used] = addr;
+			pgarr->nr_used++;
+		}
+
+		addr += PAGE_SIZE;
+
+		if (cr_pgarr_is_full(pgarr))
+			break;
+	}
+
+	*start = addr;
+	return pgarr->nr_used - orig_used;
+}
+
+/* dump contents of a page: use kmap_atomic() to avoid TLB flush */
+static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return cr_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
+{
+	struct cr_pgarr *pgarr;
+	char *buf;
+	int i, ret = 0;
+
+	if (!total)
+		return 0;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		ret = cr_kwrite(ctx, pgarr->vaddrs,
+				pgarr->nr_used * sizeof(*pgarr->vaddrs));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		for (i = 0; i < pgarr->nr_used; i++) {
+			ret = cr_page_write(ctx, pgarr->pages[i], buf);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * cr_write_private_vma_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect the pages that need to be dumped, and their corresponding
+ * virtual addresses, into the ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int
+cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_pgarr *hh;
+	unsigned long addr = vma->vm_start;
+	struct cr_pgarr *pgarr;
+	unsigned long cnt = 0;
+	int ret;
+
+	/*
+	 * Work iteratively, collecting and dumping at most CR_PGARR_CHUNK
+	 * pages in each round. Each iteration is divided into two steps:
+	 *
+	 * (1) scan: scan through the PTEs of the vma to collect the pages
+	 * to dump (later we'll also make them COW), while keeping a list
+	 * of pages and their corresponding addresses on ctx->pgarr_list.
+	 *
+	 * (2) dump: write out a header specifying how many pages, followed
+	 * by the addresses of all pages in ctx->pgarr_list, followed by
+	 * the actual contents of all pages. (Then, release the references
+	 * to the pages and reset the page-array chain).
+	 *
+	 * (This split makes the logic simpler by first counting the pages
+	 * that need saving. More importantly, it allows for a future
+	 * optimization that will reduce application downtime by deferring
+	 * the actual write-out of the data to after the application is
+	 * allowed to resume execution).
+	 *
+	 * After dumping the entire contents, conclude with a header that
+	 * specifies 0 pages to mark the end of the contents.
+	 */
+
+	h.type = CR_HDR_PGARR;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	while (addr < vma->vm_end) {
+		pgarr = cr_pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		ret = cr_private_vma_fill_pgarr(ctx, pgarr, vma, &addr);
+		if (ret < 0)
+			return ret;
+		cnt += ret;
+
+		/* did we complete a chunk, or is this the last chunk ? */
+		if (cnt >= CR_PGARR_CHUNK || (cnt && addr == vma->vm_end)) {
+			hh = cr_hbuf_get(ctx, sizeof(*hh));
+			hh->nr_pages = cnt;
+			ret = cr_write_obj(ctx, &h, hh);
+			cr_hbuf_put(ctx, sizeof(*hh));
+			if (ret < 0)
+				return ret;
+
+			ret = cr_vma_dump_pages(ctx, cnt);
+			if (ret < 0)
+				return ret;
+			cr_pgarr_reset_all(ctx);
+			cnt = 0;	/* each header counts only its own chunk */
+		}
+	}
+
+	/* mark end of contents with header saying "0" pages */
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	hh->nr_pages = 0;
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
+
+static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int vma_type, ret;
+
+	h.type = CR_HDR_VMA;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->vm_start = vma->vm_start;
+	hh->vm_end = vma->vm_end;
+	hh->vm_page_prot = vma->vm_page_prot.pgprot;
+	hh->vm_flags = vma->vm_flags;
+	hh->vm_pgoff = vma->vm_pgoff;
+
+#define CR_BAD_VM_FLAGS  \
+	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
+
+	if (vma->vm_flags & CR_BAD_VM_FLAGS) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		cr_hbuf_put(ctx, sizeof(*hh));
+		return -ENOSYS;
+	}
+
+	/* by default assume anon memory */
+	vma_type = CR_VMA_ANON;
+
+	/*
+	 * if there is a backing file, assume private-mapped
+	 * (FIXME: check if the file is unlinked)
+	 */
+	if (vma->vm_file)
+		vma_type = CR_VMA_FILE;
+
+	hh->vma_type = vma_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	/* save the file name, if relevant */
+	if (vma->vm_file) {
+		ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);
+		if (ret < 0)
+			return ret;
+	}
+
+	return cr_write_private_vma_contents(ctx, vma);
+}
+
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int objref, ret;
+
+	h.type = CR_HDR_MM;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	mm = get_task_mm(t);
+
+	objref = 0;	/* will be meaningful with multiple processes */
+	hh->objref = objref;
+
+	down_read(&mm->mmap_sem);
+
+	hh->start_code = mm->start_code;
+	hh->end_code = mm->end_code;
+	hh->start_data = mm->start_data;
+	hh->end_data = mm->end_data;
+	hh->start_brk = mm->start_brk;
+	hh->brk = mm->brk;
+	hh->start_stack = mm->start_stack;
+	hh->arg_start = mm->arg_start;
+	hh->arg_end = mm->arg_end;
+	hh->env_start = mm->env_start;
+	hh->env_end = mm->env_end;
+
+	hh->map_count = mm->map_count;
+
+	/* FIXME: also need to save mm->flags */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ret = cr_write_vma(ctx, vma);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_write_mm_context(ctx, mm, objref);
+
+ out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index bd14ef9..c547a1c 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -16,6 +16,8 @@
 #include <linux/capability.h>
 #include <linux/checkpoint.h>
 
+#include "checkpoint_mem.h"
+
 /*
  * Helpers to write(read) from(to) kernel space to(from) the checkpoint
  * image file descriptor (similar to how a core-dump is performed).
@@ -131,7 +133,13 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
+
 	kfree(ctx->hbuf);
+
+	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
+
+	cr_pgarr_free(ctx);
+
 	kfree(ctx);
 }
 
@@ -146,6 +154,8 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	INIT_LIST_HEAD(&ctx->pgarr_list);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 63f298f..4e97f9f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,6 +10,9 @@
  *  distribution for more details.
  */
 
+#include <linux/path.h>
+#include <linux/fs.h>
+
 #define CR_VERSION  1
 
 struct cr_ctx {
@@ -25,6 +28,10 @@ struct cr_ctx {
 
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
+
+	struct list_head pgarr_list;	/* page array to dump VMA contents */
+
+	struct path fs_mnt;	/* container root (FIXME) */
 };
 
 /* cr_ctx: flags */
@@ -42,6 +49,8 @@ struct cr_hdr;
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
 extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
 extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+extern int cr_write_fname(struct cr_ctx *ctx,
+			  struct path *path, struct path *root);
 
 extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
 extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
@@ -50,7 +59,10 @@ extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
 extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+extern int cr_read_mm(struct cr_ctx *ctx);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b74b5f9..d78f0f1 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -34,6 +34,7 @@ enum {
 	CR_HDR_HEAD_ARCH,
 	CR_HDR_BUFFER,
 	CR_HDR_STRING,
+	CR_HDR_FNAME,
 
 	CR_HDR_TASK = 101,
 	CR_HDR_THREAD,
@@ -41,6 +42,7 @@ enum {
 
 	CR_HDR_MM = 201,
 	CR_HDR_VMA,
+	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
 	CR_HDR_TAIL = 5001
@@ -75,4 +77,34 @@ struct cr_hdr_task {
 	__s32 task_comm_len;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 map_count;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vm_type {
+	CR_VMA_ANON = 1,
+	CR_VMA_FILE
+};
+
+struct cr_hdr_vma {
+	__u32 vma_type;
+	__u32 _padding;
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pgarr {
+	__u64 nr_pages;		/* number of pages saved */
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3


* [RFC v11][PATCH 05/13] Dump memory address space
@ 2008-12-05 17:31   ` Oren Laadan
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

For each VMA, there is a 'struct cr_hdr_vma'; if the VMA is file-mapped,
it is followed by the file name. Then come the actual contents, in one
or more chunks: each chunk begins with a header that specifies how many
pages it holds, then the virtual addresses of all the dumped pages in
that chunk, followed by the actual contents of all dumped pages. A
header with zero pages marks the end of the contents. Then comes the
next VMA, and so on.
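
For illustration only, the per-VMA contents layout described above can be
walked from user space roughly as follows. This is a minimal sketch, not
part of the patch: it assumes 4 KiB pages and 64-bit virtual addresses,
and it ignores the generic 'struct cr_hdr' that precedes each record in
the real image, so the buffer here holds only the nr_pages field, the
addresses and the page data. The names sketch_count_vma_pages() and
SKETCH_PAGE_SIZE are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Walk the contents of one VMA laid out as a sequence of chunks:
 *   [u64 nr_pages][nr_pages x u64 vaddr][nr_pages x page data]
 * terminated by a chunk header whose nr_pages is 0.
 *
 * Returns the total number of pages found, or -1 if the buffer ends
 * before the terminating zero-page header is seen.
 */
static long sketch_count_vma_pages(const unsigned char *buf, size_t len)
{
	const size_t per_page = sizeof(uint64_t) + SKETCH_PAGE_SIZE;
	size_t pos = 0;
	long total = 0;

	for (;;) {
		uint64_t nr_pages;

		/* chunk header */
		if (len - pos < sizeof(nr_pages))
			return -1;
		memcpy(&nr_pages, buf + pos, sizeof(nr_pages));
		pos += sizeof(nr_pages);

		/* a header with zero pages marks the end of the contents */
		if (nr_pages == 0)
			return total;

		/* all addresses first, then all page contents */
		if (nr_pages > (len - pos) / per_page)
			return -1;
		pos += (size_t)nr_pages * per_page;
		total += (long)nr_pages;
	}
}
```

A reader built this way stops at the first zero-page header, so it never
needs to know the number of chunks in advance.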

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory

Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in cr_fill_fname()

Changelog[v9]:
  - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem
    namespace boundary). For now, cr_fill_fname() fails the checkpoint.

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory dump code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages

Changelog[v4]:
  - Use standard list_... for cr_pgarr
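
As a rough illustration of the chunked <vaddrs, pages> format mentioned
in the v5 changelog, the following user-space model computes the chunk
sizes that a scan-then-dump loop would emit for a VMA. It is a sketch
under stated assumptions, not the kernel code: every page is treated as
dirty, a page-array is assumed to hold 512 entries (4 KiB pages, 8-byte
pointers), a chunk is four page-arrays, and the page counter is reset
after each chunk so that every header describes only the pages that
follow it. The names sketch_chunk_sizes(), SKETCH_ARR_SLOTS and
SKETCH_CHUNK_PAGES are hypothetical.

```c
#include <stddef.h>

#define SKETCH_ARR_SLOTS   512u			/* assumed PAGE_SIZE / sizeof(void *) */
#define SKETCH_CHUNK_PAGES (4u * SKETCH_ARR_SLOTS)	/* pages per chunk */

/*
 * Model the scan/dump rounds for a VMA with 'dirty_pages' pages to save.
 * Stores the page count of each emitted chunk header in out[] (as far as
 * out_len allows), including the terminating zero header, and returns
 * the number of headers that would be written.
 */
static size_t sketch_chunk_sizes(size_t dirty_pages, size_t *out, size_t out_len)
{
	size_t remaining = dirty_pages;
	size_t cnt = 0;		/* pages collected for the current chunk */
	size_t n = 0;		/* headers emitted so far */

	while (remaining > 0) {
		/* scan: one round fills at most one page-array */
		size_t got = remaining < SKETCH_ARR_SLOTS ?
				remaining : SKETCH_ARR_SLOTS;

		remaining -= got;
		cnt += got;

		/* dump: a full chunk, or the last partial chunk of the VMA */
		if (cnt >= SKETCH_CHUNK_PAGES || remaining == 0) {
			if (n < out_len)
				out[n] = cnt;
			n++;
			cnt = 0;	/* start counting the next chunk */
		}
	}

	/* header with zero pages marks the end of the contents */
	if (n < out_len)
		out[n] = 0;
	n++;

	return n;
}
```

Under these assumptions, 5000 dirty pages produce chunks of 2048, 2048
and 904 pages followed by the zero terminator, and an untouched VMA
produces the terminator alone.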

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/checkpoint_hdr.h |    5 +
 arch/x86/mm/checkpoint.c              |   31 ++
 checkpoint/Makefile                   |    3 +-
 checkpoint/checkpoint.c               |   88 ++++++
 checkpoint/checkpoint_arch.h          |    2 +
 checkpoint/checkpoint_mem.h           |   41 +++
 checkpoint/ckpt_mem.c                 |  503 +++++++++++++++++++++++++++++++++
 checkpoint/sys.c                      |   10 +
 include/linux/checkpoint.h            |   12 +
 include/linux/checkpoint_hdr.h        |   32 ++
 10 files changed, 726 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/checkpoint_mem.h
 create mode 100644 checkpoint/ckpt_mem.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 6325062..33f4c70 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -82,4 +82,9 @@ struct cr_hdr_cpu {
 	/* thread_xstate contents follow (if used_math) */
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm_context {
+	__s16 ldt_entry_size;
+	__s16 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 8dd6d2d..757936e 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -221,3 +221,34 @@ int cr_write_head_arch(struct cr_ctx *ctx)
 
 	return ret;
 }
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_MM_CONTEXT;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	mutex_lock(&mm->context.lock);
+
+	hh->ldt_entry_size = LDT_ENTRY_SIZE;
+	hh->nldt = mm->context.size;
+
+	cr_debug("nldt %d\n", hh->nldt);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	ret = cr_kwrite(ctx, mm->context.ldt,
+			mm->context.size * LDT_ENTRY_SIZE);
+
+ out:
+	mutex_unlock(&mm->context.lock);
+	return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index d2df68c..3a0df6d 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+		ckpt_mem.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 17cc8d2..56d0ec2 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -13,6 +13,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -75,6 +76,66 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
 	return cr_write_obj(ctx, &h, str);
 }
 
+/**
+ * cr_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ */
+static char *
+cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *n);
+	spin_unlock(&dcache_lock);
+	if (!IS_ERR(fname))
+		*n = (buf + (*n) - fname);
+	/*
+	 * FIXME: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * cr_write_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
+{
+	struct cr_hdr h;
+	char *buf, *fname;
+	int ret, flen;
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = cr_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		h.type = CR_HDR_FNAME;
+		h.len = flen;
+		h.parent = 0;
+		ret = cr_write_obj(ctx, &h, fname);
+	} else
+		ret = PTR_ERR(fname);
+
+	kfree(buf);
+	return ret;
+}
+
 /* write the checkpoint header */
 static int cr_write_head(struct cr_ctx *ctx)
 {
@@ -168,6 +229,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_mm(ctx, t);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
@@ -178,10 +243,33 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	struct fs_struct *fs;
+
+	ctx->root_pid = pid;
+
+	/*
+	 * assume checkpointer is in container's root vfs
+	 * FIXME: this works for now, but will change with real containers
+	 */
+
+	fs = current->fs;
+	read_lock(&fs->lock);
+	ctx->fs_mnt = fs->root;
+	path_get(&ctx->fs_mnt);
+	read_unlock(&fs->lock);
+
+	return 0;
+}
+
 int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_ctx_checkpoint(ctx, pid);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index ada1369..f06c7eb 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -3,6 +3,8 @@
 extern int cr_write_head_arch(struct cr_ctx *ctx);
 extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
 extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_mm_context(struct cr_ctx *ctx,
+			       struct mm_struct *mm, int parent);
 
 extern int cr_read_head_arch(struct cr_ctx *ctx);
 extern int cr_read_thread(struct cr_ctx *ctx);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
new file mode 100644
index 0000000..85546f4
--- /dev/null
+++ b/checkpoint/checkpoint_mem.h
@@ -0,0 +1,41 @@
+#ifndef _CHECKPOINT_CKPT_MEM_H_
+#define _CHECKPOINT_CKPT_MEM_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/mm_types.h>
+
+/*
+ * page-array chains: each cr_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct cr_pgarr {
+	unsigned long *vaddrs;
+	struct page **pages;
+	unsigned int nr_used;
+	struct list_head list;
+};
+
+#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
+#define CR_PGARR_CHUNK  (4 * CR_PGARR_TOTAL)
+
+extern void cr_pgarr_free(struct cr_ctx *ctx);
+extern struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx);
+extern void cr_pgarr_reset_all(struct cr_ctx *ctx);
+
+static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
+{
+	return (pgarr->nr_used == CR_PGARR_TOTAL);
+}
+
+#endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
new file mode 100644
index 0000000..a2fcdbf
--- /dev/null
+++ b/checkpoint/ckpt_mem.c
@@ -0,0 +1,503 @@
+/*
+ *  Checkpoint memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has one member for page-arrays:
+ *   ctx->pgarr_list: list head of the page-array chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * This "current" page-array advances as necessary, and new page-array
+ * descriptors are allocated on-demand. Before the next chunk of pages,
+ * the chain is reset but not freed (that is, page references are dropped).
+ */
+
+/* return first page-array in the chain */
+static inline struct cr_pgarr *cr_pgarr_first(struct cr_ctx *ctx)
+{
+	if (list_empty(&ctx->pgarr_list))
+		return NULL;
+	return list_first_entry(&ctx->pgarr_list, struct cr_pgarr, list);
+}
+
+/* release pages referenced by a page-array */
+static void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
+{
+	int i;
+
+	cr_debug("nr_used %d\n", pgarr->nr_used);
+	/*
+	 * although both checkpoint and restart use 'nr_used', we only
+	 * collect pages during checkpoint; in restart we simply return
+	 */
+	if (!pgarr->pages)
+		return;
+	for (i = pgarr->nr_used; i--; /**/)
+		page_cache_release(pgarr->pages[i]);
+}
+
+/* free a single page-array object */
+static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
+{
+	cr_pgarr_release_pages(pgarr);
+	kfree(pgarr->pages);
+	kfree(pgarr->vaddrs);
+	kfree(pgarr);
+}
+
+/* free a chain of page-arrays */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr, *tmp;
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+		list_del(&pgarr->list);
+		cr_pgarr_free_one(pgarr);
+	}
+}
+
+/* allocate a single page-array object */
+static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+
+	pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
+				GFP_KERNEL);
+	if (!pgarr->vaddrs)
+		goto nomem;
+
+	/* pgarr->pages is needed only for checkpoint */
+	if (flags & CR_CTX_CKPT) {
+		pgarr->pages = kmalloc(CR_PGARR_TOTAL * sizeof(struct page *),
+				       GFP_KERNEL);
+		if (!pgarr->pages)
+			goto nomem;
+	}
+
+	return pgarr;
+
+ nomem:
+	cr_pgarr_free_one(pgarr);
+	return NULL;
+}
+
+/* cr_pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the first page-array in the list that has space. Extends the
+ * list if none has space.
+ */
+struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = cr_pgarr_first(ctx);
+	if (pgarr && !cr_pgarr_is_full(pgarr))
+		goto out;
+	pgarr = cr_pgarr_alloc_one(ctx->flags);
+	if (!pgarr)
+		goto out;
+	list_add(&pgarr->list, &ctx->pgarr_list);
+ out:
+	return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+void cr_pgarr_reset_all(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
+		cr_pgarr_release_pages(pgarr);
+		pgarr->nr_used = 0;
+	}
+}
+
+/*
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * cr_private_follow_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that corresponds to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ *
+ * This function should _only_ be called for private vma's.
+ */
+static struct page *
+cr_private_follow_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct page *page;
+
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	/*
+	 * simplified version of get_user_pages(): already have vma,
+	 * only need FOLL_ANON, and (for now) ignore fault stats.
+	 *
+	 * follow_page() will return NULL if the page is not present
+	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+	 * the actual page pointer otherwise.
+	 *
+	 * FIXME: consolidate with get_user_pages()
+	 */
+
+	cond_resched();
+	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+		int ret;
+
+		/* the page is swapped out - bring it in (optimize ?) */
+		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+		if (ret & VM_FAULT_ERROR) {
+			if (ret & VM_FAULT_OOM)
+				return ERR_PTR(-ENOMEM);
+			else if (ret & VM_FAULT_SIGBUS)
+				return ERR_PTR(-EFAULT);
+			else
+				BUG();
+			break;
+		}
+		cond_resched();
+	}
+
+	if (IS_ERR(page))
+		return page;
+
+	/*
+	 * We only care about dirty pages: either non-zero page, or
+	 * file-backed (copy-on-write) that were touched. For the latter,
+	 * the page_mapping() will be unset because it will no longer be
+	 * mapped to the original file after having been modified.
+	 */
+	if (page == ZERO_PAGE(0)) {
+		/* this is the zero page: ignore */
+		page_cache_release(page);
+		page = NULL;
+	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
+		/* file backed clean cow: ignore */
+		page_cache_release(page);
+		page = NULL;
+	}
+
+	return page;
+}
+
+/**
+ * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @pgarr - page-array to fill
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected
+ */
+static int
+cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
+			  struct vm_area_struct *vma, unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	int orig_used = pgarr->nr_used;
+
+	/* this function is only for private memory (anon or file-mapped) */
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	while (addr < end) {
+		struct page *page;
+
+		page = cr_private_follow_page(vma, addr);
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+
+		if (page) {
+			pgarr->pages[pgarr->nr_used] = page;
+			pgarr->vaddrs[pgarr->nr_used] = addr;
+			pgarr->nr_used++;
+		}
+
+		addr += PAGE_SIZE;
+
+		if (cr_pgarr_is_full(pgarr))
+			break;
+	}
+
+	*start = addr;
+	return pgarr->nr_used - orig_used;
+}
+
+/* dump contents of a page: use kmap_atomic() to avoid TLB flush */
+static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return cr_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
+{
+	struct cr_pgarr *pgarr;
+	char *buf;
+	int i, ret = 0;
+
+	if (!total)
+		return 0;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		ret = cr_kwrite(ctx, pgarr->vaddrs,
+				pgarr->nr_used * sizeof(*pgarr->vaddrs));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		for (i = 0; i < pgarr->nr_used; i++) {
+			ret = cr_page_write(ctx, pgarr->pages[i], buf);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * cr_write_private_vma_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect lists of pages that need to be dumped, and corresponding
+ * virtual addresses into ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int
+cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_pgarr *hh;
+	unsigned long addr = vma->vm_start;
+	struct cr_pgarr *pgarr;
+	unsigned long cnt = 0;
+	int ret;
+
+	/*
+	 * Work iteratively, collecting and dumping at most CR_PGARR_CHUNK
+	 * in each round. Each iteration is divided into two steps:
+	 *
+	 * (1) scan: scan through the PTEs of the vma to collect the pages
+	 * to dump (later we'll also make them COW), while keeping a list
+	 * of pages and their corresponding addresses on ctx->pgarr_list.
+	 *
+	 * (2) dump: write out a header specifying how many pages, followed
+	 * by the addresses of all pages in ctx->pgarr_list, followed by
+	 * the actual contents of all pages. (Then, release the references
+	 * to the pages and reset the page-array chain).
+	 *
+	 * (This split makes the logic simpler by first counting the pages
+	 * that need saving. More importantly, it allows for a future
+	 * optimization that will reduce application downtime by deferring
+	 * the actual write-out of the data to after the application is
+	 * allowed to resume execution).
+	 *
+	 * After dumping the entire contents, conclude with a header that
+	 * specifies 0 pages to mark the end of the contents.
+	 */
+
+	h.type = CR_HDR_PGARR;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	while (addr < vma->vm_end) {
+		pgarr = cr_pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		ret = cr_private_vma_fill_pgarr(ctx, pgarr, vma, &addr);
+		if (ret < 0)
+			return ret;
+		cnt += ret;
+
+		/* did we complete a chunk, or is this the last chunk ? */
+		if (cnt >= CR_PGARR_CHUNK || (cnt && addr == vma->vm_end)) {
+			hh = cr_hbuf_get(ctx, sizeof(*hh));
+			hh->nr_pages = cnt;
+			ret = cr_write_obj(ctx, &h, hh);
+			cr_hbuf_put(ctx, sizeof(*hh));
+			if (ret < 0)
+				return ret;
+
+			ret = cr_vma_dump_pages(ctx, cnt);
+			if (ret < 0)
+				return ret;
+
+			cr_pgarr_reset_all(ctx);
+		}
+	}
+
+	/* mark end of contents with header saying "0" pages */
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	hh->nr_pages = 0;
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
+
+static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int vma_type, ret;
+
+	h.type = CR_HDR_VMA;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->vm_start = vma->vm_start;
+	hh->vm_end = vma->vm_end;
+	hh->vm_page_prot = vma->vm_page_prot.pgprot;
+	hh->vm_flags = vma->vm_flags;
+	hh->vm_pgoff = vma->vm_pgoff;
+
+#define CR_BAD_VM_FLAGS  \
+	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
+
+	if (vma->vm_flags & CR_BAD_VM_FLAGS) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		cr_hbuf_put(ctx, sizeof(*hh));
+		return -ENOSYS;
+	}
+
+	/* by default assume anon memory */
+	vma_type = CR_VMA_ANON;
+
+	/*
+	 * if there is a backing file, assume private-mapped
+	 * (FIXME: check if the file is unlinked)
+	 */
+	if (vma->vm_file)
+		vma_type = CR_VMA_FILE;
+
+	hh->vma_type = vma_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	/* save the file name, if relevant */
+	if (vma->vm_file) {
+		ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);
+		if (ret < 0)
+			return ret;
+	}
+
+	return cr_write_private_vma_contents(ctx, vma);
+}
+
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int objref, ret;
+
+	h.type = CR_HDR_MM;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	mm = get_task_mm(t);
+
+	objref = 0;	/* will be meaningful with multiple processes */
+	hh->objref = objref;
+
+	down_read(&mm->mmap_sem);
+
+	hh->start_code = mm->start_code;
+	hh->end_code = mm->end_code;
+	hh->start_data = mm->start_data;
+	hh->end_data = mm->end_data;
+	hh->start_brk = mm->start_brk;
+	hh->brk = mm->brk;
+	hh->start_stack = mm->start_stack;
+	hh->arg_start = mm->arg_start;
+	hh->arg_end = mm->arg_end;
+	hh->env_start = mm->env_start;
+	hh->env_end = mm->env_end;
+
+	hh->map_count = mm->map_count;
+
+	/* FIX: need also mm->flags */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ret = cr_write_vma(ctx, vma);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_write_mm_context(ctx, mm, objref);
+
+ out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index bd14ef9..c547a1c 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -16,6 +16,8 @@
 #include <linux/capability.h>
 #include <linux/checkpoint.h>
 
+#include "checkpoint_mem.h"
+
 /*
  * Helpers to write(read) from(to) kernel space to(from) the checkpoint
  * image file descriptor (similar to how a core-dump is performed).
@@ -131,7 +133,13 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
+
 	kfree(ctx->hbuf);
+
+	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
+
+	cr_pgarr_free(ctx);
+
 	kfree(ctx);
 }
 
@@ -146,6 +154,8 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	INIT_LIST_HEAD(&ctx->pgarr_list);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 63f298f..4e97f9f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,6 +10,9 @@
  *  distribution for more details.
  */
 
+#include <linux/path.h>
+#include <linux/fs.h>
+
 #define CR_VERSION  1
 
 struct cr_ctx {
@@ -25,6 +28,10 @@ struct cr_ctx {
 
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
+
+	struct list_head pgarr_list;	/* page array to dump VMA contents */
+
+	struct path fs_mnt;	/* container root (FIXME) */
 };
 
 /* cr_ctx: flags */
@@ -42,6 +49,8 @@ struct cr_hdr;
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
 extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
 extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+extern int cr_write_fname(struct cr_ctx *ctx,
+			  struct path *path, struct path *root);
 
 extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
 extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
@@ -50,7 +59,10 @@ extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
 extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+extern int cr_read_mm(struct cr_ctx *ctx);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b74b5f9..d78f0f1 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -34,6 +34,7 @@ enum {
 	CR_HDR_HEAD_ARCH,
 	CR_HDR_BUFFER,
 	CR_HDR_STRING,
+	CR_HDR_FNAME,
 
 	CR_HDR_TASK = 101,
 	CR_HDR_THREAD,
@@ -41,6 +42,7 @@ enum {
 
 	CR_HDR_MM = 201,
 	CR_HDR_VMA,
+	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
 	CR_HDR_TAIL = 5001
@@ -75,4 +77,34 @@ struct cr_hdr_task {
 	__s32 task_comm_len;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 map_count;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vm_type {
+	CR_VMA_ANON = 1,
+	CR_VMA_FILE
+};
+
+struct cr_hdr_vma {
+	__u32 vma_type;
+	__u32 _padding;
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pgarr {
+	__u64 nr_pages;		/* number of pages to save */
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3


* [RFC v11][PATCH 05/13] Dump memory address space
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linus Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy

For each VMA, there is a 'struct cr_hdr_vma'; if the VMA is file-mapped,
it is followed by the file name. Then come the actual contents, in one
or more chunks: each chunk begins with a header that specifies how many
pages it holds, then the virtual addresses of all the dumped pages in
that chunk, followed by the actual contents of all dumped pages. A
header with zero pages marks the end of the contents. Then comes the
next VMA, and so on.
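
To make the per-VMA chunk layout concrete, here is a rough userspace
sketch that builds and then walks such a sequence of chunks in a memory
buffer. It is only an illustration, not part of the patch. It assumes
4 KiB pages and 64-bit virtual addresses, and it leaves out the generic
'struct cr_hdr' record that precedes each header in the real image, so
a chunk here is just <nr_pages, addresses, contents>. All sketch_*
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Append one chunk (count, all addresses, then all page contents). */
static size_t sketch_put_chunk(unsigned char *buf, size_t pos,
			       uint64_t nr_pages, uint64_t first_vaddr)
{
	uint64_t i, vaddr;

	memcpy(buf + pos, &nr_pages, sizeof(nr_pages));
	pos += sizeof(nr_pages);
	for (i = 0; i < nr_pages; i++) {
		vaddr = first_vaddr + i * SKETCH_PAGE_SIZE;
		memcpy(buf + pos, &vaddr, sizeof(vaddr));
		pos += sizeof(vaddr);
	}
	/* contents come after all addresses; use a dummy fill pattern */
	memset(buf + pos, 0xAB, nr_pages * SKETCH_PAGE_SIZE);
	return pos + nr_pages * SKETCH_PAGE_SIZE;
}

/*
 * Walk the chunks of one VMA. Returns the total number of pages seen,
 * or -1 if the buffer ends before the zero-page terminator.
 */
static long sketch_walk_vma_chunks(const unsigned char *buf, size_t len,
				   size_t *consumed)
{
	size_t pos = 0;
	long total = 0;

	for (;;) {
		uint64_t nr_pages;

		if (len - pos < sizeof(nr_pages))
			return -1;
		memcpy(&nr_pages, buf + pos, sizeof(nr_pages));
		pos += sizeof(nr_pages);

		if (nr_pages == 0)
			break;	/* header with zero pages: end of contents */

		/* each page costs one address plus one page of data */
		if (nr_pages > (len - pos) /
			       (sizeof(uint64_t) + SKETCH_PAGE_SIZE))
			return -1;

		pos += nr_pages * sizeof(uint64_t);	/* skip addresses */
		pos += nr_pages * SKETCH_PAGE_SIZE;	/* skip contents */
		total += (long)nr_pages;
	}

	*consumed = pos;
	return total;
}

/* Build two chunks (2 pages, 1 page) plus a terminator, then walk them. */
static long sketch_demo(void)
{
	size_t cap = 3 * sizeof(uint64_t) +
		     3 * (sizeof(uint64_t) + SKETCH_PAGE_SIZE);
	unsigned char *buf = malloc(cap);
	size_t pos = 0, consumed = 0;
	long total;

	if (!buf)
		return -1;
	pos = sketch_put_chunk(buf, pos, 2, 0x400000);
	pos = sketch_put_chunk(buf, pos, 1, 0x7f0000000000ULL);
	pos = sketch_put_chunk(buf, pos, 0, 0);	/* terminator */
	assert(pos == cap);

	total = sketch_walk_vma_chunks(buf, pos, &consumed);
	if (total >= 0 && consumed != pos)
		total = -1;
	free(buf);
	return total;
}
```

Calling sketch_demo() walks both chunks and stops at the terminator.
Keeping the addresses together ahead of the contents means a reader
can fetch all addresses of a chunk in one read before touching any
page data.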

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory

Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in cr_fill_fname()

Changelog[v9]:
  - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem
    namespace boundary); for now cr_fill_fname() fails the checkpoint.

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory dump code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages

Changelog[v4]:
  - Use standard list_... for cr_pgarr

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/checkpoint_hdr.h |    5 +
 arch/x86/mm/checkpoint.c              |   31 ++
 checkpoint/Makefile                   |    3 +-
 checkpoint/checkpoint.c               |   88 ++++++
 checkpoint/checkpoint_arch.h          |    2 +
 checkpoint/checkpoint_mem.h           |   41 +++
 checkpoint/ckpt_mem.c                 |  503 +++++++++++++++++++++++++++++++++
 checkpoint/sys.c                      |   10 +
 include/linux/checkpoint.h            |   12 +
 include/linux/checkpoint_hdr.h        |   32 ++
 10 files changed, 726 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/checkpoint_mem.h
 create mode 100644 checkpoint/ckpt_mem.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 6325062..33f4c70 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -82,4 +82,9 @@ struct cr_hdr_cpu {
 	/* thread_xstate contents follow (if used_math) */
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm_context {
+	__s16 ldt_entry_size;
+	__s16 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 8dd6d2d..757936e 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -221,3 +221,34 @@ int cr_write_head_arch(struct cr_ctx *ctx)
 
 	return ret;
 }
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_MM_CONTEXT;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	mutex_lock(&mm->context.lock);
+
+	hh->ldt_entry_size = LDT_ENTRY_SIZE;
+	hh->nldt = mm->context.size;
+
+	cr_debug("nldt %d\n", hh->nldt);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	ret = cr_kwrite(ctx, mm->context.ldt,
+			mm->context.size * LDT_ENTRY_SIZE);
+
+ out:
+	mutex_unlock(&mm->context.lock);
+	return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index d2df68c..3a0df6d 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+		ckpt_mem.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 17cc8d2..56d0ec2 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -13,6 +13,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -75,6 +76,66 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
 	return cr_write_obj(ctx, &h, str);
 }
 
+/**
+ * cr_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ */
+static char *
+cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *n);
+	spin_unlock(&dcache_lock);
+	if (!IS_ERR(fname))
+		*n = (buf + (*n) - fname);
+	/*
+	 * FIXME: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * cr_write_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
+{
+	struct cr_hdr h;
+	char *buf, *fname;
+	int ret, flen;
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = cr_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		h.type = CR_HDR_FNAME;
+		h.len = flen;
+		h.parent = 0;
+		ret = cr_write_obj(ctx, &h, fname);
+	} else
+		ret = PTR_ERR(fname);
+
+	kfree(buf);
+	return ret;
+}
+
 /* write the checkpoint header */
 static int cr_write_head(struct cr_ctx *ctx)
 {
@@ -168,6 +229,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_mm(ctx, t);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
@@ -178,10 +243,33 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	struct fs_struct *fs;
+
+	ctx->root_pid = pid;
+
+	/*
+	 * assume checkpointer is in container's root vfs
+	 * FIXME: this works for now, but will change with real containers
+	 */
+
+	fs = current->fs;
+	read_lock(&fs->lock);
+	ctx->fs_mnt = fs->root;
+	path_get(&ctx->fs_mnt);
+	read_unlock(&fs->lock);
+
+	return 0;
+}
+
 int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_ctx_checkpoint(ctx, pid);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index ada1369..f06c7eb 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -3,6 +3,8 @@
 extern int cr_write_head_arch(struct cr_ctx *ctx);
 extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
 extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_mm_context(struct cr_ctx *ctx,
+			       struct mm_struct *mm, int parent);
 
 extern int cr_read_head_arch(struct cr_ctx *ctx);
 extern int cr_read_thread(struct cr_ctx *ctx);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
new file mode 100644
index 0000000..85546f4
--- /dev/null
+++ b/checkpoint/checkpoint_mem.h
@@ -0,0 +1,41 @@
+#ifndef _CHECKPOINT_CKPT_MEM_H_
+#define _CHECKPOINT_CKPT_MEM_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/mm_types.h>
+
+/*
+ * page-array chains: each cr_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct cr_pgarr {
+	unsigned long *vaddrs;
+	struct page **pages;
+	unsigned int nr_used;
+	struct list_head list;
+};
+
+#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
+#define CR_PGARR_CHUNK  (4 * CR_PGARR_TOTAL)
+
+extern void cr_pgarr_free(struct cr_ctx *ctx);
+extern struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx);
+extern void cr_pgarr_reset_all(struct cr_ctx *ctx);
+
+static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
+{
+	return (pgarr->nr_used == CR_PGARR_TOTAL);
+}
+
+#endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
new file mode 100644
index 0000000..a2fcdbf
--- /dev/null
+++ b/checkpoint/ckpt_mem.c
@@ -0,0 +1,503 @@
+/*
+ *  Checkpoint memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has one member for page-arrays:
+ *   ctx->pgarr_list: list head of the page-array chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * This "current" page-array advances as necessary, and new page-array
+ * descriptors are allocated on-demand. Before the next chunk of pages,
+ * the chain is reset but not freed (that is, page references are dropped).
+ */
+
+/* return first page-array in the chain */
+static inline struct cr_pgarr *cr_pgarr_first(struct cr_ctx *ctx)
+{
+	if (list_empty(&ctx->pgarr_list))
+		return NULL;
+	return list_first_entry(&ctx->pgarr_list, struct cr_pgarr, list);
+}
+
+/* release pages referenced by a page-array */
+static void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
+{
+	int i;
+
+	cr_debug("nr_used %d\n", pgarr->nr_used);
+	/*
+	 * although both checkpoint and restart use 'nr_used', we only
+	 * collect pages during checkpoint; in restart we simply return
+	 */
+	if (!pgarr->pages)
+		return;
+	for (i = pgarr->nr_used; i--; /**/)
+		page_cache_release(pgarr->pages[i]);
+}
+
+/* free a single page-array object */
+static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
+{
+	cr_pgarr_release_pages(pgarr);
+	kfree(pgarr->pages);
+	kfree(pgarr->vaddrs);
+	kfree(pgarr);
+}
+
+/* free a chain of page-arrays */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr, *tmp;
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+		list_del(&pgarr->list);
+		cr_pgarr_free_one(pgarr);
+	}
+}
+
+/* allocate a single page-array object */
+static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+
+	pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
+				GFP_KERNEL);
+	if (!pgarr->vaddrs)
+		goto nomem;
+
+	/* pgarr->pages is needed only for checkpoint */
+	if (flags & CR_CTX_CKPT) {
+		pgarr->pages = kmalloc(CR_PGARR_TOTAL * sizeof(struct page *),
+				       GFP_KERNEL);
+		if (!pgarr->pages)
+			goto nomem;
+	}
+
+	return pgarr;
+
+ nomem:
+	cr_pgarr_free_one(pgarr);
+	return NULL;
+}
+
+/* cr_pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the first page-array in the list that has space. Extends the
+ * list if none has space.
+ */
+struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = cr_pgarr_first(ctx);
+	if (pgarr && !cr_pgarr_is_full(pgarr))
+		goto out;
+	pgarr = cr_pgarr_alloc_one(ctx->flags);
+	if (!pgarr)
+		goto out;
+	list_add(&pgarr->list, &ctx->pgarr_list);
+ out:
+	return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+void cr_pgarr_reset_all(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
+		cr_pgarr_release_pages(pgarr);
+		pgarr->nr_used = 0;
+	}
+}
+
+/*
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * cr_private_follow_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that corresponds to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ *
+ * This function should _only_ be called for private vma's.
+ */
+static struct page *
+cr_private_follow_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct page *page;
+
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	/*
+	 * simplified version of get_user_pages(): already have vma,
+	 * only need FOLL_ANON, and (for now) ignore fault stats.
+	 *
+	 * follow_page() will return NULL if the page is not present
+	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+	 * the actual page pointer otherwise.
+	 *
+	 * FIXME: consolidate with get_user_pages()
+	 */
+
+	cond_resched();
+	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+		int ret;
+
+		/* the page is swapped out - bring it in (optimize ?) */
+		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+		if (ret & VM_FAULT_ERROR) {
+			if (ret & VM_FAULT_OOM)
+				return ERR_PTR(-ENOMEM);
+			else if (ret & VM_FAULT_SIGBUS)
+				return ERR_PTR(-EFAULT);
+			else
+				BUG();
+			break;
+		}
+		cond_resched();
+	}
+
+	if (IS_ERR(page))
+		return page;
+
+	/*
+	 * We only care about dirty pages: either non-zero page, or
+	 * file-backed (copy-on-write) that were touched. For the latter,
+	 * the page_mapping() will be unset because it will no longer be
+	 * mapped to the original file after having been modified.
+	 */
+	if (page == ZERO_PAGE(0)) {
+		/* this is the zero page: ignore */
+		page_cache_release(page);
+		page = NULL;
+	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
+		/* file backed clean cow: ignore */
+		page_cache_release(page);
+		page = NULL;
+	}
+
+	return page;
+}
+
+/**
+ * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @pgarr - page-array to fill
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected
+ */
+static int
+cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
+			  struct vm_area_struct *vma, unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	int orig_used = pgarr->nr_used;
+
+	/* this function is only for private memory (anon or file-mapped) */
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	while (addr < end) {
+		struct page *page;
+
+		page = cr_private_follow_page(vma, addr);
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+
+		if (page) {
+			pgarr->pages[pgarr->nr_used] = page;
+			pgarr->vaddrs[pgarr->nr_used] = addr;
+			pgarr->nr_used++;
+		}
+
+		addr += PAGE_SIZE;
+
+		if (cr_pgarr_is_full(pgarr))
+			break;
+	}
+
+	*start = addr;
+	return pgarr->nr_used - orig_used;
+}
+
+/* dump contents of a page: use kmap_atomic() to avoid TLB flush */
+static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return cr_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
+{
+	struct cr_pgarr *pgarr;
+	char *buf;
+	int i, ret = 0;
+
+	if (!total)
+		return 0;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		ret = cr_kwrite(ctx, pgarr->vaddrs,
+				pgarr->nr_used * sizeof(*pgarr->vaddrs));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		for (i = 0; i < pgarr->nr_used; i++) {
+			ret = cr_page_write(ctx, pgarr->pages[i], buf);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * cr_write_private_vma_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect lists of pages that need to be dumped, and corresponding
+ * virtual addresses into ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int
+cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_pgarr *hh;
+	unsigned long addr = vma->vm_start;
+	struct cr_pgarr *pgarr;
+	unsigned long cnt = 0;
+	int ret;
+
+	/*
+	 * Work iteratively, collecting and dumping at most CR_PGARR_CHUNK
+	 * pages in each round. Each iteration is divided into two steps:
+	 *
+	 * (1) scan: scan through the PTEs of the vma to collect the pages
+	 * to dump (later we'll also make them COW), while keeping a list
+	 * of pages and their corresponding addresses on ctx->pgarr_list.
+	 *
+	 * (2) dump: write out a header specifying how many pages, followed
+	 * by the addresses of all pages in ctx->pgarr_list, followed by
+	 * the actual contents of all pages. (Then, release the references
+	 * to the pages and reset the page-array chain).
+	 *
+	 * (This split makes the logic simpler by first counting the pages
+	 * that need saving. More importantly, it allows for a future
+	 * optimization that will reduce application downtime by deferring
+	 * the actual write-out of the data to after the application is
+	 * allowed to resume execution).
+	 *
+	 * After dumping the entire contents, conclude with a header that
+	 * specifies 0 pages to mark the end of the contents.
+	 */
+
+	h.type = CR_HDR_PGARR;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	while (addr < vma->vm_end) {
+		pgarr = cr_pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		ret = cr_private_vma_fill_pgarr(ctx, pgarr, vma, &addr);
+		if (ret < 0)
+			return ret;
+		cnt += ret;
+
+		/* did we complete a chunk, or is this the last chunk ? */
+		if (cnt >= CR_PGARR_CHUNK || (cnt && addr == vma->vm_end)) {
+			hh = cr_hbuf_get(ctx, sizeof(*hh));
+			hh->nr_pages = cnt;
+			ret = cr_write_obj(ctx, &h, hh);
+			cr_hbuf_put(ctx, sizeof(*hh));
+			if (ret < 0)
+				return ret;
+
+			ret = cr_vma_dump_pages(ctx, cnt);
+			if (ret < 0)
+				return ret;
+
+			cr_pgarr_reset_all(ctx);
+		}
+	}
+
+	/* mark end of contents with header saying "0" pages */
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	hh->nr_pages = 0;
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
+
+static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int vma_type, ret;
+
+	h.type = CR_HDR_VMA;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->vm_start = vma->vm_start;
+	hh->vm_end = vma->vm_end;
+	hh->vm_page_prot = vma->vm_page_prot.pgprot;
+	hh->vm_flags = vma->vm_flags;
+	hh->vm_pgoff = vma->vm_pgoff;
+
+#define CR_BAD_VM_FLAGS  \
+	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
+
+	if (vma->vm_flags & CR_BAD_VM_FLAGS) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		cr_hbuf_put(ctx, sizeof(*hh));
+		return -ENOSYS;
+	}
+
+	/* by default assume anon memory */
+	vma_type = CR_VMA_ANON;
+
+	/*
+	 * if there is a backing file, assume private-mapped
+	 * (FIXME: check if the file is unlinked)
+	 */
+	if (vma->vm_file)
+		vma_type = CR_VMA_FILE;
+
+	hh->vma_type = vma_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	/* save the file name, if relevant */
+	if (vma->vm_file) {
+		ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);
+		if (ret < 0)
+			return ret;
+	}
+
+	return cr_write_private_vma_contents(ctx, vma);
+}
+
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int objref, ret;
+
+	h.type = CR_HDR_MM;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	mm = get_task_mm(t);
+
+	objref = 0;	/* will be meaningful with multiple processes */
+	hh->objref = objref;
+
+	down_read(&mm->mmap_sem);
+
+	hh->start_code = mm->start_code;
+	hh->end_code = mm->end_code;
+	hh->start_data = mm->start_data;
+	hh->end_data = mm->end_data;
+	hh->start_brk = mm->start_brk;
+	hh->brk = mm->brk;
+	hh->start_stack = mm->start_stack;
+	hh->arg_start = mm->arg_start;
+	hh->arg_end = mm->arg_end;
+	hh->env_start = mm->env_start;
+	hh->env_end = mm->env_end;
+
+	hh->map_count = mm->map_count;
+
+	/* FIX: need also mm->flags */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ret = cr_write_vma(ctx, vma);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_write_mm_context(ctx, mm, objref);
+
+ out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index bd14ef9..c547a1c 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -16,6 +16,8 @@
 #include <linux/capability.h>
 #include <linux/checkpoint.h>
 
+#include "checkpoint_mem.h"
+
 /*
  * Helpers to write(read) from(to) kernel space to(from) the checkpoint
  * image file descriptor (similar to how a core-dump is performed).
@@ -131,7 +133,13 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
+
 	kfree(ctx->hbuf);
+
+	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
+
+	cr_pgarr_free(ctx);
+
 	kfree(ctx);
 }
 
@@ -146,6 +154,8 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	INIT_LIST_HEAD(&ctx->pgarr_list);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 63f298f..4e97f9f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,6 +10,9 @@
  *  distribution for more details.
  */
 
+#include <linux/path.h>
+#include <linux/fs.h>
+
 #define CR_VERSION  1
 
 struct cr_ctx {
@@ -25,6 +28,10 @@ struct cr_ctx {
 
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
+
+	struct list_head pgarr_list;	/* page array to dump VMA contents */
+
+	struct path fs_mnt;	/* container root (FIXME) */
 };
 
 /* cr_ctx: flags */
@@ -42,6 +49,8 @@ struct cr_hdr;
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
 extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
 extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+extern int cr_write_fname(struct cr_ctx *ctx,
+			  struct path *path, struct path *root);
 
 extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
 extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
@@ -50,7 +59,10 @@ extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
 extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+extern int cr_read_mm(struct cr_ctx *ctx);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b74b5f9..d78f0f1 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -34,6 +34,7 @@ enum {
 	CR_HDR_HEAD_ARCH,
 	CR_HDR_BUFFER,
 	CR_HDR_STRING,
+	CR_HDR_FNAME,
 
 	CR_HDR_TASK = 101,
 	CR_HDR_THREAD,
@@ -41,6 +42,7 @@ enum {
 
 	CR_HDR_MM = 201,
 	CR_HDR_VMA,
+	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
 	CR_HDR_TAIL = 5001
@@ -75,4 +77,34 @@ struct cr_hdr_task {
 	__s32 task_comm_len;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 map_count;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vm_type {
+	CR_VMA_ANON = 1,
+	CR_VMA_FILE
+};
+
+struct cr_hdr_vma {
+	__u32 vma_type;
+	__u32 _padding;
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pgarr {
+	__u64 nr_pages;		/* number of pages saved */
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3
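
The chunked layout described in the comment of
cr_write_private_vma_contents() above (a page count, then all addresses,
then all page contents, closed by a count of zero) can be sketched in
user space roughly as follows. This is only an illustration under stated
assumptions: the demo_* names, the in-memory sink and DEMO_PAGE_SIZE are
hypothetical, and the generic struct cr_hdr that precedes each
cr_hdr_pgarr in the real image is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u	/* assumption: stand-in for PAGE_SIZE */

/* Hypothetical in-memory sink standing in for the checkpoint image. */
struct demo_image {
	unsigned char *buf;
	size_t len;	/* bytes written so far */
	size_t cap;	/* capacity of buf */
};

/* Append raw bytes; returns 0 on success, -1 if the sink is full. */
static int demo_write(struct demo_image *img, const void *data, size_t n)
{
	assert(img && img->buf);
	if (n == 0)
		return 0;
	if (n > img->cap - img->len)
		return -1;
	memcpy(img->buf + img->len, data, n);
	img->len += n;
	return 0;
}

/*
 * Emit one chunk: the page count, then every address, then every page.
 * A chunk with nr == 0 carries no payload and marks the end of contents.
 */
static int demo_write_chunk(struct demo_image *img, const uint64_t *vaddrs,
			    const unsigned char *pages, uint64_t nr)
{
	if (demo_write(img, &nr, sizeof(nr)) < 0)
		return -1;
	if (demo_write(img, vaddrs, nr * sizeof(*vaddrs)) < 0)
		return -1;
	return demo_write(img, pages, nr * DEMO_PAGE_SIZE);
}
```

Writing all addresses of a chunk before any page data lets a reader know
every destination up front, which is what the restore side relies on.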


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 06/13] Restore memory address space
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2008-12-05 17:31   ` [RFC v11][PATCH 05/13] Dump memory address space Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 07/13] Infrastructure for shared objects Oren Laadan
                     ` (11 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoff() for each VMA and then read in the data.
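
As a rough user-space illustration of the "read in the data" step, the
sketch below walks chunks of <count, addresses, pages> until it meets a
count of zero, copying each page to its recorded address. It is not the
kernel code: the demo_* names are hypothetical, a plain buffer stands in
for the freshly mapped VMA, and the generic object header of the real
image is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u	/* assumption: stand-in for PAGE_SIZE */

/* Hypothetical cursor over an in-memory checkpoint image. */
struct demo_reader {
	const unsigned char *buf;
	size_t pos;
	size_t len;
};

/* Copy n bytes out of the image; returns -1 if the image is too short. */
static int demo_read(struct demo_reader *r, void *dst, size_t n)
{
	assert(r && r->buf);
	if (n > r->len - r->pos)
		return -1;
	memcpy(dst, r->buf + r->pos, n);
	r->pos += n;
	return 0;
}

/*
 * Restore pages into 'mem', which stands in for a VMA that starts at
 * virtual address 'base' and spans 'size' bytes. Returns the number of
 * pages restored, or -1 on a malformed image.
 */
static long demo_restore_vma(struct demo_reader *r, unsigned char *mem,
			     uint64_t base, uint64_t size)
{
	long total = 0;

	for (;;) {
		uint64_t nr, i;
		size_t addrs_pos;

		if (demo_read(r, &nr, sizeof(nr)) < 0)
			return -1;
		if (nr == 0)		/* end-of-contents marker */
			return total;

		/* all addresses of the chunk precede the page data */
		addrs_pos = r->pos;
		if (nr > (r->len - r->pos) / sizeof(uint64_t))
			return -1;
		r->pos += nr * sizeof(uint64_t);

		for (i = 0; i < nr; i++) {
			uint64_t vaddr;

			memcpy(&vaddr, r->buf + addrs_pos + i * sizeof(vaddr),
			       sizeof(vaddr));
			/* reject addresses outside the region */
			if (vaddr < base || vaddr % DEMO_PAGE_SIZE ||
			    size < DEMO_PAGE_SIZE ||
			    vaddr - base > size - DEMO_PAGE_SIZE)
				return -1;
			if (demo_read(r, mem + (vaddr - base),
				      DEMO_PAGE_SIZE) < 0)
				return -1;
			total++;
		}
	}
}
```

Pages that never appear in the image are simply left untouched, which
matches how clean or zero pages are skipped at checkpoint time.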

Changelog[v9]:
  - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()

Changelog[v4]:
  - Use standard list_... for cr_pgarr


Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/checkpoint_hdr.h |    5 +
 arch/x86/mm/restart.c                 |   64 ++++++-
 checkpoint/Makefile                   |    2 +-
 checkpoint/checkpoint_arch.h          |    2 +
 checkpoint/checkpoint_mem.h           |    5 +
 checkpoint/restart.c                  |   51 +++++
 checkpoint/rstr_mem.c                 |  384 +++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h            |    4 +
 8 files changed, 514 insertions(+), 3 deletions(-)
 create mode 100644 checkpoint/rstr_mem.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 33f4c70..d13db9b 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -87,4 +87,9 @@ struct cr_hdr_mm_context {
 	__s16 nldt;
 } __attribute__((aligned(8)));
 
+#ifdef __KERNEL__
+/* misc prototypes from kernel (not defined elsewhere) */
+asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
+#endif
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index 45ad790..ab68b2f 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -52,8 +52,10 @@ int cr_read_thread(struct cr_ctx *ctx)
 
 		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
 		desc = kmalloc(size, GFP_KERNEL);
-		if (!desc)
-			return -ENOMEM;
+		if (!desc) {
+			ret = -ENOMEM;
+			goto out;
+		}
 
 		ret = cr_kread(ctx, desc, size);
 		if (ret >= 0) {
@@ -230,3 +232,61 @@ int cr_read_head_arch(struct cr_ctx *ctx)
 
 	return ret;
 }
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int rparent)
+{
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int n, parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+	cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+	if (parent != rparent)
+		goto out;
+
+	if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
+		goto out;
+
+	/*
+	 * to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+	 */
+
+	for (n = 0; n < hh->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+
+		ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			goto out;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16);
+		info.limit = desc.limit0;
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+				     sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 3a0df6d..ac35033 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
-		ckpt_mem.o
+		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index f06c7eb..39c8224 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -9,3 +9,5 @@ extern int cr_write_mm_context(struct cr_ctx *ctx,
 extern int cr_read_head_arch(struct cr_ctx *ctx);
 extern int cr_read_thread(struct cr_ctx *ctx);
 extern int cr_read_cpu(struct cr_ctx *ctx);
+extern int cr_read_mm_context(struct cr_ctx *ctx,
+			      struct mm_struct *mm, int parent);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
index 85546f4..85a5cf3 100644
--- a/checkpoint/checkpoint_mem.h
+++ b/checkpoint/checkpoint_mem.h
@@ -38,4 +38,9 @@ static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
 	return (pgarr->nr_used == CR_PGARR_TOTAL);
 }
 
+static inline int cr_pgarr_nr_free(struct cr_pgarr *pgarr)
+{
+	return CR_PGARR_TOTAL - pgarr->nr_used;
+}
+
 #endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index d74d755..d90c28a 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -120,6 +120,44 @@ int cr_read_string(struct cr_ctx *ctx, char *str, int len)
 	return ret;
 }
 
+/**
+ * cr_read_fname - read a file name
+ * @ctx: checkpoint context
+ * @fname: buffer
+ * @flen: buffer length
+ */
+int cr_read_fname(struct cr_ctx *ctx, char *fname, int flen)
+{
+	return cr_read_buf_type(ctx, fname, &flen, CR_HDR_FNAME);
+}
+
+/**
+ * cr_read_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ * @mode: file mode
+ */
+struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode)
+{
+	struct file *file;
+	char *fname;
+	int ret;
+
+	fname = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!fname)
+		return ERR_PTR(-ENOMEM);
+
+	ret = cr_read_fname(ctx, fname, PATH_MAX);
+	cr_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
+	if (ret >= 0)
+		file = filp_open(fname, flags, mode);
+	else
+		file = ERR_PTR(ret);
+
+	kfree(fname);
+	return file;
+}
+
 /* read the checkpoint header */
 static int cr_read_head(struct cr_ctx *ctx)
 {
@@ -219,6 +257,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_mm(ctx);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
@@ -230,10 +272,19 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* setup restart-specific parts of ctx */
+static int cr_ctx_restart(struct cr_ctx *ctx)
+{
+	return 0;
+}
+
 int do_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_ctx_restart(ctx);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_head(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
new file mode 100644
index 0000000..6713f4f
--- /dev/null
+++ b/checkpoint/rstr_mem.c
@@ -0,0 +1,384 @@
+/*
+ *  Restart memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fcntl.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+
+/**
+ * cr_read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @nr_pages - number of addresses to read
+ */
+static int cr_read_pages_vaddrs(struct cr_ctx *ctx, unsigned long nr_pages)
+{
+	struct cr_pgarr *pgarr;
+	unsigned long *vaddrp;
+	int nr, ret;
+
+	while (nr_pages) {
+		pgarr = cr_pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = cr_pgarr_nr_free(pgarr);
+		if (nr > nr_pages)
+			nr = nr_pages;
+		vaddrp = &pgarr->vaddrs[pgarr->nr_used];
+		ret = cr_kread(ctx, vaddrp, nr * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+		pgarr->nr_used += nr;
+		nr_pages -= nr;
+	}
+	return 0;
+}
+
+static int cr_page_read(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+	int ret;
+
+	ret = cr_kread(ctx, buf, PAGE_SIZE);
+	if (ret < 0)
+		return ret;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(ptr, buf, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return 0;
+}
+
+/**
+ * cr_read_pages_contents - read in data of pages in page-array chain
+ * @ctx - restart context
+ */
+static int cr_read_pages_contents(struct cr_ctx *ctx)
+{
+	struct mm_struct *mm = current->mm;
+	struct cr_pgarr *pgarr;
+	unsigned long *vaddrs;
+	char *buf;
+	int i, ret = 0;
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		vaddrs = pgarr->vaddrs;
+		for (i = 0; i < pgarr->nr_used; i++) {
+			struct page *page;
+
+			ret = get_user_pages(current, mm, vaddrs[i],
+					     1, 1, 1, &page, NULL);
+			if (ret < 0)
+				goto out;
+
+			ret = cr_page_read(ctx, page, buf);
+			page_cache_release(page);
+
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	up_read(&mm->mmap_sem);
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * cr_read_private_vma_contents - restore contents of a VMA with private memory
+ * @ctx - restart context
+ *
+ * Reads a header that specifies how many pages will follow, then reads
+ * a list of virtual addresses into ctx->pgarr_list page-array chain,
+ * followed by the actual contents of the corresponding pages. Iterates
+ * these steps until reaching a header specifying "0" pages, which marks
+ * the end of the contents.
+ */
+static int cr_read_private_vma_contents(struct cr_ctx *ctx)
+{
+	struct cr_hdr_pgarr *hh;
+	unsigned long nr_pages;
+	int parent, ret = 0;
+
+	while (1) {
+		hh = cr_hbuf_get(ctx, sizeof(*hh));
+		parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_PGARR);
+		if (parent != 0) {
+			if (parent < 0)
+				ret = parent;
+			else
+				ret = -EINVAL;
+			cr_hbuf_put(ctx, sizeof(*hh));
+			break;
+		}
+
+		cr_debug("nr_pages %ld\n", (unsigned long) hh->nr_pages);
+
+		nr_pages = hh->nr_pages;
+		cr_hbuf_put(ctx, sizeof(*hh));
+
+		if (!nr_pages)
+			break;
+
+		ret = cr_read_pages_vaddrs(ctx, nr_pages);
+		if (ret < 0)
+			break;
+		ret = cr_read_pages_contents(ctx);
+		if (ret < 0)
+			break;
+		cr_pgarr_reset_all(ctx);
+	}
+
+	return ret;
+}
+
+/**
+ * cr_calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * cr_calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+	struct file *file = NULL;
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
+	if (parent < 0) {
+		ret = parent;
+		goto err;
+	} else if (parent != 0)
+		goto err;
+
+	cr_debug("vma %#lx-%#lx type %d\n", (unsigned long) hh->vm_start,
+		 (unsigned long) hh->vm_end, (int) hh->vma_type);
+
+	if (hh->vm_end < hh->vm_start)
+		goto err;
+
+	vm_start = hh->vm_start;
+	vm_pgoff = hh->vm_pgoff;
+	vm_size = hh->vm_end - hh->vm_start;
+	vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
+	vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
+
+	switch (hh->vma_type) {
+
+	case CR_VMA_ANON:		/* anonymous private mapping */
+		if (vm_flags & MAP_SHARED)
+			goto err;
+		/*
+		 * vm_pgoff for anonymous mapping is the "global" page
+		 * offset (namely from addr 0x0), so we force a zero
+		 */
+		vm_pgoff = 0;
+		break;
+
+	case CR_VMA_FILE:		/* private mapping from a file */
+		if (vm_flags & MAP_SHARED)
+			goto err;
+		/*
+		 * for a private mapping, opening the file read-only suffices
+		 */
+		file = cr_read_open_fname(ctx, O_RDONLY, 0);
+		if (IS_ERR(file)) {
+			ret = PTR_ERR(file);
+			goto err;
+		}
+		break;
+
+	default:
+		goto err;
+
+	}
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vm_start, vm_size,
+			     vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	/* the file (if opened) is now referenced by the vma */
+	if (file)
+		filp_close(file, NULL);
+
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	/*
+	 * CR_VMA_ANON: read in memory as is
+	 * CR_VMA_FILE: read in memory as is
+	 * (more to follow ...)
+	 */
+
+	switch (hh->vma_type) {
+	case CR_VMA_ANON:
+	case CR_VMA_FILE:
+		/* standard case: read the data into the memory */
+		ret = cr_read_private_vma_contents(ctx);
+		break;
+	}
+
+	if (ret < 0)
+		return ret;
+
+	cr_debug("vma retval %d\n", ret);
+	return 0;
+
+ err:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+		if (ret < 0) {
+			pr_debug("c/r: restart failed do_munmap (%d)\n", ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+int cr_read_mm(struct cr_ctx *ctx)
+{
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	int nr, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		goto out;
+#endif
+	cr_debug("map_count %d\n", hh->map_count);
+
+	/* XXX need more sanity checks */
+	if (hh->start_code > hh->end_code ||
+	    hh->start_data > hh->end_data || hh->map_count < 0)
+		goto out;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = cr_destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+	mm->start_code = hh->start_code;
+	mm->end_code = hh->end_code;
+	mm->start_data = hh->start_data;
+	mm->end_data = hh->end_data;
+	mm->start_brk = hh->start_brk;
+	mm->brk = hh->brk;
+	mm->start_stack = hh->start_stack;
+	mm->arg_start = hh->arg_start;
+	mm->arg_end = hh->arg_end;
+	mm->env_start = hh->env_start;
+	mm->env_end = hh->env_end;
+	up_write(&mm->mmap_sem);
+
+	/* FIX: need also mm->flags */
+
+	for (nr = hh->map_count; nr; nr--) {
+		ret = cr_read_vma(ctx, mm);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_read_mm_context(ctx, mm, hh->objref);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 4e97f9f..ab1b215 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -58,6 +58,10 @@ extern int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type);
 extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
 extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
 
+extern int cr_read_fname(struct cr_ctx *ctx, char *fname, int n);
+extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
+				       int flags, int mode);
+
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
 
-- 
1.5.4.3
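
The conversion done by cr_calc_map_prot_bits() and
cr_calc_map_flags_bits() in this patch can be tried out in user space
with a sketch along these lines. The DEMO_VM_* constants are local
stand-ins assumed to mirror the kernel's VM_* bits, and the demo_*
helpers are hypothetical; only the PROT_* and MAP_* names come from
<sys/mman.h>.

```c
#include <assert.h>
#include <sys/mman.h>

/* Assumed values mirroring the kernel's VM_* bits (local stand-ins). */
#define DEMO_VM_READ		0x01ul
#define DEMO_VM_WRITE		0x02ul
#define DEMO_VM_EXEC		0x04ul
#define DEMO_VM_MAYSHARE	0x80ul

/* Translate saved vm_flags into the prot argument expected by mmap(). */
static int demo_prot_bits(unsigned long vm_flags)
{
	int prot = PROT_NONE;

	if (vm_flags & DEMO_VM_READ)
		prot |= PROT_READ;
	if (vm_flags & DEMO_VM_WRITE)
		prot |= PROT_WRITE;
	if (vm_flags & DEMO_VM_EXEC)
		prot |= PROT_EXEC;
	return prot;
}

/* Translate saved vm_flags into mmap() flags; the address is always fixed. */
static int demo_map_bits(unsigned long vm_flags)
{
	int flags = MAP_FIXED;

	flags |= (vm_flags & DEMO_VM_MAYSHARE) ? MAP_SHARED : MAP_PRIVATE;
	return flags;
}

/* A private-only restore path would refuse anything mapped MAP_SHARED. */
static int demo_is_private(int map_flags)
{
	/* exactly one of the two sharing modes is expected to be set */
	assert((map_flags & (MAP_SHARED | MAP_PRIVATE)) != 0);
	return !(map_flags & MAP_SHARED);
}
```

Note that once the flags have been translated, any later test has to use
the MAP_* names; the VM_* and MAP_* namespaces do not share bit values.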

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 06/13] Restore memory address space
  2008-12-05 17:31 ` Oren Laadan
  (?)
@ 2008-12-05 17:31   ` Oren Laadan
  -1 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoff() for each VMA and then read in the data.

Changelog[v9]:
  - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()

Changelog[v4]:
  - Use standard list_... for cr_pgarr


Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/checkpoint_hdr.h |    5 +
 arch/x86/mm/restart.c                 |   64 ++++++-
 checkpoint/Makefile                   |    2 +-
 checkpoint/checkpoint_arch.h          |    2 +
 checkpoint/checkpoint_mem.h           |    5 +
 checkpoint/restart.c                  |   51 +++++
 checkpoint/rstr_mem.c                 |  384 +++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h            |    4 +
 8 files changed, 514 insertions(+), 3 deletions(-)
 create mode 100644 checkpoint/rstr_mem.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 33f4c70..d13db9b 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -87,4 +87,9 @@ struct cr_hdr_mm_context {
 	__s16 nldt;
 } __attribute__((aligned(8)));
 
+#ifdef __KERNEL__
+/* misc prototypes from kernel (not defined elsewhere) */
+asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
+#endif
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index 45ad790..ab68b2f 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -52,8 +52,10 @@ int cr_read_thread(struct cr_ctx *ctx)
 
 		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
 		desc = kmalloc(size, GFP_KERNEL);
-		if (!desc)
-			return -ENOMEM;
+		if (!desc) {
+			ret = -ENOMEM;
+			goto out;
+		}
 
 		ret = cr_kread(ctx, desc, size);
 		if (ret >= 0) {
@@ -230,3 +232,61 @@ int cr_read_head_arch(struct cr_ctx *ctx)
 
 	return ret;
 }
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int rparent)
+{
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int n, parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+	cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+	if (parent != rparent)
+		goto out;
+
+	if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
+		goto out;
+
+	/*
+	 * to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+	 */
+
+	for (n = 0; n < hh->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+
+		ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			goto out;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16) | (desc.base2 << 24);
+		info.limit = desc.limit0 | (desc.limit << 16);
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+				     sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 3a0df6d..ac35033 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
-		ckpt_mem.o
+		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index f06c7eb..39c8224 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -9,3 +9,5 @@ extern int cr_write_mm_context(struct cr_ctx *ctx,
 extern int cr_read_head_arch(struct cr_ctx *ctx);
 extern int cr_read_thread(struct cr_ctx *ctx);
 extern int cr_read_cpu(struct cr_ctx *ctx);
+extern int cr_read_mm_context(struct cr_ctx *ctx,
+			      struct mm_struct *mm, int parent);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
index 85546f4..85a5cf3 100644
--- a/checkpoint/checkpoint_mem.h
+++ b/checkpoint/checkpoint_mem.h
@@ -38,4 +38,9 @@ static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
 	return (pgarr->nr_used == CR_PGARR_TOTAL);
 }
 
+static inline int cr_pgarr_nr_free(struct cr_pgarr *pgarr)
+{
+	return CR_PGARR_TOTAL - pgarr->nr_used;
+}
+
 #endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index d74d755..d90c28a 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -120,6 +120,44 @@ int cr_read_string(struct cr_ctx *ctx, char *str, int len)
 	return ret;
 }
 
+/**
+ * cr_read_fname - read a file name
+ * @ctx: checkpoint context
+ * @fname: buffer
+ * @flen: buffer length
+ */
+int cr_read_fname(struct cr_ctx *ctx, char *fname, int flen)
+{
+	return cr_read_buf_type(ctx, fname, &flen, CR_HDR_FNAME);
+}
+
+/**
+ * cr_read_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ * @mode: file mode
+ */
+struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode)
+{
+	struct file *file;
+	char *fname;
+	int ret;
+
+	fname = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!fname)
+		return ERR_PTR(-ENOMEM);
+
+	ret = cr_read_fname(ctx, fname, PATH_MAX);
+	cr_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
+	if (ret >= 0)
+		file = filp_open(fname, flags, mode);
+	else
+		file = ERR_PTR(ret);
+
+	kfree(fname);
+	return file;
+}
+
 /* read the checkpoint header */
 static int cr_read_head(struct cr_ctx *ctx)
 {
@@ -219,6 +257,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_mm(ctx);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
@@ -230,10 +272,19 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* setup restart-specific parts of ctx */
+static int cr_ctx_restart(struct cr_ctx *ctx)
+{
+	return 0;
+}
+
 int do_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_ctx_restart(ctx);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_head(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
new file mode 100644
index 0000000..6713f4f
--- /dev/null
+++ b/checkpoint/rstr_mem.c
@@ -0,0 +1,384 @@
+/*
+ *  Restart memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fcntl.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+
+/**
+ * cr_read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx: restart context
+ * @nr_pages: number of addresses to read
+ */
+static int cr_read_pages_vaddrs(struct cr_ctx *ctx, unsigned long nr_pages)
+{
+	struct cr_pgarr *pgarr;
+	unsigned long *vaddrp;
+	int nr, ret;
+
+	while (nr_pages) {
+		pgarr = cr_pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = cr_pgarr_nr_free(pgarr);
+		if (nr > nr_pages)
+			nr = nr_pages;
+		vaddrp = &pgarr->vaddrs[pgarr->nr_used];
+		ret = cr_kread(ctx, vaddrp, nr * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+		pgarr->nr_used += nr;
+		nr_pages -= nr;
+	}
+	return 0;
+}
+
+static int cr_page_read(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+	int ret;
+
+	ret = cr_kread(ctx, buf, PAGE_SIZE);
+	if (ret < 0)
+		return ret;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(ptr, buf, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return 0;
+}
+
+/**
+ * cr_read_pages_contents - read in data of pages in page-array chain
+ * @ctx: restart context
+ */
+static int cr_read_pages_contents(struct cr_ctx *ctx)
+{
+	struct mm_struct *mm = current->mm;
+	struct cr_pgarr *pgarr;
+	unsigned long *vaddrs;
+	char *buf;
+	int i, ret = 0;
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		vaddrs = pgarr->vaddrs;
+		for (i = 0; i < pgarr->nr_used; i++) {
+			struct page *page;
+
+			ret = get_user_pages(current, mm, vaddrs[i],
+					     1, 1, 1, &page, NULL);
+			if (ret < 0)
+				goto out;
+
+			ret = cr_page_read(ctx, page, buf);
+			page_cache_release(page);
+
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	up_read(&mm->mmap_sem);
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * cr_read_private_vma_contents - restore contents of a VMA with private memory
+ * @ctx: restart context
+ *
+ * Reads a header that specifies how many pages will follow, then reads
+ * a list of virtual addresses into ctx->pgarr_list page-array chain,
+ * followed by the actual contents of the corresponding pages. Iterates
+ * these steps until reaching a header specifying "0" pages, which marks
+ * the end of the contents.
+ */
+static int cr_read_private_vma_contents(struct cr_ctx *ctx)
+{
+	struct cr_hdr_pgarr *hh;
+	unsigned long nr_pages;
+	int parent, ret = 0;
+
+	while (1) {
+		hh = cr_hbuf_get(ctx, sizeof(*hh));
+		parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_PGARR);
+		if (parent != 0) {
+			if (parent < 0)
+				ret = parent;
+			else
+				ret = -EINVAL;
+			cr_hbuf_put(ctx, sizeof(*hh));
+			break;
+		}
+
+		cr_debug("nr_pages %lu\n", (unsigned long) hh->nr_pages);
+
+		nr_pages = hh->nr_pages;
+		cr_hbuf_put(ctx, sizeof(*hh));
+
+		if (!nr_pages)
+			break;
+
+		ret = cr_read_pages_vaddrs(ctx, nr_pages);
+		if (ret < 0)
+			break;
+		ret = cr_read_pages_contents(ctx);
+		if (ret < 0)
+			break;
+		cr_pgarr_reset_all(ctx);
+	}
+
+	return ret;
+}
+
+/**
+ * cr_calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * cr_calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+	struct file *file = NULL;
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
+	if (parent < 0) {
+		ret = parent;
+		goto err;
+	} else if (parent != 0)
+		goto err;
+
+	cr_debug("vma %#lx-%#lx type %d\n", (unsigned long) hh->vm_start,
+		 (unsigned long) hh->vm_end, (int) hh->vma_type);
+
+	if (hh->vm_end < hh->vm_start)
+		goto err;
+
+	vm_start = hh->vm_start;
+	vm_pgoff = hh->vm_pgoff;
+	vm_size = hh->vm_end - hh->vm_start;
+	vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
+	vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
+
+	switch (hh->vma_type) {
+
+	case CR_VMA_ANON:		/* anonymous private mapping */
+		if (vm_flags & MAP_SHARED)
+			goto err;
+		/*
+		 * vm_pgoff for anonymous mapping is the "global" page
+		 * offset (namely from addr 0x0), so we force a zero
+		 */
+		vm_pgoff = 0;
+		break;
+
+	case CR_VMA_FILE:		/* private mapping from a file */
+		if (vm_flags & MAP_SHARED)
+			goto err;
+		/*
+		 * for private mapping using 'read-only' is sufficient
+		 */
+		file = cr_read_open_fname(ctx, O_RDONLY, 0);
+		if (IS_ERR(file)) {
+			ret = PTR_ERR(file);
+			goto err;
+		}
+		break;
+
+	default:
+		goto err;
+
+	}
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vm_start, vm_size,
+			     vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	/* the file (if opened) is now referenced by the vma */
+	if (file)
+		filp_close(file, NULL);
+
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	/*
+	 * CR_VMA_ANON: read in memory as is
+	 * CR_VMA_FILE: read in memory as is
+	 * (more to follow ...)
+	 */
+
+	switch (hh->vma_type) {
+	case CR_VMA_ANON:
+	case CR_VMA_FILE:
+		/* standard case: read the data into the memory */
+		ret = cr_read_private_vma_contents(ctx);
+		break;
+	}
+
+	if (ret < 0)
+		return ret;
+
+	cr_debug("vma retval %d\n", ret);
+	return 0;
+
+ err:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+		if (ret < 0) {
+			pr_debug("c/r: restart failed do_munmap (%d)\n", ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+int cr_read_mm(struct cr_ctx *ctx)
+{
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	int nr, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		goto out;
+#endif
+	cr_debug("map_count %d\n", hh->map_count);
+
+	/* XXX need more sanity checks */
+	if (hh->start_code > hh->end_code ||
+	    hh->start_data > hh->end_data || hh->map_count < 0)
+		goto out;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = cr_destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+	mm->start_code = hh->start_code;
+	mm->end_code = hh->end_code;
+	mm->start_data = hh->start_data;
+	mm->end_data = hh->end_data;
+	mm->start_brk = hh->start_brk;
+	mm->brk = hh->brk;
+	mm->start_stack = hh->start_stack;
+	mm->arg_start = hh->arg_start;
+	mm->arg_end = hh->arg_end;
+	mm->env_start = hh->env_start;
+	mm->env_end = hh->env_end;
+	up_write(&mm->mmap_sem);
+
+	/* FIX: need also mm->flags */
+
+	for (nr = hh->map_count; nr; nr--) {
+		ret = cr_read_vma(ctx, mm);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_read_mm_context(ctx, mm, hh->objref);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 4e97f9f..ab1b215 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -58,6 +58,10 @@ extern int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type);
 extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
 extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
 
+extern int cr_read_fname(struct cr_ctx *ctx, char *fname, int n);
+extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
+				       int flags, int mode);
+
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
 
-- 
1.5.4.3


* [RFC v11][PATCH 06/13] Restore memory address space
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoff() for each VMA and then read in the data.
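
As a rough user-space analogue of the "map each VMA, then read in the
data" step (a sketch only, not the kernel code in this patch): a private
anonymous region is recreated at a recorded address with MAP_FIXED and
then filled from a saved buffer. The names sk_restore_region() and
sk_pick_address() are hypothetical, and the saved buffer stands in for
the page data that the patch reads from the checkpoint image.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS with strict C modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Recreate one private anonymous region at a recorded address and fill
 * it from a saved buffer. Returns 0 on success, -1 on failure.
 */
static int sk_restore_region(void *addr, size_t size, const void *saved)
{
	void *p = mmap(addr, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	memcpy(p, saved, size);
	return 0;
}

/*
 * Find an address that is free right now, to play the role of the
 * vm_start recorded at checkpoint time. Returns NULL on failure.
 */
static void *sk_pick_address(size_t size)
{
	void *p = mmap(NULL, size, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	munmap(p, size);
	return p;
}
```

MAP_FIXED silently replaces whatever is already mapped at the target
address, which is why the restart path first tears down the existing
address space before recreating the saved regions.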

Changelog[v9]:
  - Introduce cr_ctx_restart() for restart-specific ctx setup

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()

Changelog[v4]:
  - Use standard list_... for cr_pgarr


Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/checkpoint_hdr.h |    5 +
 arch/x86/mm/restart.c                 |   64 ++++++-
 checkpoint/Makefile                   |    2 +-
 checkpoint/checkpoint_arch.h          |    2 +
 checkpoint/checkpoint_mem.h           |    5 +
 checkpoint/restart.c                  |   51 +++++
 checkpoint/rstr_mem.c                 |  384 +++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h            |    4 +
 8 files changed, 514 insertions(+), 3 deletions(-)
 create mode 100644 checkpoint/rstr_mem.c
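
The restore loop described above (a header giving a page count, then that
many addresses, then that many pages, repeated until a count of zero) can
be sketched in user space as follows. This is only an illustration under
simplifying assumptions, not the kernel code: struct image, img_read(),
restore_vma_contents(), PAGE_SZ and MAX_CHUNK are hypothetical names, the
"address space" is a flat byte array, and the header is reduced to a bare
32-bit page count.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ   16	/* toy page size (hypothetical) */
#define MAX_CHUNK 4	/* toy limit on addresses per chunk (hypothetical) */

/* Toy image reader: a cursor over an in-memory byte buffer. */
struct image {
	const unsigned char *buf;
	size_t len, pos;
};

static int img_read(struct image *img, void *dst, size_t n)
{
	if (img->len - img->pos < n)
		return -1;	/* truncated image */
	memcpy(dst, img->buf + img->pos, n);
	img->pos += n;
	return 0;
}

/*
 * Restore the contents of one VMA into 'mem', a flat toy address space
 * of 'mem_pages' pages.  Returns the number of pages restored, or -1.
 */
static long restore_vma_contents(struct image *img, unsigned char *mem,
				 size_t mem_pages)
{
	uint64_t vaddrs[MAX_CHUNK];
	long total = 0;

	for (;;) {
		uint32_t nr_pages, i;

		/* chunk header: how many <vaddr, page> pairs follow */
		if (img_read(img, &nr_pages, sizeof(nr_pages)) < 0)
			return -1;
		if (nr_pages == 0)
			return total;	/* a zero count ends the contents */
		if (nr_pages > MAX_CHUNK)
			return -1;

		/* first all the addresses of the chunk ... */
		if (img_read(img, vaddrs, nr_pages * sizeof(vaddrs[0])) < 0)
			return -1;

		/* ... then the page data, in the same order */
		for (i = 0; i < nr_pages; i++) {
			if (vaddrs[i] % PAGE_SZ ||
			    vaddrs[i] / PAGE_SZ >= mem_pages)
				return -1;
			if (img_read(img, mem + vaddrs[i], PAGE_SZ) < 0)
				return -1;
			total++;
		}
	}
}
```

In the patch itself the addresses are kept in the ctx->pgarr_list chain
rather than a fixed array, and each target page is pinned with
get_user_pages() and filled through kmap_atomic() instead of a plain copy.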

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 33f4c70..d13db9b 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -87,4 +87,9 @@ struct cr_hdr_mm_context {
 	__s16 nldt;
 } __attribute__((aligned(8)));
 
+#ifdef __KERNEL__
+/* misc prototypes from kernel (not defined elsewhere) */
+asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
+#endif
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index 45ad790..ab68b2f 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -52,8 +52,10 @@ int cr_read_thread(struct cr_ctx *ctx)
 
 		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
 		desc = kmalloc(size, GFP_KERNEL);
-		if (!desc)
-			return -ENOMEM;
+		if (!desc) {
+			ret = -ENOMEM;
+			goto out;
+		}
 
 		ret = cr_kread(ctx, desc, size);
 		if (ret >= 0) {
@@ -230,3 +232,61 @@ int cr_read_head_arch(struct cr_ctx *ctx)
 
 	return ret;
 }
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int rparent)
+{
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int n, parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+	cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+	if (parent != rparent)
+		goto out;
+
+	if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
+		goto out;
+
+	/*
+	 * to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+	 */
+
+	for (n = 0; n < hh->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+
+		ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			goto out;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16);
+		info.limit = desc.limit0;
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+				     sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 3a0df6d..ac35033 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
-		ckpt_mem.o
+		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index f06c7eb..39c8224 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -9,3 +9,5 @@ extern int cr_write_mm_context(struct cr_ctx *ctx,
 extern int cr_read_head_arch(struct cr_ctx *ctx);
 extern int cr_read_thread(struct cr_ctx *ctx);
 extern int cr_read_cpu(struct cr_ctx *ctx);
+extern int cr_read_mm_context(struct cr_ctx *ctx,
+			      struct mm_struct *mm, int parent);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
index 85546f4..85a5cf3 100644
--- a/checkpoint/checkpoint_mem.h
+++ b/checkpoint/checkpoint_mem.h
@@ -38,4 +38,9 @@ static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
 	return (pgarr->nr_used == CR_PGARR_TOTAL);
 }
 
+static inline int cr_pgarr_nr_free(struct cr_pgarr *pgarr)
+{
+	return CR_PGARR_TOTAL - pgarr->nr_used;
+}
+
 #endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index d74d755..d90c28a 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -120,6 +120,44 @@ int cr_read_string(struct cr_ctx *ctx, char *str, int len)
 	return ret;
 }
 
+/**
+ * cr_read_fname - read a file name
+ * @ctx: checkpoint context
+ * @fname: buffer
+ * @flen: buffer length
+ */
+int cr_read_fname(struct cr_ctx *ctx, char *fname, int flen)
+{
+	return cr_read_buf_type(ctx, fname, &flen, CR_HDR_FNAME);
+}
+
+/**
+ * cr_read_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ * @mode: file mode
+ */
+struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode)
+{
+	struct file *file;
+	char *fname;
+	int ret;
+
+	fname = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!fname)
+		return ERR_PTR(-ENOMEM);
+
+	ret = cr_read_fname(ctx, fname, PATH_MAX);
+	cr_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
+	if (ret >= 0)
+		file = filp_open(fname, flags, mode);
+	else
+		file = ERR_PTR(ret);
+
+	kfree(fname);
+	return file;
+}
+
 /* read the checkpoint header */
 static int cr_read_head(struct cr_ctx *ctx)
 {
@@ -219,6 +257,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_mm(ctx);
+	cr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
@@ -230,10 +272,19 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* setup restart-specific parts of ctx */
+static int cr_ctx_restart(struct cr_ctx *ctx)
+{
+	return 0;
+}
+
 int do_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_ctx_restart(ctx);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_head(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
new file mode 100644
index 0000000..6713f4f
--- /dev/null
+++ b/checkpoint/rstr_mem.c
@@ -0,0 +1,384 @@
+/*
+ *  Restart memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fcntl.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+
+/**
+ * cr_read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @nr_pages - number of addresses to read
+ */
+static int cr_read_pages_vaddrs(struct cr_ctx *ctx, unsigned long nr_pages)
+{
+	struct cr_pgarr *pgarr;
+	unsigned long *vaddrp;
+	int nr, ret;
+
+	while (nr_pages) {
+		pgarr = cr_pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = cr_pgarr_nr_free(pgarr);
+		if (nr > nr_pages)
+			nr = nr_pages;
+		vaddrp = &pgarr->vaddrs[pgarr->nr_used];
+		ret = cr_kread(ctx, vaddrp, nr * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+		pgarr->nr_used += nr;
+		nr_pages -= nr;
+	}
+	return 0;
+}
+
+static int cr_page_read(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+	int ret;
+
+	ret = cr_kread(ctx, buf, PAGE_SIZE);
+	if (ret < 0)
+		return ret;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(ptr, buf, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return 0;
+}
+
+/**
+ * cr_read_pages_contents - read in data of pages in page-array chain
+ * @ctx - restart context
+ */
+static int cr_read_pages_contents(struct cr_ctx *ctx)
+{
+	struct mm_struct *mm = current->mm;
+	struct cr_pgarr *pgarr;
+	unsigned long *vaddrs;
+	char *buf;
+	int i, ret = 0;
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		vaddrs = pgarr->vaddrs;
+		for (i = 0; i < pgarr->nr_used; i++) {
+			struct page *page;
+
+			ret = get_user_pages(current, mm, vaddrs[i],
+					     1, 1, 1, &page, NULL);
+			if (ret < 0)
+				goto out;
+
+			ret = cr_page_read(ctx, page, buf);
+			page_cache_release(page);
+
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	up_read(&mm->mmap_sem);
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * cr_read_private_vma_contents - restore contents of a VMA with private memory
+ * @ctx - restart context
+ *
+ * Reads a header that specifies how many pages will follow, then reads
+ * a list of virtual addresses into ctx->pgarr_list page-array chain,
+ * followed by the actual contents of the corresponding pages. Iterates
+ * these steps until reaching a header specifying "0" pages, which marks
+ * the end of the contents.
+ */
+static int cr_read_private_vma_contents(struct cr_ctx *ctx)
+{
+	struct cr_hdr_pgarr *hh;
+	unsigned long nr_pages;
+	int parent, ret = 0;
+
+	while (1) {
+		hh = cr_hbuf_get(ctx, sizeof(*hh));
+		parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_PGARR);
+		if (parent != 0) {
+			if (parent < 0)
+				ret = parent;
+			else
+				ret = -EINVAL;
+			cr_hbuf_put(ctx, sizeof(*hh));
+			break;
+		}
+
+		cr_debug("nr_pages %ld\n", (unsigned long) hh->nr_pages);
+
+		nr_pages = hh->nr_pages;
+		cr_hbuf_put(ctx, sizeof(*hh));
+
+		if (!nr_pages)
+			break;
+
+		ret = cr_read_pages_vaddrs(ctx, nr_pages);
+		if (ret < 0)
+			break;
+		ret = cr_read_pages_contents(ctx);
+		if (ret < 0)
+			break;
+		cr_pgarr_reset_all(ctx);
+	}
+
+	return ret;
+}
+
+/**
+ * cr_calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * cr_calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+	struct file *file = NULL;
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
+	if (parent < 0) {
+		ret = parent;
+		goto err;
+	} else if (parent != 0)
+		goto err;
+
+	cr_debug("vma %#lx-%#lx type %d\n", (unsigned long) hh->vm_start,
+		 (unsigned long) hh->vm_end, (int) hh->vma_type);
+
+	if (hh->vm_end < hh->vm_start)
+		goto err;
+
+	vm_start = hh->vm_start;
+	vm_pgoff = hh->vm_pgoff;
+	vm_size = hh->vm_end - hh->vm_start;
+	vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
+	vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
+
+	switch (hh->vma_type) {
+
+	case CR_VMA_ANON:		/* anonymous private mapping */
+		if (hh->vm_flags & VM_SHARED)
+			goto err;
+		/*
+		 * vm_pgoff for anonymous mapping is the "global" page
+		 * offset (namely from addr 0x0), so we force a zero
+		 */
+		vm_pgoff = 0;
+		break;
+
+	case CR_VMA_FILE:		/* private mapping from a file */
+		if (hh->vm_flags & VM_SHARED)
+			goto err;
+		/*
+		 * for private mapping using 'read-only' is sufficient
+		 */
+		file = cr_read_open_fname(ctx, O_RDONLY, 0);
+		if (IS_ERR(file)) {
+			ret = PTR_ERR(file);
+			goto err;
+		}
+		break;
+
+	default:
+		goto err;
+
+	}
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vm_start, vm_size,
+			     vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	/* the file (if opened) is now referenced by the vma */
+	if (file)
+		filp_close(file, NULL);
+
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	/*
+	 * CR_VMA_ANON: read in memory as is
+	 * CR_VMA_FILE: read in memory as is
+	 * (more to follow ...)
+	 */
+
+	switch (hh->vma_type) {
+	case CR_VMA_ANON:
+	case CR_VMA_FILE:
+		/* standard case: read the data into the memory */
+		ret = cr_read_private_vma_contents(ctx);
+		break;
+	}
+
+	if (ret < 0)
+		return ret;
+
+	cr_debug("vma retval %d\n", ret);
+	return 0;
+
+ err:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+		if (ret < 0) {
+			pr_debug("c/r: restart failed do_munmap (%d)\n", ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+int cr_read_mm(struct cr_ctx *ctx)
+{
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	int nr, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		goto out;
+#endif
+	cr_debug("map_count %d\n", hh->map_count);
+
+	/* XXX need more sanity checks */
+	if (hh->start_code > hh->end_code ||
+	    hh->start_data > hh->end_data || hh->map_count < 0)
+		goto out;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = cr_destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+	mm->start_code = hh->start_code;
+	mm->end_code = hh->end_code;
+	mm->start_data = hh->start_data;
+	mm->end_data = hh->end_data;
+	mm->start_brk = hh->start_brk;
+	mm->brk = hh->brk;
+	mm->start_stack = hh->start_stack;
+	mm->arg_start = hh->arg_start;
+	mm->arg_end = hh->arg_end;
+	mm->env_start = hh->env_start;
+	mm->env_end = hh->env_end;
+	up_write(&mm->mmap_sem);
+
+	/* FIX: need also mm->flags */
+
+	for (nr = hh->map_count; nr; nr--) {
+		ret = cr_read_vma(ctx, mm);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_read_mm_context(ctx, mm, hh->objref);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 4e97f9f..ab1b215 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -58,6 +58,10 @@ extern int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type);
 extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
 extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
 
+extern int cr_read_fname(struct cr_ctx *ctx, char *fname, int n);
+extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
+				       int flags, int mode);
+
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
 
-- 
1.5.4.3

* [RFC v11][PATCH 07/13] Infrastructure for shared objects
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2008-12-05 17:31   ` [RFC v11][PATCH 06/13] Restore " Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 08/13] Dump open file descriptors Oren Laadan
                     ` (10 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linux Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e.g. open files, memory address spaces,
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

The hash is "one-way": objects added to it are never deleted until the
hash is discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.

The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.

Changelog[v11]:
  - Doc: be explicit about grabbing a reference and object lifetime

Changelog[v4]:
  - Fix calculation of hash table size

Changelog[v3]:
  - Use standard hlist_... for hash table

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/objhash.c       |  278 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c           |    4 +
 include/linux/checkpoint.h |   20 +++
 4 files changed, 303 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/objhash.c
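
The two uses of the hash described above (keyed by pointer during
checkpoint, keyed by objref during restart) can be sketched in user space
as follows. This is a simplified illustration, not the code in this patch:
struct toy_obj stands in for a refcounted kernel object such as struct
file, a hand-rolled chained table replaces hlist, and all names here
(toy_obj, entry, hash_key, insert, objhash_clear and the un-prefixed
obj_* helpers) are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define NBITS    4
#define NBUCKETS (1u << NBITS)

struct toy_obj { int refcount; };	/* stand-in for struct file */

struct entry {
	int objref;
	struct toy_obj *ptr;
	struct entry *next;
};

struct objhash {
	struct entry *bucket[NBUCKETS];
	int next_free_objref;		/* starts at 1 */
};

static unsigned hash_key(uintptr_t key)
{
	return (unsigned)((key >> 4) ^ key) & (NBUCKETS - 1);
}

/* Add an entry under 'key'; the table takes its own reference. */
static struct entry *insert(struct objhash *h, uintptr_t key,
			    struct toy_obj *ptr, int objref)
{
	struct entry *e = malloc(sizeof(*e));
	unsigned i = hash_key(key);

	if (!e)
		return NULL;
	e->objref = objref;
	e->ptr = ptr;
	e->next = h->bucket[i];
	h->bucket[i] = e;
	ptr->refcount++;
	return e;
}

/* checkpoint: 1 if newly added, 0 if already known, -1 on error */
static int obj_add_ptr(struct objhash *h, struct toy_obj *ptr, int *objref)
{
	struct entry *e;

	for (e = h->bucket[hash_key((uintptr_t)ptr)]; e; e = e->next) {
		if (e->ptr == ptr) {
			*objref = e->objref;
			return 0;
		}
	}
	e = insert(h, (uintptr_t)ptr, ptr, h->next_free_objref);
	if (!e)
		return -1;
	h->next_free_objref++;
	*objref = e->objref;
	return 1;
}

/* restart: register the object that was created for 'objref' */
static int obj_add_ref(struct objhash *h, struct toy_obj *ptr, int objref)
{
	return insert(h, (uintptr_t)objref, ptr, objref) ? 0 : -1;
}

/* restart: look up a previously restored object, NULL if not seen yet */
static struct toy_obj *obj_get_by_ref(struct objhash *h, int objref)
{
	struct entry *e;

	for (e = h->bucket[hash_key((uintptr_t)objref)]; e; e = e->next)
		if (e->objref == objref)
			return e->ptr;
	return NULL;
}

/* Entries are only removed here, when the whole table is discarded. */
static void objhash_clear(struct objhash *h)
{
	unsigned i;

	for (i = 0; i < NBUCKETS; i++) {
		struct entry *e = h->bucket[i], *next;

		for (; e; e = next) {
			next = e->next;
			e->ptr->refcount--;	/* drop the table's reference */
			free(e);
		}
		h->bucket[i] = NULL;
	}
}
```

On checkpoint a caller would dump the full state only when obj_add_ptr()
returns 1 and write just the objref otherwise; on restart it would try
obj_get_by_ref() first and fall back to reading the state and calling
obj_add_ref() when the lookup returns NULL.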

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index ac35033..9843fb9 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,5 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
 		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..13d3e5d
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,278 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/file.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+
+struct cr_objref {
+	int objref;
+	void *ptr;
+	unsigned short type;
+	unsigned short flags;
+	struct hlist_node hash;
+};
+
+struct cr_objhash {
+	struct hlist_head *head;
+	int next_free_objref;
+};
+
+#define CR_OBJHASH_NBITS  10
+#define CR_OBJHASH_TOTAL  (1UL << CR_OBJHASH_NBITS)
+
+static void cr_obj_ref_drop(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		fput((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_obj_ref_grab(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		get_file((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_objhash_clear(struct cr_objhash *objhash)
+{
+	struct hlist_head *h = objhash->head;
+	struct hlist_node *n, *t;
+	struct cr_objref *obj;
+	int i;
+
+	for (i = 0; i < CR_OBJHASH_TOTAL; i++) {
+		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+			cr_obj_ref_drop(obj);
+			kfree(obj);
+		}
+	}
+}
+
+void cr_objhash_free(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash = ctx->objhash;
+
+	if (objhash) {
+		cr_objhash_clear(objhash);
+		kfree(objhash->head);
+		kfree(ctx->objhash);
+		ctx->objhash = NULL;
+	}
+}
+
+int cr_objhash_alloc(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash;
+	struct hlist_head *head;
+
+	objhash = kzalloc(sizeof(*objhash), GFP_KERNEL);
+	if (!objhash)
+		return -ENOMEM;
+	head = kzalloc(CR_OBJHASH_TOTAL * sizeof(*head), GFP_KERNEL);
+	if (!head) {
+		kfree(objhash);
+		return -ENOMEM;
+	}
+
+	objhash->head = head;
+	objhash->next_free_objref = 1;
+
+	ctx->objhash = objhash;
+	return 0;
+}
+
+static struct cr_objref *cr_obj_find_by_ptr(struct cr_ctx *ctx, void *ptr)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_ptr(ptr, CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->ptr == ptr)
+			return obj;
+	return NULL;
+}
+
+static struct cr_objref *cr_obj_find_by_objref(struct cr_ctx *ctx, int objref)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_ptr((void *) objref, CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->objref == objref)
+			return obj;
+	return NULL;
+}
+
+/**
+ * cr_obj_new - allocate an object and add to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Allocate an object referring to @ptr and add to the hash table.
+ * If @objref is zero, assign a unique object reference and use @ptr
+ * as a hash key [checkpoint]. Else use @objref as a key [restart].
+ * In both cases, grab a reference (depending on @type) to said object.
+ */
+static struct cr_objref *cr_obj_new(struct cr_ctx *ctx, void *ptr, int objref,
+				    unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int i;
+
+	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return NULL;
+
+	obj->ptr = ptr;
+	obj->type = type;
+	obj->flags = flags;
+
+	if (objref) {
+		/* use @objref to index (restart) */
+		obj->objref = objref;
+		i = hash_ptr((void *) objref, CR_OBJHASH_NBITS);
+	} else {
+		/* use @ptr to index, assign objref (checkpoint) */
+		obj->objref = ctx->objhash->next_free_objref++;
+		i = hash_ptr(ptr, CR_OBJHASH_NBITS);
+	}
+
+	hlist_add_head(&obj->hash, &ctx->objhash->head[i]);
+	cr_obj_ref_grab(obj);
+	return obj;
+}
+
+/**
+ * cr_obj_add_ptr - add an object to the hash table if not already there
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference [output]
+ * @type: object type
+ * @flags: object flags
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it isn't
+ * already found there, then add the object to the table, and allocate a
+ * fresh unique object reference (objref). Grab a reference to every
+ * object that is added, and maintain the reference until the entire
+ * hash is freed.
+ *
+ * Fills the unique objref of the object into @objref.
+ *
+ * [This is used during checkpoint].
+ *
+ * Returns 0 if found, 1 if added, < 0 on error
+ */
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int ret = 0;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj) {
+		obj = cr_obj_new(ctx, ptr, 0, type, flags);
+		if (!obj)
+			return -ENOMEM;
+		else
+			ret = 1;
+	} else if (obj->type != type)	/* sanity check */
+		return -EINVAL;
+	*objref = obj->objref;
+	return ret;
+}
+
+/**
+ * cr_obj_add_ref - add an object with unique objref to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique identifier - object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Add the object pointed to by @ptr and identified by the unique object
+ * reference given by @objref to the hash table (indexed by @objref).
+ * Grab a reference to every object that is added, and maintain the
+ * reference until the entire hash is freed.
+ *
+ * [This is used during restart].
+ */
+int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_new(ctx, ptr, objref, type, flags);
+	return obj ? 0 : -ENOMEM;
+}
+
+/**
+ * cr_obj_get_by_ptr - find the unique object reference of an object
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * Look up the objref of the object pointed to by @ptr and return it,
+ * or -ESRCH if not found, or -EINVAL if its type does not match @type.
+ *
+ * [This is used during checkpoint].
+ */
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj)
+		return -ESRCH;
+	if (obj->type != type)
+		return -EINVAL;
+	return obj->objref;
+}
+
+/**
+ * cr_obj_get_by_ref - find an object given its unique object reference
+ * @ctx: checkpoint context
+ * @objref: unique identifier - object reference
+ * @type: object type
+ *
+ * Look up the object identified by the unique object reference @objref
+ * and return a pointer to it, or NULL if not found, or ERR_PTR(-EINVAL)
+ * if its type does not match @type.
+ *
+ * [This is used during restart].
+ */
+void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_objref(ctx, objref);
+	if (!obj)
+		return NULL;
+	if (obj->type != type)
+		return ERR_PTR(-EINVAL);
+	return obj->ptr;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index c547a1c..c077cd9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -139,6 +139,7 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
 
 	cr_pgarr_free(ctx);
+	cr_objhash_free(ctx);
 
 	kfree(ctx);
 }
@@ -166,6 +167,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (!ctx->hbuf)
 		goto err;
 
+	if (cr_objhash_alloc(ctx) < 0)
+		goto err;
+
 	return ctx;
 
  err:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ab1b215..7da696c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -29,6 +29,8 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct cr_objhash *objhash;	/* hash for shared objects */
+
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 
 	struct path fs_mnt;	/* container root (FIXME) */
@@ -44,6 +46,24 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+/* shared objects handling */
+
+enum {
+	CR_OBJ_FILE = 1,
+	CR_OBJ_MAX
+};
+
+extern void cr_objhash_free(struct cr_ctx *ctx);
+extern int cr_objhash_alloc(struct cr_ctx *ctx);
+extern void *cr_obj_get_by_ref(struct cr_ctx *ctx,
+			       int objref, unsigned short type);
+extern int cr_obj_get_by_ptr(struct cr_ctx *ctx,
+			     void *ptr, unsigned short type);
+extern int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+			  unsigned short type, unsigned short flags);
+extern int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+			  unsigned short type, unsigned short flags);
+
 struct cr_hdr;
 
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
-- 
1.5.4.3

* [RFC v11][PATCH 07/13] Infrastructure for shared objects
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-12-05 17:31   ` [RFC v11][PATCH 01/13] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart Oren Laadan
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e.g. open files, memory address spaces,
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

The hash is "one-way": objects added to it are never deleted until the
hash is discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.

The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.

Changelog[v11]:
  - Doc: be explicit about grabbing a reference and object lifetime

Changelog[v4]:
  - Fix calculation of hash table size

Changelog[v3]:
  - Use standard hlist_... for hash table

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/objhash.c       |  278 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c           |    4 +
 include/linux/checkpoint.h |   20 +++
 4 files changed, 303 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/objhash.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index ac35033..9843fb9 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,5 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
 		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..13d3e5d
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,278 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/file.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+
+struct cr_objref {
+	int objref;
+	void *ptr;
+	unsigned short type;
+	unsigned short flags;
+	struct hlist_node hash;
+};
+
+struct cr_objhash {
+	struct hlist_head *head;
+	int next_free_objref;
+};
+
+#define CR_OBJHASH_NBITS  10
+#define CR_OBJHASH_TOTAL  (1UL << CR_OBJHASH_NBITS)
+
+static void cr_obj_ref_drop(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		fput((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_obj_ref_grab(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		get_file((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_objhash_clear(struct cr_objhash *objhash)
+{
+	struct hlist_head *h = objhash->head;
+	struct hlist_node *n, *t;
+	struct cr_objref *obj;
+	int i;
+
+	for (i = 0; i < CR_OBJHASH_TOTAL; i++) {
+		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+			cr_obj_ref_drop(obj);
+			kfree(obj);
+		}
+	}
+}
+
+void cr_objhash_free(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash = ctx->objhash;
+
+	if (objhash) {
+		cr_objhash_clear(objhash);
+		kfree(objhash->head);
+		kfree(ctx->objhash);
+		ctx->objhash = NULL;
+	}
+}
+
+int cr_objhash_alloc(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash;
+	struct hlist_head *head;
+
+	objhash = kzalloc(sizeof(*objhash), GFP_KERNEL);
+	if (!objhash)
+		return -ENOMEM;
+	head = kzalloc(CR_OBJHASH_TOTAL * sizeof(*head), GFP_KERNEL);
+	if (!head) {
+		kfree(objhash);
+		return -ENOMEM;
+	}
+
+	objhash->head = head;
+	objhash->next_free_objref = 1;
+
+	ctx->objhash = objhash;
+	return 0;
+}
+
+static struct cr_objref *cr_obj_find_by_ptr(struct cr_ctx *ctx, void *ptr)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_ptr(ptr, CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->ptr == ptr)
+			return obj;
+	return NULL;
+}
+
+static struct cr_objref *cr_obj_find_by_objref(struct cr_ctx *ctx, int objref)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_ptr((void *) objref, CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->objref == objref)
+			return obj;
+	return NULL;
+}
+
+/**
+ * cr_obj_new - allocate an object and add to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Allocate an object referring to @ptr and add to the hash table.
+ * If @objref is zero, assign a unique object reference and use @ptr
+ * as a hash key [checkpoint]. Else use @objref as a key [restart].
+ * In both cases, grab a reference (depending on @type) to said object.
+ */
+static struct cr_objref *cr_obj_new(struct cr_ctx *ctx, void *ptr, int objref,
+				    unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int i;
+
+	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return NULL;
+
+	obj->ptr = ptr;
+	obj->type = type;
+	obj->flags = flags;
+
+	if (objref) {
+		/* use @objref to index (restart) */
+		obj->objref = objref;
+		i = hash_ptr((void *) objref, CR_OBJHASH_NBITS);
+	} else {
+		/* use @ptr to index, assign objref (checkpoint) */
+		obj->objref = ctx->objhash->next_free_objref++;
+		i = hash_ptr(ptr, CR_OBJHASH_NBITS);
+	}
+
+	hlist_add_head(&obj->hash, &ctx->objhash->head[i]);
+	cr_obj_ref_grab(obj);
+	return obj;
+}
+
+/**
+ * cr_obj_add_ptr - add an object to the hash table if not already there
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference [output]
+ * @type: object type
+ * @flags: object flags
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it isn't
+ * already found there, then add the object to the table, and allocate a
+ * fresh unique object reference (objref). Grab a reference to every
+ * object that is added, and maintain the reference until the entire
+ * hash is freed.
+ *
+ * Fills the unique objref of the object into @objref.
+ *
+ * [This is used during checkpoint].
+ *
+ * Returns 0 if found, 1 if added, < 0 on error
+ */
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int ret = 0;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj) {
+		obj = cr_obj_new(ctx, ptr, 0, type, flags);
+		if (!obj)
+			return -ENOMEM;
+		else
+			ret = 1;
+	} else if (obj->type != type)	/* sanity check */
+		return -EINVAL;
+	*objref = obj->objref;
+	return ret;
+}
+
+/**
+ * cr_obj_add_ref - add an object with unique objref to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique identifier - object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Add the object pointed to by @ptr and identified by the unique object
+ * reference given by @objref to the hash table (indexed by @objref).
+ * Grab a reference to every object that is added, and maintain the
+ * reference until the entire hash is freed.
+ *
+ * [This is used during restart].
+ */
+int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_new(ctx, ptr, objref, type, flags);
+	return obj ? 0 : -ENOMEM;
+}
+
+/**
+ * cr_obj_get_by_ptr - find the unique object reference of an object
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * Look up the unique object reference (objref) of the object pointed
+ * to by @ptr; return it, -ESRCH if not found, or -EINVAL on type mismatch.
+ *
+ * [This is used during checkpoint].
+ */
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj)
+		return -ESRCH;
+	if (obj->type != type)
+		return -EINVAL;
+	return obj->objref;
+}
+
+/**
+ * cr_obj_get_by_ref - find an object given its unique object reference
+ * @ctx: checkpoint context
+ * @objref: unique identifier - object reference
+ * @type: object type
+ *
+ * Look up the object identified by the unique object reference @objref
+ * and return a pointer to it, NULL if not found, or ERR_PTR(-EINVAL) if
+ * its type does not match @type.
+ *
+ * [This is used during restart].
+ */
+void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_objref(ctx, objref);
+	if (!obj)
+		return NULL;
+	if (obj->type != type)
+		return ERR_PTR(-EINVAL);
+	return obj->ptr;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index c547a1c..c077cd9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -139,6 +139,7 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
 
 	cr_pgarr_free(ctx);
+	cr_objhash_free(ctx);
 
 	kfree(ctx);
 }
@@ -166,6 +167,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (!ctx->hbuf)
 		goto err;
 
+	if (cr_objhash_alloc(ctx) < 0)
+		goto err;
+
 	return ctx;
 
  err:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ab1b215..7da696c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -29,6 +29,8 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct cr_objhash *objhash;	/* hash for shared objects */
+
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 
 	struct path fs_mnt;	/* container root (FIXME) */
@@ -44,6 +46,24 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+/* shared objects handling */
+
+enum {
+	CR_OBJ_FILE = 1,
+	CR_OBJ_MAX
+};
+
+extern void cr_objhash_free(struct cr_ctx *ctx);
+extern int cr_objhash_alloc(struct cr_ctx *ctx);
+extern void *cr_obj_get_by_ref(struct cr_ctx *ctx,
+			       int objref, unsigned short type);
+extern int cr_obj_get_by_ptr(struct cr_ctx *ctx,
+			     void *ptr, unsigned short type);
+extern int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+			  unsigned short type, unsigned short flags);
+extern int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+			  unsigned short type, unsigned short flags);
+
 struct cr_hdr;
 
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
-- 
1.5.4.3


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 07/13] Infrastructure for shared objects
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy

Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e..g open files, memory address space
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kenrel address).
>From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

The hash is "one-way": objects added to it are never deleted until the
hash it discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.

The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.

Changelog[v11]:
  - Doc: be explicit about grabbing a reference and object lifetime

Changelog[v4]:
  - Fix calculation of hash table size

Changelog[v3]:
  - Use standard hlist_... for hash table

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/objhash.c       |  278 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c           |    4 +
 include/linux/checkpoint.h |   20 +++
 4 files changed, 303 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/objhash.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index ac35033..9843fb9 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,5 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
 		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..13d3e5d
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,278 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/file.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+
+struct cr_objref {
+	int objref;
+	void *ptr;
+	unsigned short type;
+	unsigned short flags;
+	struct hlist_node hash;
+};
+
+struct cr_objhash {
+	struct hlist_head *head;
+	int next_free_objref;
+};
+
+#define CR_OBJHASH_NBITS  10
+#define CR_OBJHASH_TOTAL  (1UL << CR_OBJHASH_NBITS)
+
+static void cr_obj_ref_drop(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		fput((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_obj_ref_grab(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		get_file((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_objhash_clear(struct cr_objhash *objhash)
+{
+	struct hlist_head *h = objhash->head;
+	struct hlist_node *n, *t;
+	struct cr_objref *obj;
+	int i;
+
+	for (i = 0; i < CR_OBJHASH_TOTAL; i++) {
+		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+			cr_obj_ref_drop(obj);
+			kfree(obj);
+		}
+	}
+}
+
+void cr_objhash_free(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash = ctx->objhash;
+
+	if (objhash) {
+		cr_objhash_clear(objhash);
+		kfree(objhash->head);
+		kfree(ctx->objhash);
+		ctx->objhash = NULL;
+	}
+}
+
+int cr_objhash_alloc(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash;
+	struct hlist_head *head;
+
+	objhash = kzalloc(sizeof(*objhash), GFP_KERNEL);
+	if (!objhash)
+		return -ENOMEM;
+	head = kzalloc(CR_OBJHASH_TOTAL * sizeof(*head), GFP_KERNEL);
+	if (!head) {
+		kfree(objhash);
+		return -ENOMEM;
+	}
+
+	objhash->head = head;
+	objhash->next_free_objref = 1;
+
+	ctx->objhash = objhash;
+	return 0;
+}
+
+static struct cr_objref *cr_obj_find_by_ptr(struct cr_ctx *ctx, void *ptr)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_ptr(ptr, CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->ptr == ptr)
+			return obj;
+	return NULL;
+}
+
+static struct cr_objref *cr_obj_find_by_objref(struct cr_ctx *ctx, int objref)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_ptr((void *) objref, CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->objref == objref)
+			return obj;
+	return NULL;
+}
+
+/**
+ * cr_obj_new - allocate an object and add to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Allocate an object referring to @ptr and add to the hash table.
+ * If @objref is zero, assign a unique object reference and use @ptr
+ * as a hash key [checkpoint]. Else use @objref as a key [restart].
+ * In both cases, grab a reference (depending on @type) to said obejct.
+ */
+static struct cr_objref *cr_obj_new(struct cr_ctx *ctx, void *ptr, int objref,
+				    unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int i;
+
+	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return NULL;
+
+	obj->ptr = ptr;
+	obj->type = type;
+	obj->flags = flags;
+
+	if (objref) {
+		/* use @objref to index (restart) */
+		obj->objref = objref;
+		i = hash_ptr((void *) objref, CR_OBJHASH_NBITS);
+	} else {
+		/* use @ptr to index, assign objref (checkpoint) */
+		obj->objref = ctx->objhash->next_free_objref++;;
+		i = hash_ptr(ptr, CR_OBJHASH_NBITS);
+	}
+
+	hlist_add_head(&obj->hash, &ctx->objhash->head[i]);
+	cr_obj_ref_grab(obj);
+	return obj;
+}
+
+/**
+ * cr_obj_add_ptr - add an object to the hash table if not already there
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference [output]
+ * @type: object type
+ * @flags: object flags
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it isn't
+ * already found there, then add the object to the table, and allocate a
+ * fresh unique object reference (objref). Grab a reference to every
+ * object that is added, and maintain the reference until the entire
+ * hash is free. 
+ *
+ * Fills the unique objref of the object into @objref.
+ * 
+ * [This is used during checkpoint].
+ *
+ * Returns 0 if found, 1 if added, < 0 on error
+ */
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int ret = 0;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj) {
+		obj = cr_obj_new(ctx, ptr, 0, type, flags);
+		if (!obj)
+			return -ENOMEM;
+		else
+			ret = 1;
+	} else if (obj->type != type)	/* sanity check */
+		return -EINVAL;
+	*objref = obj->objref;
+	return ret;
+}
+
+/**
+ * cr_obj_add_ref - add an object with unique objref to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique identifier - object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Add the object pointer to by @ptr and identified by unique object
+ * reference given by @objref to the hash table (indexed by @objref).
+ * Grab a reference to every object that is added, and maintain the
+ * reference until the entire hash is free. 
+ *
+ * [This is used during restart].
+ */
+int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_new(ctx, ptr, objref, type, flags);
+	return obj ? 0 : -ENOMEM;
+}
+
+/**
+ * cr_obj_get_by_ptr - find the unique object reference of an object
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * Look up the unique object reference (objref) of the object pointed
+ * to by @ptr, and return that number, or 0 if not found.
+ *
+ * [This is used during checkpoint].
+ */
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj)
+		return -ESRCH;
+	if (obj->type != type)
+		return -EINVAL;
+	return obj->objref;
+}
+
+/**
+ * cr_obj_get_by_ref - find an object given its unique object reference
+ * @ctx: checkpoint context
+ * @objref: unique identifier - object reference
+ * @type: object type
+ *
+ * Look up the object who is identified by unique object reference that
+ * is specified by @objref, and return a pointer to that matching object,
+ * or NULL if not found.
+ *
+ * [This is used during restart].
+ */
+void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_objref(ctx, objref);
+	if (!obj)
+		return NULL;
+	if (obj->type != type)
+		return ERR_PTR(-EINVAL);
+	return obj->ptr;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index c547a1c..c077cd9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -139,6 +139,7 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
 
 	cr_pgarr_free(ctx);
+	cr_objhash_free(ctx);
 
 	kfree(ctx);
 }
@@ -166,6 +167,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (!ctx->hbuf)
 		goto err;
 
+	if (cr_objhash_alloc(ctx) < 0)
+		goto err;
+
 	return ctx;
 
  err:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ab1b215..7da696c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -29,6 +29,8 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct cr_objhash *objhash;	/* hash for shared objects */
+
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 
 	struct path fs_mnt;	/* container root (FIXME) */
@@ -44,6 +46,24 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+/* shared objects handling */
+
+enum {
+	CR_OBJ_FILE = 1,
+	CR_OBJ_MAX
+};
+
+extern void cr_objhash_free(struct cr_ctx *ctx);
+extern int cr_objhash_alloc(struct cr_ctx *ctx);
+extern void *cr_obj_get_by_ref(struct cr_ctx *ctx,
+			       int objref, unsigned short type);
+extern int cr_obj_get_by_ptr(struct cr_ctx *ctx,
+			     void *ptr, unsigned short type);
+extern int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+			  unsigned short type, unsigned short flags);
+extern int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+			  unsigned short type, unsigned short flags);
+
 struct cr_hdr;
 
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
-- 
1.5.4.3


* [RFC v11][PATCH 08/13] Dump open file descriptors
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2008-12-05 17:31   ` [RFC v11][PATCH 07/13] Infrastructure for shared objects Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 09/13] Restore open file descriptors Oren Laadan
                     ` (9 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Because the 'struct file' corresponding to an
FD can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it lives
in the hash (the hash is only cleaned up at the end of the checkpoint).
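
To illustrate the sharing described above, here is a minimal user-space
sketch (not the kernel code) of mapping a pointer to an objref, so that
a 'struct file' reached through several FDs is registered only once.
The names obj_table, obj_add_ptr() and OBJ_MAX are made up for this
example, a linear array stands in for the real hash, and numbering
objrefs from 1 is an assumption. The return convention (1 if newly
added, 0 if already present) follows the way cr_write_fd_ent() uses
cr_obj_add_ptr() in this patch.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_MAX 64	/* hypothetical capacity, for illustration only */

struct obj_table {
	const void *ptr[OBJ_MAX];	/* registered pointers */
	int count;			/* number of entries in use */
};

/*
 * Register @ptr and return its objref via @objref.
 * Returns 1 if @ptr was newly added, 0 if it was already known,
 * and -1 if the table is full.
 */
static int obj_add_ptr(struct obj_table *t, const void *ptr, int *objref)
{
	int i;

	assert(t && ptr && objref);

	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*objref = i + 1;	/* objrefs start at 1 (assumed) */
			return 0;
		}
	}
	if (t->count == OBJ_MAX)
		return -1;
	t->ptr[t->count++] = ptr;
	*objref = t->count;
	return 1;
}
```

Registering the same pointer twice yields the same objref both times,
with a return value of 1 on the first call and 0 on the second.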

For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its
close-on-exec property, and the objref of the corresponding 'file *'.
If this is the first time the 'file *' is seen, the entry is followed
by a 'struct cr_hdr_fd_data' with the file state. The next FD follows,
and so on.
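
The resulting record order can be sketched in user space as follows.
This is only an illustration under stated assumptions: the tags
REC_FD_ENT and REC_FD_DATA and the helper emit_fd_records() are
hypothetical, plain pointers stand in for 'struct file', and the file
name that the patch writes after the data record is left out.

```c
#include <assert.h>
#include <stddef.h>

enum rec_type { REC_FD_ENT = 1, REC_FD_DATA };	/* illustrative tags */

/*
 * Walk @nfds descriptors whose 'struct file' stand-ins are in @files
 * and write the resulting record types to @out (capacity @outmax).
 * Returns the number of records, or -1 if @out is too small.
 */
static int emit_fd_records(const void **files, int nfds,
			   enum rec_type *out, int outmax)
{
	int n = 0, i, j;

	assert(files && out);

	for (i = 0; i < nfds; i++) {
		int seen = 0;

		/* has this file pointer been dumped for an earlier fd? */
		for (j = 0; j < i; j++) {
			if (files[j] == files[i]) {
				seen = 1;
				break;
			}
		}
		if (n + (seen ? 1 : 2) > outmax)
			return -1;
		out[n++] = REC_FD_ENT;		/* always: fd, c-o-e, objref */
		if (!seen)
			out[n++] = REC_FD_DATA;	/* first sighting only */
	}
	return n;
}
```

For three FDs where the first and third share one file, the stream is
ENT, DATA, ENT, DATA, ENT: the shared file's state is written once.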

Recall that it is assumed that all tasks possibly sharing the file table
are frozen. If this assumption breaks, then the behavior is *undefined*:
checkpoint may fail, or restart from the resulting image file will fail.

This patch only handles basic FDs - regular files, directories.

Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - cr_scan_fds() retries from scratch if it hits the size limit
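
The retry-from-scratch approach can be sketched in user space as shown
here. This is an illustration only: scan_open_slots() and its byte-map
input are hypothetical stand-ins for the fdtable walk, realloc() stands
in for krealloc(), and no locking is modelled. Unlike the patch, the
sketch keeps the old pointer so that it can be freed if the
reallocation fails.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Collect the indices of all non-zero entries of @slots[0..max-1].
 * Returns the count and stores a malloc()ed array in *@out (the caller
 * frees it), or returns -1 on allocation failure.
 */
static int scan_open_slots(const unsigned char *slots, int max, int **out)
{
	int *fds = NULL, *tmp;
	int tot = 4;	/* deliberately small initial guess */
	int i, n;

	assert(slots && out);
retry:
	tmp = realloc(fds, tot * sizeof(*fds));
	if (!tmp) {
		free(fds);	/* realloc() leaves the old block intact */
		return -1;
	}
	fds = tmp;

	for (n = 0, i = 0; i < max; i++) {
		if (!slots[i])
			continue;
		if (n == tot) {
			tot *= 2;	/* too small: grow and rescan from 0 */
			goto retry;
		}
		fds[n++] = i;
	}
	*out = fds;
	return n;
}
```

Restarting the scan keeps the logic simple at the cost of repeating
work; with a doubling buffer the number of rescans stays logarithmic
in the number of open slots.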

Changelog[v9]:
  - Fix a couple of leaks in cr_write_files()
  - Drop useless kfree from cr_scan_fds()

Changelog[v8]:
  - Initialize 'coe' to work around a false gcc warning

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 checkpoint/Makefile            |    2 +-
 checkpoint/checkpoint.c        |    4 +
 checkpoint/checkpoint_file.h   |   17 +++
 checkpoint/ckpt_file.c         |  224 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h     |    3 +-
 include/linux/checkpoint_hdr.h |   31 ++++++-
 6 files changed, 278 insertions(+), 3 deletions(-)
 create mode 100644 checkpoint/checkpoint_file.h
 create mode 100644 checkpoint/ckpt_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 9843fb9..7496695 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 56d0ec2..75c7cd3 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -233,6 +233,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_files(ctx, t);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/checkpoint_file.h b/checkpoint/checkpoint_file.h
new file mode 100644
index 0000000..9dc3eba
--- /dev/null
+++ b/checkpoint/checkpoint_file.h
@@ -0,0 +1,17 @@
+#ifndef _CHECKPOINT_CKPT_FILE_H_
+#define _CHECKPOINT_CKPT_FILE_H_
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/fdtable.h>
+
+int cr_scan_fds(struct files_struct *files, int **fdtable);
+
+#endif /* _CHECKPOINT_CKPT_FILE_H_ */
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..6f73f8b
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,224 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+#define CR_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * cr_scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+int cr_scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i, n;
+	int tot = CR_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely rescan from scratch after krealloc()
+	 * grows the array. Otherwise the file table may be modified by
+	 * another task after we scan it. The behavior in this case is
+	 * undefined, and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	spin_lock(&files->file_lock);
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (n = 0, i = 0; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+	spin_unlock(&files->file_lock);
+
+	*fdtable = fds;
+	return n;
+}
+
+/* cr_write_fd_data - dump the state of a given file pointer */
+static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct dentry *dent = file->f_dentry;
+	struct inode *inode = dent->d_inode;
+	enum fd_type fd_type;
+	int ret;
+
+	h.type = CR_HDR_FD_DATA;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	hh->f_flags = file->f_flags;
+	hh->f_mode = file->f_mode;
+	hh->f_pos = file->f_pos;
+	hh->f_version = file->f_version;
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		fd_type = CR_FD_FILE;
+		break;
+	case S_IFDIR:
+		fd_type = CR_FD_DIR;
+		break;
+	default:
+		cr_hbuf_put(ctx, sizeof(*hh));
+		return -EBADF;
+	}
+
+	/* FIX: check if the file/dir/link is unlinked */
+	hh->fd_type = fd_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_fname(ctx, &file->f_path, &ctx->fs_mnt);
+}
+
+/**
+ * cr_write_fd_ent - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls cr_write_fd_data to dump the file pointer too.
+ */
+static int
+cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	struct fdtable *fdt;
+	int objref, new, ret;
+	int coe = 0;	/* avoid gcc warning */
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	if (!file) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	/* adding 'file' to the hash will keep a reference to it */
+	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
+	cr_debug("fd %d objref %d file %p c-o-e %d\n", fd, objref, file, coe);
+
+	if (new < 0) {
+		ret = new;
+		goto out;
+	}
+
+	h.type = CR_HDR_FD_ENT;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->objref = objref;
+	hh->fd = fd;
+	hh->close_on_exec = coe;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		goto out;
+
+	/* new==1 if-and-only-if file was newly added to hash */
+	if (new)
+		ret = cr_write_fd_data(ctx, file, objref);
+
+out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (file)
+		fput(file);
+	return ret;
+}
+
+int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h.type = CR_HDR_FILES;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	files = get_files_struct(t);
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	hh->objref = 0;	/* will be meaningful with multiple processes */
+	hh->nfds = nfds;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	cr_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = cr_write_fd_ent(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+ out:
+	kfree(fdtable);
+	put_files_struct(files);
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 7da696c..119090b 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,7 +13,7 @@
 #include <linux/path.h>
 #include <linux/fs.h>
 
-#define CR_VERSION  1
+#define CR_VERSION  2
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -84,6 +84,7 @@ extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);
 
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index d78f0f1..8c3b5b2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -17,7 +17,7 @@
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
- * __attribute__ ((aligned (8))) for the entire structure.
+ * __attribute__((aligned(8))) for the entire structure.
  */
 
 /* records: generic header */
@@ -45,6 +45,10 @@ enum {
 	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
+	CR_HDR_FILES = 301,
+	CR_HDR_FD_ENT,
+	CR_HDR_FD_DATA,
+
 	CR_HDR_TAIL = 5001
 };
 
@@ -107,4 +111,29 @@ struct cr_hdr_pgarr {
 	__u64 nr_pages;		/* number of pages to saved */
 } __attribute__((aligned(8)));
 
+struct cr_hdr_files {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 nfds;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_fd_ent {
+	__u32 objref;		/* identifier for shared objects */
+	__s32 fd;
+	__u32 close_on_exec;
+} __attribute__((aligned(8)));
+
+/* fd types */
+enum  fd_type {
+	CR_FD_FILE = 1,
+	CR_FD_DIR,
+};
+
+struct cr_hdr_fd_data {
+	__u16 fd_type;
+	__u16 f_mode;
+	__u32 f_flags;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3

* [RFC v11][PATCH 08/13] Dump open file descriptors
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd-r2nGTMty4D4,
	jeremy-TSDbQ3PG+2Y, Oren Laadan

Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Because the 'struct file' corresponding to an
FD can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it lives
in the hash (the hash is only cleaned up at the end of the checkpoint).

For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its
close-on-exec property, and the objref of the corresponding 'file *'.
If this is the first time the 'file *' is seen, the entry is followed
by a 'struct cr_hdr_fd_data' with the file state. The next FD follows,
and so on.

Recall that it is assumed that all tasks possibly sharing the file table
are frozen. If this assumption breaks, then the behavior is *undefined*:
checkpoint may fail, or restart from the resulting image file will fail.

This patch only handles basic FDs - regular files, directories.

Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - cr_scan_fds() retries from scratch if it hits the size limit

Changelog[v9]:
  - Fix a couple of leaks in cr_write_files()
  - Drop useless kfree from cr_scan_fds()

Changelog[v8]:
  - Initialize 'coe' to work around a false gcc warning

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 checkpoint/Makefile            |    2 +-
 checkpoint/checkpoint.c        |    4 +
 checkpoint/checkpoint_file.h   |   17 +++
 checkpoint/ckpt_file.c         |  224 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h     |    3 +-
 include/linux/checkpoint_hdr.h |   31 ++++++-
 6 files changed, 278 insertions(+), 3 deletions(-)
 create mode 100644 checkpoint/checkpoint_file.h
 create mode 100644 checkpoint/ckpt_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 9843fb9..7496695 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 56d0ec2..75c7cd3 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -233,6 +233,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_files(ctx, t);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/checkpoint_file.h b/checkpoint/checkpoint_file.h
new file mode 100644
index 0000000..9dc3eba
--- /dev/null
+++ b/checkpoint/checkpoint_file.h
@@ -0,0 +1,17 @@
+#ifndef _CHECKPOINT_CKPT_FILE_H_
+#define _CHECKPOINT_CKPT_FILE_H_
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/fdtable.h>
+
+int cr_scan_fds(struct files_struct *files, int **fdtable);
+
+#endif /* _CHECKPOINT_CKPT_FILE_H_ */
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..6f73f8b
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,224 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+#define CR_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * cr_scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+int cr_scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i, n;
+	int tot = CR_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely rescan from scratch after krealloc()
+	 * grows the array. Otherwise the file table may be modified by
+	 * another task after we scan it. The behavior in this case is
+	 * undefined, and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	spin_lock(&files->file_lock);
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (n = 0, i = 0; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+	spin_unlock(&files->file_lock);
+
+	*fdtable = fds;
+	return n;
+}
+
+/* cr_write_fd_data - dump the state of a given file pointer */
+static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct dentry *dent = file->f_dentry;
+	struct inode *inode = dent->d_inode;
+	enum fd_type fd_type;
+	int ret;
+
+	h.type = CR_HDR_FD_DATA;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	hh->f_flags = file->f_flags;
+	hh->f_mode = file->f_mode;
+	hh->f_pos = file->f_pos;
+	hh->f_version = file->f_version;
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		fd_type = CR_FD_FILE;
+		break;
+	case S_IFDIR:
+		fd_type = CR_FD_DIR;
+		break;
+	default:
+		cr_hbuf_put(ctx, sizeof(*hh));
+		return -EBADF;
+	}
+
+	/* FIX: check if the file/dir/link is unlinked */
+	hh->fd_type = fd_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_fname(ctx, &file->f_path, &ctx->fs_mnt);
+}
+
+/**
+ * cr_write_fd_ent - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls cr_write_fd_data to dump the file pointer too.
+ */
+static int
+cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	struct fdtable *fdt;
+	int objref, new, ret;
+	int coe = 0;	/* avoid gcc warning */
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	if (!file) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	/* adding 'file' to the hash will keep a reference to it */
+	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
+	cr_debug("fd %d objref %d file %p c-o-e %d\n", fd, objref, file, coe);
+
+	if (new < 0) {
+		ret = new;
+		goto out;
+	}
+
+	h.type = CR_HDR_FD_ENT;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->objref = objref;
+	hh->fd = fd;
+	hh->close_on_exec = coe;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		goto out;
+
+	/* new==1 if-and-only-if file was newly added to hash */
+	if (new)
+		ret = cr_write_fd_data(ctx, file, objref);
+
+out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (file)
+		fput(file);
+	return ret;
+}
+
+int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h.type = CR_HDR_FILES;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	files = get_files_struct(t);
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	hh->objref = 0;	/* will be meaningful with multiple processes */
+	hh->nfds = nfds;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	cr_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = cr_write_fd_ent(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+ out:
+	kfree(fdtable);
+	put_files_struct(files);
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 7da696c..119090b 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,7 +13,7 @@
 #include <linux/path.h>
 #include <linux/fs.h>
 
-#define CR_VERSION  1
+#define CR_VERSION  2
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -84,6 +84,7 @@ extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);
 
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index d78f0f1..8c3b5b2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -17,7 +17,7 @@
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
- * __attribute__ ((aligned (8))) for the entire structure.
+ * __attribute__((aligned(8))) for the entire structure.
  */
 
 /* records: generic header */
@@ -45,6 +45,10 @@ enum {
 	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
+	CR_HDR_FILES = 301,
+	CR_HDR_FD_ENT,
+	CR_HDR_FD_DATA,
+
 	CR_HDR_TAIL = 5001
 };
 
@@ -107,4 +111,29 @@ struct cr_hdr_pgarr {
 	__u64 nr_pages;		/* number of pages to saved */
 } __attribute__((aligned(8)));
 
+struct cr_hdr_files {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 nfds;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_fd_ent {
+	__u32 objref;		/* identifier for shared objects */
+	__s32 fd;
+	__u32 close_on_exec;
+} __attribute__((aligned(8)));
+
+/* fd types */
+enum  fd_type {
+	CR_FD_FILE = 1,
+	CR_FD_DIR,
+};
+
+struct cr_hdr_fd_data {
+	__u16 fd_type;
+	__u16 f_mode;
+	__u32 f_flags;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 08/13] Dump open file descriptors
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy

Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Because the 'struct file' corresponding to an
FD can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it lives
in the hash (the hash is only cleaned up at the end of the checkpoint).

For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its
close-on-exec property, and the objref of the corresponding 'file *'.
If the 'file *' is seen for the first time, this is followed by a
'struct cr_hdr_fd_data' with the file state. Then comes the next FD,
and so on.
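
For illustration only (not part of this patch), the record ordering can
be sketched in userspace C. The struct mirrors 'struct cr_hdr_fd_ent'
from this patch using <stdint.h> types; demo_hash, demo_obj_add_ptr()
and demo_fill_fd_ent() are hypothetical stand-ins for the object hash
and cr_write_fd_ent(), and the 1-based objref numbering is an
assumption of the sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace mirror of the per-descriptor record added by this patch. */
struct cr_hdr_fd_ent {
	uint32_t objref;	/* identifier for shared objects */
	int32_t fd;
	uint32_t close_on_exec;
} __attribute__((aligned(8)));

#define DEMO_MAX_FILES 16	/* hypothetical limit, sketch only */

/* Hypothetical stand-in for the object hash: 'file *' -> objref. */
struct demo_hash {
	const void *ptr[DEMO_MAX_FILES];
	int count;
};

/*
 * Find @file in the table, adding it if missing. The objref (assumed to
 * start at 1) is stored in *objref. Returns 1 if newly added, 0 if it
 * was already present, and -1 if the table is full.
 */
int demo_obj_add_ptr(struct demo_hash *h, const void *file, int *objref)
{
	int i;

	assert(h->count >= 0 && h->count <= DEMO_MAX_FILES);
	for (i = 0; i < h->count; i++) {
		if (h->ptr[i] == file) {
			*objref = i + 1;
			return 0;
		}
	}
	if (h->count == DEMO_MAX_FILES)
		return -1;
	h->ptr[h->count++] = file;
	*objref = h->count;
	return 1;
}

/*
 * Fill @ent for one descriptor. Returns 1 if a 'struct cr_hdr_fd_data'
 * record must follow (first time this file is seen), 0 if the entry
 * alone is enough, or -1 on error.
 */
int demo_fill_fd_ent(struct demo_hash *h, int fd, const void *file,
		     int coe, struct cr_hdr_fd_ent *ent)
{
	int objref;
	int new = demo_obj_add_ptr(h, file, &objref);

	if (new < 0)
		return -1;
	ent->objref = objref;
	ent->fd = fd;
	ent->close_on_exec = coe;
	return new;
}
```

Two descriptors that share one 'file *' get the same objref, and only
the first of them is followed by the file state.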

Recall that it is assumed that all tasks possibly sharing the file table
are frozen. If this assumption breaks, then the behavior is *undefined*:
checkpoint may fail, or restart from the resulting image file will fail.
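
As a rough userspace analogue of cr_scan_fds() (illustration only, not
part of this patch): the array of open descriptors starts from a guess
and is doubled whenever it fills up, after which the scan restarts from
descriptor 0. The sketch assumes a single-threaded caller, which plays
the role of the "frozen" assumption above. demo_scan_fds() is a
hypothetical name, and fcntl(F_GETFD) stands in for fcheck_files():

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Collect the open descriptors below @max_fds into a malloc'ed array.
 * @tot is the initial capacity guess. Returns the number of open
 * descriptors found (array in *out, caller frees), or -1 if the
 * allocation fails.
 */
int demo_scan_fds(int max_fds, int tot, int **out)
{
	int *fds = NULL, *tmp;
	int i, n;

	assert(tot > 0 && out != NULL);
retry:
	tmp = realloc(fds, tot * sizeof(*fds));
	if (!tmp) {
		free(fds);
		return -1;
	}
	fds = tmp;

	for (n = 0, i = 0; i < max_fds; i++) {
		if (fcntl(i, F_GETFD) == -1)	/* not an open descriptor */
			continue;
		if (n == tot) {
			tot *= 2;	/* too small: grow, rescan from 0 */
			goto retry;
		}
		fds[n++] = i;
	}

	*out = fds;
	return n;
}
```

Rescanning from the start is only correct if nothing opens or closes
descriptors between passes, which is what the assumption guarantees.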

This patch only handles basic FDs: regular files and directories.

Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - cr_scan_fds() retries from scratch if it hits the size limit

Changelog[v9]:
  - Fix a couple of leaks in cr_write_files()
  - Drop useless kfree from cr_scan_fds()

Changelog[v8]:
  - initialize 'coe' to work around a false gcc warning

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 checkpoint/Makefile            |    2 +-
 checkpoint/checkpoint.c        |    4 +
 checkpoint/checkpoint_file.h   |   17 +++
 checkpoint/ckpt_file.c         |  224 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h     |    3 +-
 include/linux/checkpoint_hdr.h |   31 ++++++-
 6 files changed, 278 insertions(+), 3 deletions(-)
 create mode 100644 checkpoint/checkpoint_file.h
 create mode 100644 checkpoint/ckpt_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 9843fb9..7496695 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 56d0ec2..75c7cd3 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -233,6 +233,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_files(ctx, t);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/checkpoint_file.h b/checkpoint/checkpoint_file.h
new file mode 100644
index 0000000..9dc3eba
--- /dev/null
+++ b/checkpoint/checkpoint_file.h
@@ -0,0 +1,17 @@
+#ifndef _CHECKPOINT_CKPT_FILE_H_
+#define _CHECKPOINT_CKPT_FILE_H_
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/fdtable.h>
+
+int cr_scan_fds(struct files_struct *files, int **fdtable);
+
+#endif /* _CHECKPOINT_CKPT_FILE_H_ */
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..6f73f8b
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,224 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+#define CR_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * cr_scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+int cr_scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i, n;
+	int tot = CR_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process checkpointing ourselves).
+	 * Therefore, we can safely drop the locks around krealloc() and
+	 * rescan from the start. Otherwise the file table may be modified
+	 * by another task after we scan it. The behavior in this case is
+	 * undefined, and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	spin_lock(&files->file_lock);
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (n = 0, i = 0; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			spin_unlock(&files->file_lock);
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+	spin_unlock(&files->file_lock);
+
+	*fdtable = fds;
+	return n;
+}
+
+/* cr_write_fd_data - dump the state of a given file pointer */
+static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct dentry *dent = file->f_dentry;
+	struct inode *inode = dent->d_inode;
+	enum fd_type fd_type;
+	int ret;
+
+	h.type = CR_HDR_FD_DATA;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	hh->f_flags = file->f_flags;
+	hh->f_mode = file->f_mode;
+	hh->f_pos = file->f_pos;
+	hh->f_version = file->f_version;
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		fd_type = CR_FD_FILE;
+		break;
+	case S_IFDIR:
+		fd_type = CR_FD_DIR;
+		break;
+	default:
+		cr_hbuf_put(ctx, sizeof(*hh));
+		return -EBADF;
+	}
+
+	/* FIX: check if the file/dir/link is unlinked */
+	hh->fd_type = fd_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_fname(ctx, &file->f_path, &ctx->fs_mnt);
+}
+
+/**
+ * cr_write_fd_ent - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls cr_write_fd_data to dump the file pointer too.
+ */
+static int
+cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	struct fdtable *fdt;
+	int objref, new, ret;
+	int coe = 0;	/* avoid gcc warning */
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	if (!file) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	/* adding 'file' to the hash will keep a reference to it */
+	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
+	cr_debug("fd %d objref %d file %p c-o-e %d\n", fd, objref, file, coe);
+
+	if (new < 0) {
+		ret = new;
+		goto out;
+	}
+
+	h.type = CR_HDR_FD_ENT;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->objref = objref;
+	hh->fd = fd;
+	hh->close_on_exec = coe;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		goto out;
+
+	/* new==1 if-and-only-if file was newly added to hash */
+	if (new)
+		ret = cr_write_fd_data(ctx, file, objref);
+
+out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (file)
+		fput(file);
+	return ret;
+}
+
+int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h.type = CR_HDR_FILES;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	files = get_files_struct(t);
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	hh->objref = 0;	/* will be meaningful with multiple processes */
+	hh->nfds = nfds;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	cr_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = cr_write_fd_ent(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+ out:
+	kfree(fdtable);
+	put_files_struct(files);
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 7da696c..119090b 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,7 +13,7 @@
 #include <linux/path.h>
 #include <linux/fs.h>
 
-#define CR_VERSION  1
+#define CR_VERSION  2
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -84,6 +84,7 @@ extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);
 
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index d78f0f1..8c3b5b2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -17,7 +17,7 @@
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
- * __attribute__ ((aligned (8))) for the entire structure.
+ * __attribute__((aligned(8))) for the entire structure.
  */
 
 /* records: generic header */
@@ -45,6 +45,10 @@ enum {
 	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
+	CR_HDR_FILES = 301,
+	CR_HDR_FD_ENT,
+	CR_HDR_FD_DATA,
+
 	CR_HDR_TAIL = 5001
 };
 
@@ -107,4 +111,29 @@ struct cr_hdr_pgarr {
 	__u64 nr_pages;		/* number of pages to saved */
 } __attribute__((aligned(8)));
 
+struct cr_hdr_files {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 nfds;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_fd_ent {
+	__u32 objref;		/* identifier for shared objects */
+	__s32 fd;
+	__u32 close_on_exec;
+} __attribute__((aligned(8)));
+
+/* fd types */
+enum  fd_type {
+	CR_FD_FILE = 1,
+	CR_FD_DIR,
+};
+
+struct cr_hdr_fd_data {
+	__u16 fd_type;
+	__u16 f_mode;
+	__u32 f_flags;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 09/13] Restore open file descriptors
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2008-12-05 17:31   ` [RFC v11][PATCH 08/13] Dump open file descriptors Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 10/13] External checkpoint of a task other than ourself Oren Laadan
                     ` (8 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

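Restore open file descriptors: for each FD, read 'struct cr_hdr_fd_ent'
and look up its objref in the hash table. If it is not found (first
occurrence), read in 'struct cr_hdr_fd_data', open the file, register it
in the hash and attach it to a new FD. Otherwise attach the file pointer
from the hash to a new FD. Either way, the FD is then moved to its
original number and its close-on-exec flag is restored.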
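
For illustration only (not part of this patch), the same steps can be
sketched with POSIX calls in userspace. The kernel code uses sys_dup2(),
sys_close(), set_close_on_exec() and vfs_llseek(); here dup2(), close(),
fcntl() and lseek() stand in for them. demo_place_fd(),
demo_restore_new() and demo_restore_shared() are hypothetical names:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move @newfd to @wanted if they differ, then restore the close-on-exec
 * flag. Returns @wanted on success, -1 on error.
 */
int demo_place_fd(int newfd, int wanted, int close_on_exec)
{
	assert(newfd >= 0 && wanted >= 0);
	if (newfd != wanted) {
		if (dup2(newfd, wanted) < 0) {
			close(newfd);
			return -1;
		}
		close(newfd);
	}
	if (close_on_exec && fcntl(wanted, F_SETFD, FD_CLOEXEC) < 0)
		return -1;
	return wanted;
}

/*
 * First occurrence of a file: reopen it, restore its offset, and place
 * it at @wanted. @flags must not contain O_CREAT. A failed seek is
 * ignored on non-seekable files (ESPIPE).
 */
int demo_restore_new(const char *path, int flags, off_t pos,
		     int wanted, int close_on_exec)
{
	int fd = open(path, flags);

	if (fd < 0)
		return -1;
	if (lseek(fd, pos, SEEK_SET) == (off_t)-1 && errno != ESPIPE) {
		close(fd);
		return -1;
	}
	return demo_place_fd(fd, wanted, close_on_exec);
}

/*
 * Later occurrence: attach the already-open file to another descriptor.
 * Both descriptors then share one offset, as with a shared 'file *'.
 */
int demo_restore_shared(int existing, int wanted, int close_on_exec)
{
	int fd = dup(existing);

	if (fd < 0)
		return -1;
	return demo_place_fd(fd, wanted, close_on_exec);
}
```

Since the two descriptors refer to the same open file, seeking through
one of them is visible through the other.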

This patch only handles basic FDs: regular files and directories.

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/restart.c       |    4 +
 checkpoint/rstr_file.c     |  248 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h |    1 +
 4 files changed, 254 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 7496695..88bbc10 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o ckpt_file.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index d90c28a..22e7995 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -261,6 +261,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_files(ctx);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..e06db81
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,248 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+static int cr_close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * cr_attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+/**
+ * cr_attach_get_file - attach (and get) lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_get_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		get_file(file);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
+
+/* cr_read_fd_data - restore the state of a given file pointer */
+static int
+cr_read_fd_data(struct cr_ctx *ctx, struct files_struct *files, int rparent)
+{
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int parent, ret;
+	int fd = 0;	/* pacify gcc warning */
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_DATA);
+	cr_debug("rparent %d parent %d flags %#x mode %#x how %d\n",
+		 rparent, parent, hh->f_flags, hh->f_mode, hh->fd_type);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+	if (parent != rparent)
+		goto out;
+
+	/* FIX: more sanity checks on f_flags, f_mode etc */
+
+	switch (hh->fd_type) {
+	case CR_FD_FILE:
+	case CR_FD_DIR:
+		file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
+		break;
+	default:
+		goto out;
+	}
+
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* adding <objref,file> to the hash will keep a reference to it */
+	ret = cr_obj_add_ref(ctx, file, parent, CR_OBJ_FILE, 0);
+	if (ret < 0) {
+		filp_close(file, NULL);
+		goto out;
+	}
+
+	fd = cr_attach_file(file);	/* no need to cleanup 'file' below */
+	if (fd < 0) {
+		ret = fd;
+		filp_close(file, NULL);
+		goto out;
+	}
+
+	ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
+	if (ret < 0)
+		goto out;
+	ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
+	if (ret == -ESPIPE)	/* ignore error on non-seekable files */
+		ret = 0;
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret < 0 ? ret : fd;
+}
+
+/**
+ * cr_read_fd_ent - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @rparent: expected parent objref
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls cr_read_fd_data to restore the file too.
+ */
+static int
+cr_read_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int rparent)
+{
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int newfd, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_ENT);
+	cr_debug("rparent %d parent %d ref %d fd %d c.o.e %d\n",
+		 rparent, parent, hh->objref, hh->fd, hh->close_on_exec);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+	if (parent != rparent)
+		goto out;
+	if (hh->objref <= 0)
+		goto out;
+
+	file = cr_obj_get_by_ref(ctx, hh->objref, CR_OBJ_FILE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	if (file) {
+		/* reuse file descriptor found in the hash table */
+		newfd = cr_attach_get_file(file);
+	} else {
+		/* create new file pointer (and register in hash table) */
+		newfd = cr_read_fd_data(ctx, files, hh->objref);
+	}
+
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	cr_debug("newfd got %d wanted %d\n", newfd, hh->fd);
+
+	/* if newfd isn't desired fd then reposition it */
+	if (newfd != hh->fd) {
+		ret = sys_dup2(newfd, hh->fd);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	if (hh->close_on_exec)
+		set_close_on_exec(hh->fd, 1);
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_read_files(struct cr_ctx *ctx)
+{
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files = current->files;
+	int i, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FILES);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		goto out;
+#endif
+	cr_debug("objref %d nfds %d\n", hh->objref, hh->nfds);
+	if (hh->objref < 0 || hh->nfds < 0)
+		goto out;
+
+	if (hh->nfds > sysctl_nr_open) {
+		ret = -EMFILE;
+		goto out;
+	}
+
+	/* point of no return -- close all file descriptors */
+	ret = cr_close_all_fds(files);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < hh->nfds; i++) {
+		ret = cr_read_fd_ent(ctx, files, hh->objref);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 119090b..3649f9c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -88,6 +88,7 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);
 
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
+extern int cr_read_files(struct cr_ctx *ctx);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 09/13] Restore open file descriptors
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-12-05 17:31   ` [RFC v11][PATCH 01/13] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart Oren Laadan
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

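Restore open file descriptors: for each FD, read 'struct cr_hdr_fd_ent'
and look up its objref in the hash table. If it is not found (first
occurrence), read in 'struct cr_hdr_fd_data', open the file, register it
in the hash and attach it to a new FD. Otherwise attach the file pointer
from the hash to a new FD. Either way, the FD is then moved to its
original number and its close-on-exec flag is restored.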

This patch only handles basic FDs: regular files and directories.

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/restart.c       |    4 +
 checkpoint/rstr_file.c     |  248 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h |    1 +
 4 files changed, 254 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 7496695..88bbc10 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o ckpt_file.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index d90c28a..22e7995 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -261,6 +261,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_files(ctx);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..e06db81
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,248 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+static int cr_close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * cr_attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+/**
+ * cr_attach_get_file - attach (and get) lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_get_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		get_file(file);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
+
+/* cr_read_fd_data - restore the state of a given file pointer */
+static int
+cr_read_fd_data(struct cr_ctx *ctx, struct files_struct *files, int rparent)
+{
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int parent, ret;
+	int fd = 0;	/* pacify gcc warning */
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_DATA);
+	cr_debug("rparent %d parent %d flags %#x mode %#x how %d\n",
+		 rparent, parent, hh->f_flags, hh->f_mode, hh->fd_type);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+	if (parent != rparent)
+		goto out;
+
+	/* FIX: more sanity checks on f_flags, f_mode etc */
+
+	switch (hh->fd_type) {
+	case CR_FD_FILE:
+	case CR_FD_DIR:
+		file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
+		break;
+	default:
+		goto out;
+	}
+
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* adding <objref,file> to the hash will keep a reference to it */
+	ret = cr_obj_add_ref(ctx, file, parent, CR_OBJ_FILE, 0);
+	if (ret < 0) {
+		filp_close(file, NULL);
+		goto out;
+	}
+
+	fd = cr_attach_file(file);	/* no need to cleanup 'file' below */
+	if (fd < 0) {
+		ret = fd;
+		filp_close(file, NULL);
+		goto out;
+	}
+
+	ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
+	if (ret < 0)
+		goto out;
+	ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
+	if (ret == -ESPIPE)	/* ignore error on non-seekable files */
+		ret = 0;
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret < 0 ? ret : fd;
+}
+
+/**
+ * cr_read_fd_ent - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @rparent: expected parent objref
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls cr_read_fd_data to restore the file too.
+ */
+static int
+cr_read_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int rparent)
+{
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int newfd, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_ENT);
+	cr_debug("rparent %d parent %d ref %d fd %d c.o.e %d\n",
+		 rparent, parent, hh->objref, hh->fd, hh->close_on_exec);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+	if (parent != rparent)
+		goto out;
+	if (hh->objref <= 0)
+		goto out;
+
+	file = cr_obj_get_by_ref(ctx, hh->objref, CR_OBJ_FILE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	if (file) {
+		/* reuse file pointer found in the hash table */
+		newfd = cr_attach_get_file(file);
+	} else {
+		/* create new file pointer (and register in hash table) */
+		newfd = cr_read_fd_data(ctx, files, hh->objref);
+	}
+
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	cr_debug("newfd got %d wanted %d\n", newfd, hh->fd);
+
+	/* if newfd isn't desired fd then reposition it */
+	if (newfd != hh->fd) {
+		ret = sys_dup2(newfd, hh->fd);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	if (hh->close_on_exec)
+		set_close_on_exec(hh->fd, 1);
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_read_files(struct cr_ctx *ctx)
+{
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files = current->files;
+	int i, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FILES);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		goto out;
+#endif
+	cr_debug("objref %d nfds %d\n", hh->objref, hh->nfds);
+	if (hh->objref < 0 || hh->nfds < 0)
+		goto out;
+
+	if (hh->nfds > sysctl_nr_open) {
+		ret = -EMFILE;
+		goto out;
+	}
+
+	/* point of no return -- close all file descriptors */
+	ret = cr_close_all_fds(files);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < hh->nfds; i++) {
+		ret = cr_read_fd_ent(ctx, files, hh->objref);
+		if (ret < 0)
+			goto out;	/* propagate the error */
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 119090b..3649f9c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -88,6 +88,7 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);
 
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
+extern int cr_read_files(struct cr_ctx *ctx);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
-- 
1.5.4.3


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 09/13] Restore open file descriptors
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy

Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
and look up objref in the hash table; if not found (first occurrence), read
in 'struct cr_hdr_fd_data', create a new FD and register in the hash.
Otherwise attach the file pointer from the hash as an FD.

This patch only handles basic FDs - regular files, directories and also
symbolic links.
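
The lookup-or-create flow described above can be modelled in userspace.
The following is only a minimal sketch, not the patch's code: 'struct
fd_ent', 'seen[]' and restore_fd() are hypothetical names, a plain array
stands in for the objref hash, and dup() stands in for attaching an
already restored file pointer. It assumes the target descriptor numbers
are free, much as the patch closes all descriptors first.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_OBJREF 16

/* Hypothetical image record, loosely modelled on struct cr_hdr_fd_ent. */
struct fd_ent {
	int objref;		/* identifies the open file; must be > 0 */
	int fd;			/* descriptor number to restore */
	const char *path;	/* only consulted on first occurrence */
};

/* objref -> (restored fd + 1); zero-initialised, so 0 means "unseen" */
static int seen[MAX_OBJREF];

/* Restore one descriptor; returns 0 on success, -1 on failure. */
static int restore_fd(const struct fd_ent *e)
{
	int newfd;

	assert(e != NULL);
	if (e->objref <= 0 || e->objref >= MAX_OBJREF || e->fd < 0)
		return -1;

	if (seen[e->objref]) {
		/* seen before: share the already restored open file */
		newfd = dup(seen[e->objref] - 1);
	} else {
		/* first occurrence: open the file */
		newfd = open(e->path, O_RDONLY);
	}
	if (newfd < 0)
		return -1;

	/* if newfd isn't the desired fd then reposition it */
	if (newfd != e->fd) {
		if (dup2(newfd, e->fd) < 0) {
			close(newfd);
			return -1;
		}
		close(newfd);
	}

	/* register the first occurrence so later entries can share it */
	if (!seen[e->objref])
		seen[e->objref] = e->fd + 1;
	return 0;
}
```

Two entries carrying the same objref end up sharing one open file (and
therefore one file position), while a different objref gets an
independent open of its own, which is the property the hash lookup is
there to preserve.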

Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/restart.c       |    4 +
 checkpoint/rstr_file.c     |  248 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h |    1 +
 4 files changed, 254 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 7496695..88bbc10 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o ckpt_file.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index d90c28a..22e7995 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -261,6 +261,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	cr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_files(ctx);
+	cr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	cr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..e06db81
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,248 @@
+/*
+ *  Restart file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+static int cr_close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * cr_attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+/**
+ * cr_attach_get_file - attach (and get) lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_get_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		get_file(file);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
+
+/* cr_read_fd_data - restore the state of a given file pointer */
+static int
+cr_read_fd_data(struct cr_ctx *ctx, struct files_struct *files, int rparent)
+{
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int parent, ret;
+	int fd = 0;	/* pacify gcc warning */
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_DATA);
+	cr_debug("rparent %d parent %d flags %#x mode %#x how %d\n",
+		 rparent, parent, hh->f_flags, hh->f_mode, hh->fd_type);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+	if (parent != rparent)
+		goto out;
+
+	/* FIX: more sanity checks on f_flags, f_mode etc */
+
+	switch (hh->fd_type) {
+	case CR_FD_FILE:
+	case CR_FD_DIR:
+		file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
+		break;
+	default:
+		goto out;
+	}
+
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* adding <objref,file> to the hash will keep a reference to it */
+	ret = cr_obj_add_ref(ctx, file, parent, CR_OBJ_FILE, 0);
+	if (ret < 0) {
+		filp_close(file, NULL);
+		goto out;
+	}
+
+	fd = cr_attach_file(file);	/* no need to cleanup 'file' below */
+	if (fd < 0) {
+		ret = fd;
+		filp_close(file, NULL);
+		goto out;
+	}
+
+	ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
+	if (ret < 0)
+		goto out;
+	ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
+	if (ret == -ESPIPE)	/* ignore error on non-seekable files */
+		ret = 0;
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret < 0 ? ret : fd;
+}
+
+/**
+ * cr_read_fd_ent - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @rparent: expected parent objref
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls cr_read_fd_data to restore the file too.
+ */
+static int
+cr_read_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int rparent)
+{
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int newfd, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_ENT);
+	cr_debug("rparent %d parent %d ref %d fd %d c.o.e %d\n",
+		 rparent, parent, hh->objref, hh->fd, hh->close_on_exec);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+	if (parent != rparent)
+		goto out;
+	if (hh->objref <= 0)
+		goto out;
+
+	file = cr_obj_get_by_ref(ctx, hh->objref, CR_OBJ_FILE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	if (file) {
+		/* reuse file pointer found in the hash table */
+		newfd = cr_attach_get_file(file);
+	} else {
+		/* create new file pointer (and register in hash table) */
+		newfd = cr_read_fd_data(ctx, files, hh->objref);
+	}
+
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	cr_debug("newfd got %d wanted %d\n", newfd, hh->fd);
+
+	/* if newfd isn't desired fd then reposition it */
+	if (newfd != hh->fd) {
+		ret = sys_dup2(newfd, hh->fd);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	if (hh->close_on_exec)
+		set_close_on_exec(hh->fd, 1);
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_read_files(struct cr_ctx *ctx)
+{
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files = current->files;
+	int i, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FILES);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		goto out;
+#endif
+	cr_debug("objref %d nfds %d\n", hh->objref, hh->nfds);
+	if (hh->objref < 0 || hh->nfds < 0)
+		goto out;
+
+	if (hh->nfds > sysctl_nr_open) {
+		ret = -EMFILE;
+		goto out;
+	}
+
+	/* point of no return -- close all file descriptors */
+	ret = cr_close_all_fds(files);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < hh->nfds; i++) {
+		ret = cr_read_fd_ent(ctx, files, hh->objref);
+		if (ret < 0)
+			goto out;	/* propagate the error */
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 119090b..3649f9c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -88,6 +88,7 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);
 
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
+extern int cr_read_files(struct cr_ctx *ctx);
 
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 10/13] External checkpoint of a task other than ourself
  2008-12-05 17:31 ` Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  -1 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Now we can do "external" checkpoint, i.e. act on another task.

sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints that corresponding task. That task should be the root of
a container.

sys_restart() remains the same, as the restart is always done in the
context of the restarting task.
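
The reference handling this implies (pin the target task, check access,
pin its nsproxy, and drop everything again on failure or when the
context is freed) can be sketched in userspace as below. This is an
illustration under stated assumptions, not the kernel code: the structs,
find_task(), get_container() and ctx_release() are hypothetical
stand-ins, plain counters replace the kernel's reference counts, and no
locking is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's reference-counted objects. */
struct nsp {
	int refs;
};

struct task {
	int pid;
	int refs;
	int may_access;		/* models the ptrace_may_access() check */
	struct nsp *nsproxy;	/* NULL models a task without an nsproxy */
};

struct ctx {
	struct task *root_task;
	struct nsp *root_nsproxy;
};

static struct task *find_task(struct task *tasks, size_t n, int pid)
{
	for (size_t i = 0; i < n; i++)
		if (tasks[i].pid == pid)
			return &tasks[i];
	return NULL;
}

/* Pin the target task and its nsproxy in ctx, or pin nothing at all. */
static int get_container(struct ctx *ctx, struct task *tasks, size_t n,
			 int pid)
{
	struct task *t = find_task(tasks, n, pid);
	int err = -ESRCH;

	assert(ctx != NULL);
	if (!t)
		return err;
	t->refs++;			/* models get_task_struct() */

	if (!t->may_access) {
		err = -EPERM;
		goto out;
	}
	if (!t->nsproxy)
		goto out;
	t->nsproxy->refs++;		/* models get_nsproxy() */

	ctx->root_task = t;
	ctx->root_nsproxy = t->nsproxy;
	return 0;
out:
	t->refs--;			/* models put_task_struct() */
	return err;
}

/* Drop whatever get_container() pinned; safe on an empty ctx. */
static void ctx_release(struct ctx *ctx)
{
	if (ctx->root_nsproxy)
		ctx->root_nsproxy->refs--;	/* models put_nsproxy() */
	if (ctx->root_task)
		ctx->root_task->refs--;		/* models put_task_struct() */
	ctx->root_nsproxy = NULL;
	ctx->root_task = NULL;
}
```

The point of the shape is that every failure path leaves the counts
exactly as it found them, so the release step only has to undo a fully
successful lookup.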

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them

Changelog[v10]:
  - Grab vfs root of container init, rather than current process

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/checkpoint.c    |   72 ++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/restart.c       |    4 +-
 checkpoint/sys.c           |    6 ++++
 include/linux/checkpoint.h |    2 +
 4 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 75c7cd3..f2d91f8 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/ptrace.h>
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -225,6 +226,13 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
+	/* TODO: verify that the task is frozen (unless self) */
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task may not be in state TASK_DEAD\n");
+		return -EAGAIN;
+	}
+
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
@@ -247,22 +255,82 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task = NULL;
+	struct nsproxy *nsproxy = NULL;
+	int err = -ESRCH;
+
+	ctx->root_pid = pid;
+
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+
+	if (!task)
+		goto out;
+
+#if 0	/* enable to use containers */
+	if (!is_container_init(task)) {
+		err = -EINVAL;
+		goto out;
+	}
+#endif
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	rcu_read_lock();
+	if (task_nsproxy(task)) {
+		nsproxy = task_nsproxy(task);
+		get_nsproxy(nsproxy);
+	}
+	rcu_read_unlock();
+
+	if (!nsproxy)
+		goto out;
+
+	/* TODO: verify that the container is frozen */
+
+	ctx->root_task = task;
+	ctx->root_nsproxy = nsproxy;
+
+	return 0;
+
+ out:
+	if (task)
+		put_task_struct(task);
+	return err;
+}
+
+/* setup checkpoint-specific parts of ctx */
 static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	struct fs_struct *fs;
+	int ret;
 
 	ctx->root_pid = pid;
 
+	ret = cr_get_container(ctx, pid);
+	if (ret < 0)
+		return ret;
+
 	/*
 	 * assume checkpointer is in container's root vfs
 	 * FIXME: this works for now, but will change with real containers
 	 */
 
-	fs = current->fs;
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
 	read_lock(&fs->lock);
 	ctx->fs_mnt = fs->root;
 	path_get(&ctx->fs_mnt);
 	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
 
 	return 0;
 }
@@ -277,7 +345,7 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_write_task(ctx, current);
+	ret = cr_write_task(ctx, ctx->root_task);
 	if (ret < 0)
 		goto out;
 	ret = cr_write_tail(ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 22e7995..f4f737d 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -277,7 +277,7 @@ static int cr_read_task(struct cr_ctx *ctx)
 }
 
 /* setup restart-specific parts of ctx */
-static int cr_ctx_restart(struct cr_ctx *ctx)
+static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	return 0;
 }
@@ -286,7 +286,7 @@ int do_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
-	ret = cr_ctx_restart(ctx);
+	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
 	ret = cr_read_head(ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index c077cd9..7083fff 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -9,6 +9,7 @@
  */
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -141,6 +142,11 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
+	if (ctx->root_nsproxy)
+		put_nsproxy(ctx->root_nsproxy);
+	if (ctx->root_task)
+		put_task_struct(ctx->root_task);
+
 	kfree(ctx);
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3649f9c..3c29f8e 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -19,6 +19,8 @@ struct cr_ctx {
 	int crid;		/* unique checkpoint id */
 
 	pid_t root_pid;		/* container identifier */
+	struct task_struct *root_task;	/* container root task */
+	struct nsproxy *root_nsproxy;	/* container root nsproxy */
 
 	unsigned long flags;
 	unsigned long oflags;	/* restart: old flags */
-- 
1.5.4.3


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 10/13] External checkpoint of a task other than ourself
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2008-12-05 17:31   ` [RFC v11][PATCH 10/13] External checkpoint of a task other than ourself Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 11/13] Track in-kernel when we expect checkpoint/restart to work Oren Laadan
                     ` (6 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd-r2nGTMty4D4,
	jeremy-TSDbQ3PG+2Y, Oren Laadan

Now we can do "external" checkpoint, i.e. act on another task.

sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints the corresponding task. That task should be the root of
a container.

sys_restart() remains the same, as the restart is always done in the
context of the restarting task.

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them

Changelog[v10]:
  - Grab vfs root of container init, rather than current process

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/checkpoint.c    |   72 ++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/restart.c       |    4 +-
 checkpoint/sys.c           |    6 ++++
 include/linux/checkpoint.h |    2 +
 4 files changed, 80 insertions(+), 4 deletions(-)
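
The reference handling in cr_get_container() may be easier to follow in
isolation. Below is a minimal userspace sketch of the same shape: look up
the target, take a reference, run the checks, and drop the reference on
any failure path. All model_* names are hypothetical stand-ins and not
kernel API; locking (tasklist_lock, RCU) is left out, and the
ptrace_may_access() check is reduced to a plain flag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for nsproxy and task_struct. */
struct model_nsproxy {
	int refs;
};

struct model_task {
	int pid;
	int refs;
	int readable;			/* stand-in for ptrace_may_access() */
	struct model_nsproxy *nsproxy;	/* NULL models an exiting task */
};

struct model_ctx {
	int root_pid;
	struct model_task *root_task;
	struct model_nsproxy *root_nsproxy;
};

static void model_put_task(struct model_task *t)
{
	assert(t->refs > 0);
	t->refs--;
}

/* Stand-in for find_task_by_vpid(): linear search of a small table. */
static struct model_task *model_find_task(struct model_task *tbl, size_t n,
					  int pid)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i].pid == pid)
			return &tbl[i];
	return NULL;
}

/* Same control flow as cr_get_container(), minus the locking. */
static int model_get_container(struct model_ctx *ctx,
			       struct model_task *tbl, size_t n, int pid)
{
	struct model_task *task;
	int err = -ESRCH;

	ctx->root_pid = pid;

	task = model_find_task(tbl, n, pid);
	if (!task)
		return err;
	task->refs++;			/* models get_task_struct() */

	if (!task->readable) {
		err = -EPERM;
		goto out;
	}
	if (!task->nsproxy)
		goto out;		/* err is still -ESRCH */
	task->nsproxy->refs++;		/* models get_nsproxy() */

	/* On success the references are handed over to the context. */
	ctx->root_task = task;
	ctx->root_nsproxy = task->nsproxy;
	return 0;

 out:
	model_put_task(task);
	return err;
}

/* Mirrors the two put calls added to cr_ctx_free(). */
static void model_ctx_release(struct model_ctx *ctx)
{
	if (ctx->root_nsproxy) {
		assert(ctx->root_nsproxy->refs > 0);
		ctx->root_nsproxy->refs--;
	}
	if (ctx->root_task)
		model_put_task(ctx->root_task);
	ctx->root_nsproxy = NULL;
	ctx->root_task = NULL;
}
```

In this model a failed lookup or permission check leaves every reference
count unchanged, while a successful call leaves exactly one extra
reference on the task and on its nsproxy until model_ctx_release() runs.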

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 75c7cd3..f2d91f8 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/ptrace.h>
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -225,6 +226,13 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
+	/* TODO: verify that the task is frozen (unless self) */
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task may not be in state TASK_DEAD\n");
+		return -EAGAIN;
+	}
+
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
@@ -247,22 +255,82 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task = NULL;
+	struct nsproxy *nsproxy = NULL;
+	int err = -ESRCH;
+
+	ctx->root_pid = pid;
+
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+
+	if (!task)
+		goto out;
+
+#if 0	/* enable to use containers */
+	if (!is_container_init(task)) {
+		err = -EINVAL;
+		goto out;
+	}
+#endif
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	rcu_read_lock();
+	if (task_nsproxy(task)) {
+		nsproxy = task_nsproxy(task);
+		get_nsproxy(nsproxy);
+	}
+	rcu_read_unlock();
+
+	if (!nsproxy)
+		goto out;
+
+	/* TODO: verify that the container is frozen */
+
+	ctx->root_task = task;
+	ctx->root_nsproxy = nsproxy;
+
+	return 0;
+
+ out:
+	if (task)
+		put_task_struct(task);
+	return err;
+}
+
+/* setup checkpoint-specific parts of ctx */
 static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	struct fs_struct *fs;
+	int ret;
 
 	ctx->root_pid = pid;
 
+	ret = cr_get_container(ctx, pid);
+	if (ret < 0)
+		return ret;
+
 	/*
 	 * assume checkpointer is in container's root vfs
 	 * FIXME: this works for now, but will change with real containers
 	 */
 
-	fs = current->fs;
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
 	read_lock(&fs->lock);
 	ctx->fs_mnt = fs->root;
 	path_get(&ctx->fs_mnt);
 	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
 
 	return 0;
 }
@@ -277,7 +345,7 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_write_task(ctx, current);
+	ret = cr_write_task(ctx, ctx->root_task);
 	if (ret < 0)
 		goto out;
 	ret = cr_write_tail(ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 22e7995..f4f737d 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -277,7 +277,7 @@ static int cr_read_task(struct cr_ctx *ctx)
 }
 
 /* setup restart-specific parts of ctx */
-static int cr_ctx_restart(struct cr_ctx *ctx)
+static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	return 0;
 }
@@ -286,7 +286,7 @@ int do_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
-	ret = cr_ctx_restart(ctx);
+	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
 	ret = cr_read_head(ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index c077cd9..7083fff 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -9,6 +9,7 @@
  */
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -141,6 +142,11 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
+	if (ctx->root_nsproxy)
+		put_nsproxy(ctx->root_nsproxy);
+	if (ctx->root_task)
+		put_task_struct(ctx->root_task);
+
 	kfree(ctx);
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3649f9c..3c29f8e 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -19,6 +19,8 @@ struct cr_ctx {
 	int crid;		/* unique checkpoint id */
 
 	pid_t root_pid;		/* container identifier */
+	struct task_struct *root_task;	/* container root task */
+	struct nsproxy *root_nsproxy;	/* container root nsproxy */
 
 	unsigned long flags;
 	unsigned long oflags;	/* restart: old flags */
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 10/13] External checkpoint of a task other than ourself
  2008-12-05 17:31 ` Oren Laadan
                   ` (10 preceding siblings ...)
  (?)
@ 2008-12-05 17:31 ` Oren Laadan
  -1 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Now we can do "external" checkpoint, i.e. act on another task.

sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints the corresponding task. That task should be the root of
a container.

sys_restart() remains the same, as the restart is always done in the
context of the restarting task.

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them

Changelog[v10]:
  - Grab vfs root of container init, rather than current process

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/checkpoint.c    |   72 ++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/restart.c       |    4 +-
 checkpoint/sys.c           |    6 ++++
 include/linux/checkpoint.h |    2 +
 4 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 75c7cd3..f2d91f8 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/ptrace.h>
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -225,6 +226,13 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
+	/* TODO: verify that the task is frozen (unless self) */
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task may not be in state TASK_DEAD\n");
+		return -EAGAIN;
+	}
+
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
@@ -247,22 +255,82 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task = NULL;
+	struct nsproxy *nsproxy = NULL;
+	int err = -ESRCH;
+
+	ctx->root_pid = pid;
+
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+
+	if (!task)
+		goto out;
+
+#if 0	/* enable to use containers */
+	if (!is_container_init(task)) {
+		err = -EINVAL;
+		goto out;
+	}
+#endif
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	rcu_read_lock();
+	if (task_nsproxy(task)) {
+		nsproxy = task_nsproxy(task);
+		get_nsproxy(nsproxy);
+	}
+	rcu_read_unlock();
+
+	if (!nsproxy)
+		goto out;
+
+	/* TODO: verify that the container is frozen */
+
+	ctx->root_task = task;
+	ctx->root_nsproxy = nsproxy;
+
+	return 0;
+
+ out:
+	if (task)
+		put_task_struct(task);
+	return err;
+}
+
+/* setup checkpoint-specific parts of ctx */
 static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	struct fs_struct *fs;
+	int ret;
 
 	ctx->root_pid = pid;
 
+	ret = cr_get_container(ctx, pid);
+	if (ret < 0)
+		return ret;
+
 	/*
 	 * assume checkpointer is in container's root vfs
 	 * FIXME: this works for now, but will change with real containers
 	 */
 
-	fs = current->fs;
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
 	read_lock(&fs->lock);
 	ctx->fs_mnt = fs->root;
 	path_get(&ctx->fs_mnt);
 	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
 
 	return 0;
 }
@@ -277,7 +345,7 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_write_task(ctx, current);
+	ret = cr_write_task(ctx, ctx->root_task);
 	if (ret < 0)
 		goto out;
 	ret = cr_write_tail(ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 22e7995..f4f737d 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -277,7 +277,7 @@ static int cr_read_task(struct cr_ctx *ctx)
 }
 
 /* setup restart-specific parts of ctx */
-static int cr_ctx_restart(struct cr_ctx *ctx)
+static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	return 0;
 }
@@ -286,7 +286,7 @@ int do_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
-	ret = cr_ctx_restart(ctx);
+	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
 	ret = cr_read_head(ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index c077cd9..7083fff 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -9,6 +9,7 @@
  */
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -141,6 +142,11 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
+	if (ctx->root_nsproxy)
+		put_nsproxy(ctx->root_nsproxy);
+	if (ctx->root_task)
+		put_task_struct(ctx->root_task);
+
 	kfree(ctx);
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3649f9c..3c29f8e 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -19,6 +19,8 @@ struct cr_ctx {
 	int crid;		/* unique checkpoint id */
 
 	pid_t root_pid;		/* container identifier */
+	struct task_struct *root_task;	/* container root task */
+	struct nsproxy *root_nsproxy;	/* container root nsproxy */
 
 	unsigned long flags;
 	unsigned long oflags;	/* restart: old flags */
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 10/13] External checkpoint of a task other than ourself
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy

Now we can do "external" checkpoint, i.e. act on another task.

sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints the corresponding task. That task should be the root of
a container.

sys_restart() remains the same, as the restart is always done in the
context of the restarting task.

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them

Changelog[v10]:
  - Grab vfs root of container init, rather than current process

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/checkpoint.c    |   72 ++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/restart.c       |    4 +-
 checkpoint/sys.c           |    6 ++++
 include/linux/checkpoint.h |    2 +
 4 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 75c7cd3..f2d91f8 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/ptrace.h>
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -225,6 +226,13 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
+	/* TODO: verify that the task is frozen (unless self) */
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task may not be in state TASK_DEAD\n");
+		return -EAGAIN;
+	}
+
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
@@ -247,22 +255,82 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task = NULL;
+	struct nsproxy *nsproxy = NULL;
+	int err = -ESRCH;
+
+	ctx->root_pid = pid;
+
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+
+	if (!task)
+		goto out;
+
+#if 0	/* enable to use containers */
+	if (!is_container_init(task)) {
+		err = -EINVAL;
+		goto out;
+	}
+#endif
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	rcu_read_lock();
+	if (task_nsproxy(task)) {
+		nsproxy = task_nsproxy(task);
+		get_nsproxy(nsproxy);
+	}
+	rcu_read_unlock();
+
+	if (!nsproxy)
+		goto out;
+
+	/* TODO: verify that the container is frozen */
+
+	ctx->root_task = task;
+	ctx->root_nsproxy = nsproxy;
+
+	return 0;
+
+ out:
+	if (task)
+		put_task_struct(task);
+	return err;
+}
+
+/* setup checkpoint-specific parts of ctx */
 static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	struct fs_struct *fs;
+	int ret;
 
 	ctx->root_pid = pid;
 
+	ret = cr_get_container(ctx, pid);
+	if (ret < 0)
+		return ret;
+
 	/*
 	 * assume checkpointer is in container's root vfs
 	 * FIXME: this works for now, but will change with real containers
 	 */
 
-	fs = current->fs;
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
 	read_lock(&fs->lock);
 	ctx->fs_mnt = fs->root;
 	path_get(&ctx->fs_mnt);
 	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
 
 	return 0;
 }
@@ -277,7 +345,7 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_write_task(ctx, current);
+	ret = cr_write_task(ctx, ctx->root_task);
 	if (ret < 0)
 		goto out;
 	ret = cr_write_tail(ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 22e7995..f4f737d 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -277,7 +277,7 @@ static int cr_read_task(struct cr_ctx *ctx)
 }
 
 /* setup restart-specific parts of ctx */
-static int cr_ctx_restart(struct cr_ctx *ctx)
+static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	return 0;
 }
@@ -286,7 +286,7 @@ int do_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
-	ret = cr_ctx_restart(ctx);
+	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
 	ret = cr_read_head(ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index c077cd9..7083fff 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -9,6 +9,7 @@
  */
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -141,6 +142,11 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
+	if (ctx->root_nsproxy)
+		put_nsproxy(ctx->root_nsproxy);
+	if (ctx->root_task)
+		put_task_struct(ctx->root_task);
+
 	kfree(ctx);
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3649f9c..3c29f8e 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -19,6 +19,8 @@ struct cr_ctx {
 	int crid;		/* unique checkpoint id */
 
 	pid_t root_pid;		/* container identifier */
+	struct task_struct *root_task;	/* container root task */
+	struct nsproxy *root_nsproxy;	/* container root nsproxy */
 
 	unsigned long flags;
 	unsigned long oflags;	/* restart: old flags */
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 11/13] Track in-kernel when we expect checkpoint/restart to work
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (10 preceding siblings ...)
  2008-12-05 17:31   ` Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 12/13] Checkpoint multiple processes Oren Laadan
                     ` (5 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

Suggested by Ingo.

Checkpoint/restart is going to be a long effort to get things working.
We're going to have a lot of things that we know just don't work for
a long time.  That doesn't mean that it will be useless; it just means
that there are some complicated features that we are going to have to
work incrementally to fix.

This patch introduces a new mechanism to help the checkpoint/restart
developers.  A new function pair, task_deny_checkpointing() and
process_deny_checkpointing(), is created.  When called, these tell the
kernel that we *know* that the process has performed some activity that
will keep it from being properly checkpointed.

The 'flag' is an atomic_t for now so that we can have some level
of atomicity and make sure to only warn once.

For now, this is a one-way trip.  Once a process is no longer
'may_checkpoint' capable, neither it nor its children ever will be.
This can, of course, be fixed up in the future.  We might want to
reset the flag when a new pid namespace is created, for instance.

Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c    |    6 ++++++
 include/linux/checkpoint.h |   33 ++++++++++++++++++++++++++++++++-
 include/linux/sched.h      |    4 ++++
 kernel/fork.c              |   10 ++++++++++
 4 files changed, 52 insertions(+), 1 deletions(-)
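
To make the flag semantics concrete, here is a small userspace model
using C11 atomics. It is only a sketch: the model_* names are
hypothetical, atomic_fetch_sub() stands in for the kernel's
atomic_dec_and_test(), and a counter replaces the printk/WARN_ON pair.
If the model is faithful, it shows one thing worth checking in the
patch: a second deny call takes the counter from 0 to -1, and a test of
the form !atomic_read() then reads the task as checkpointable again.
model_deny_sticky() is a variant that is not in the patch, shown only
for comparison.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the new task_struct field. */
struct model_task {
	atomic_int may_checkpoint;
	int warnings;		/* counts what would be printk + WARN_ON */
};

/* Models fork_init(): the first task starts out checkpointable. */
static void model_task_init(struct model_task *t)
{
	atomic_init(&t->may_checkpoint, 1);
	t->warnings = 0;
}

/* Models dup_task_struct(): the child copies the parent's value. */
static void model_task_fork(struct model_task *child,
			    struct model_task *parent)
{
	atomic_init(&child->may_checkpoint,
		    atomic_load(&parent->may_checkpoint));
	child->warnings = 0;
}

/* Models __task_deny_checkpointing(): warn only on the 1 -> 0 step. */
static void model_deny(struct model_task *t)
{
	if (atomic_fetch_sub(&t->may_checkpoint, 1) != 1)
		return;
	t->warnings++;
}

/* Variant (not in the patch): force the flag to 0, so it cannot wrap. */
static void model_deny_sticky(struct model_task *t)
{
	if (atomic_exchange(&t->may_checkpoint, 0) != 1)
		return;
	t->warnings++;
}

/* Models the test in cr_write_task(): any non-zero value passes. */
static bool model_may_checkpoint(struct model_task *t)
{
	return atomic_load(&t->may_checkpoint) != 0;
}

void model_demo(void)
{
	struct model_task parent, child;

	model_task_init(&parent);
	assert(model_may_checkpoint(&parent));

	model_deny(&parent);			/* 1 -> 0, warns once */
	assert(!model_may_checkpoint(&parent));
	assert(parent.warnings == 1);

	model_task_fork(&child, &parent);	/* child inherits 0 */
	assert(!model_may_checkpoint(&child));

	model_deny(&parent);			/* 0 -> -1, silent */
	assert(parent.warnings == 1);
	assert(model_may_checkpoint(&parent));	/* check passes again */

	model_deny_sticky(&child);		/* stays at 0 */
	model_deny_sticky(&child);
	assert(!model_may_checkpoint(&child));
}
```

The sticky variant keeps the "warn once" property because only the call
that observes the old value 1 reports anything.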

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index f2d91f8..e8e352f 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -233,6 +233,12 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 		return -EAGAIN;
 	}
 
+	if (!atomic_read(&t->may_checkpoint)) {
+		pr_warning("c/r: task %d may not checkpoint\n",
+			   task_pid_vnr(t));
+		return -EBUSY;
+	}
+
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3c29f8e..c97f608 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,8 +10,11 @@
  *  distribution for more details.
  */
 
-#include <linux/path.h>
 #include <linux/fs.h>
+#include <linux/path.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_CHECKPOINT_RESTART
 
 #define CR_VERSION  2
 
@@ -95,4 +98,32 @@ extern int cr_read_files(struct cr_ctx *ctx);
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
 
+static inline void __task_deny_checkpointing(struct task_struct *task,
+		char *file, int line)
+{
+	if (!atomic_dec_and_test(&task->may_checkpoint))
+		return;
+	printk(KERN_INFO "process performed an action that can not be "
+			"checkpointed at: %s:%d\n", file, line);
+	WARN_ON(1);
+}
+#define process_deny_checkpointing(p)  \
+	__task_deny_checkpointing(p, __FILE__, __LINE__)
+
+/*
+ * For now, we're not going to have a distinction between
+ * tasks and processes for the purpose of c/r.  But, allow
+ * these two calls anyway to make new users at least think
+ * about it.
+ */
+#define task_deny_checkpointing(p)  \
+	__task_deny_checkpointing(p, __FILE__, __LINE__)
+
+#else
+
+static inline void task_deny_checkpointing(struct task_struct *task) {}
+static inline void process_deny_checkpointing(struct task_struct *task) {}
+
+#endif
+
 #endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..faa2ec6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1356,6 +1356,10 @@ struct task_struct {
 	unsigned long default_timer_slack_ns;
 
 	struct list_head	*scm_work_list;
+
+#ifdef CONFIG_CHECKPOINT_RESTART
+	atomic_t may_checkpoint;
+#endif
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 2a372a0..72df853 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -196,6 +196,13 @@ void __init fork_init(unsigned long mempages)
 	init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
 	init_task.signal->rlim[RLIMIT_SIGPENDING] =
 		init_task.signal->rlim[RLIMIT_NPROC];
+
+#ifdef CONFIG_CHECKPOINT_RESTART
+	/*
+	 * This probably won't stay set for long...
+	 */
+	atomic_set(&init_task.may_checkpoint, 1);
+#endif
 }
 
 int __attribute__((weak)) arch_dup_task_struct(struct task_struct *dst,
@@ -246,6 +253,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	tsk->btrace_seq = 0;
 #endif
 	tsk->splice_pipe = NULL;
+#ifdef CONFIG_CHECKPOINT_RESTART
+	atomic_set(&tsk->may_checkpoint, atomic_read(&orig->may_checkpoint));
+#endif
 	return tsk;
 
 out:
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 11/13] Track in-kernel when we expect checkpoint/restart to work
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-12-05 17:31   ` [RFC v11][PATCH 01/13] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart Oren Laadan
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

From: Dave Hansen <dave@linux.vnet.ibm.com>

Suggested by Ingo.

Checkpoint/restart is going to be a long effort to get things working.
We're going to have a lot of things that we know just don't work for
a long time.  That doesn't mean that it will be useless; it just means
that there are some complicated features that we are going to have to
work incrementally to fix.

This patch introduces a new mechanism to help the checkpoint/restart
developers.  A new function pair, task_deny_checkpointing() and
process_deny_checkpointing(), is created.  When called, these tell the
kernel that we *know* that the process has performed some activity that
will keep it from being properly checkpointed.

The 'flag' is an atomic_t for now so that we can have some level
of atomicity and make sure to only warn once.

For now, this is a one-way trip.  Once a process is no longer
'may_checkpoint' capable, neither it nor its children ever will be.
This can, of course, be fixed up in the future.  We might want to
reset the flag when a new pid namespace is created, for instance.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c    |    6 ++++++
 include/linux/checkpoint.h |   33 ++++++++++++++++++++++++++++++++-
 include/linux/sched.h      |    4 ++++
 kernel/fork.c              |   10 ++++++++++
 4 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index f2d91f8..e8e352f 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -233,6 +233,12 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 		return -EAGAIN;
 	}
 
+	if (!atomic_read(&t->may_checkpoint)) {
+		pr_warning("c/r: task %d may not checkpoint\n",
+			   task_pid_vnr(t));
+		return -EBUSY;
+	}
+
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3c29f8e..c97f608 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,8 +10,11 @@
  *  distribution for more details.
  */
 
-#include <linux/path.h>
 #include <linux/fs.h>
+#include <linux/path.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_CHECKPOINT_RESTART
 
 #define CR_VERSION  2
 
@@ -95,4 +98,32 @@ extern int cr_read_files(struct cr_ctx *ctx);
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
 
+static inline void __task_deny_checkpointing(struct task_struct *task,
+		char *file, int line)
+{
+	if (!atomic_dec_and_test(&task->may_checkpoint))
+		return;
+	printk(KERN_INFO "process performed an action that can not be "
+			"checkpointed at: %s:%d\n", file, line);
+	WARN_ON(1);
+}
+#define process_deny_checkpointing(p)  \
+	__task_deny_checkpointing(p, __FILE__, __LINE__)
+
+/*
+ * For now, we're not going to have a distinction between
+ * tasks and processes for the purpose of c/r.  But, allow
+ * these two calls anyway to make new users at least think
+ * about it.
+ */
+#define task_deny_checkpointing(p)  \
+	__task_deny_checkpointing(p, __FILE__, __LINE__)
+
+#else
+
+static inline void task_deny_checkpointing(struct task_struct *task) {}
+static inline void process_deny_checkpointing(struct task_struct *task) {}
+
+#endif
+
 #endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..faa2ec6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1356,6 +1356,10 @@ struct task_struct {
 	unsigned long default_timer_slack_ns;
 
 	struct list_head	*scm_work_list;
+
+#ifdef CONFIG_CHECKPOINT_RESTART
+	atomic_t may_checkpoint;
+#endif
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 2a372a0..72df853 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -196,6 +196,13 @@ void __init fork_init(unsigned long mempages)
 	init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
 	init_task.signal->rlim[RLIMIT_SIGPENDING] =
 		init_task.signal->rlim[RLIMIT_NPROC];
+
+#ifdef CONFIG_CHECKPOINT_RESTART
+	/*
+	 * This probably won't stay set for long...
+	 */
+	atomic_set(&init_task.may_checkpoint, 1);
+#endif
 }
 
 int __attribute__((weak)) arch_dup_task_struct(struct task_struct *dst,
@@ -246,6 +253,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	tsk->btrace_seq = 0;
 #endif
 	tsk->splice_pipe = NULL;
+#ifdef CONFIG_CHECKPOINT_RESTART
+	atomic_set(&tsk->may_checkpoint, atomic_read(&orig->may_checkpoint));
+#endif
 	return tsk;
 
 out:
-- 
1.5.4.3


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 11/13] Track in-kernel when we expect checkpoint/restart to work
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd-r2nGTMty4D4,
	jeremy-TSDbQ3PG+2Y, Oren Laadan

From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

Suggested by Ingo.

Checkpoint/restart is going to be a long effort to get things working.
We're going to have a lot of things that we know just don't work for
a long time.  That doesn't mean that it will be useless; it just means
that there are some complicated features that we are going to have to
work incrementally to fix.

This patch introduces a new mechanism to help the checkpoint/restart
developers.  A new function pair, task_deny_checkpointing() and
process_deny_checkpointing(), is created.  When called, these tell the
kernel that we *know* that the process has performed some activity that
will keep it from being properly checkpointed.

The 'flag' is an atomic_t for now so that we can have some level
of atomicity and make sure to only warn once.

For now, this is a one-way trip.  Once a process is no longer
'may_checkpoint' capable, neither it nor its children ever will be.
This can, of course, be fixed up in the future.  We might want to
reset the flag when a new pid namespace is created, for instance.

Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c    |    6 ++++++
 include/linux/checkpoint.h |   33 ++++++++++++++++++++++++++++++++-
 include/linux/sched.h      |    4 ++++
 kernel/fork.c              |   10 ++++++++++
 4 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index f2d91f8..e8e352f 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -233,6 +233,12 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 		return -EAGAIN;
 	}
 
+	if (!atomic_read(&t->may_checkpoint)) {
+		pr_warning("c/r: task %d may not checkpoint\n",
+			   task_pid_vnr(t));
+		return -EBUSY;
+	}
+
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3c29f8e..c97f608 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,8 +10,11 @@
  *  distribution for more details.
  */
 
-#include <linux/path.h>
 #include <linux/fs.h>
+#include <linux/path.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_CHECKPOINT_RESTART
 
 #define CR_VERSION  2
 
@@ -95,4 +98,32 @@ extern int cr_read_files(struct cr_ctx *ctx);
 #define cr_debug(fmt, args...)  \
 	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
 
+static inline void __task_deny_checkpointing(struct task_struct *task,
+		char *file, int line)
+{
+	if (!atomic_dec_and_test(&task->may_checkpoint))
+		return;
+	printk(KERN_INFO "process performed an action that can not be "
+			"checkpointed at: %s:%d\n", file, line);
+	WARN_ON(1);
+}
+#define process_deny_checkpointing(p)  \
+	__task_deny_checkpointing(p, __FILE__, __LINE__)
+
+/*
+ * For now, we're not going to have a distinction between
+ * tasks and processes for the purpose of c/r.  But, allow
+ * these two calls anyway to make new users at least think
+ * about it.
+ */
+#define task_deny_checkpointing(p)  \
+	__task_deny_checkpointing(p, __FILE__, __LINE__)
+
+#else
+
+static inline void task_deny_checkpointing(struct task_struct *task) {}
+static inline void process_deny_checkpointing(struct task_struct *task) {}
+
+#endif
+
 #endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..faa2ec6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1356,6 +1356,10 @@ struct task_struct {
 	unsigned long default_timer_slack_ns;
 
 	struct list_head	*scm_work_list;
+
+#ifdef CONFIG_CHECKPOINT_RESTART
+	atomic_t may_checkpoint;
+#endif
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 2a372a0..72df853 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -196,6 +196,13 @@ void __init fork_init(unsigned long mempages)
 	init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
 	init_task.signal->rlim[RLIMIT_SIGPENDING] =
 		init_task.signal->rlim[RLIMIT_NPROC];
+
+#ifdef CONFIG_CHECKPOINT_RESTART
+	/*
+	 * This probably won't stay set for long...
+	 */
+	atomic_set(&init_task.may_checkpoint, 1);
+#endif
 }
 
 int __attribute__((weak)) arch_dup_task_struct(struct task_struct *dst,
@@ -246,6 +253,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	tsk->btrace_seq = 0;
 #endif
 	tsk->splice_pipe = NULL;
+#ifdef CONFIG_CHECKPOINT_RESTART
+	atomic_set(&tsk->may_checkpoint, atomic_read(&orig->may_checkpoint));
+#endif
 	return tsk;
 
 out:
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 12/13] Checkpoint multiple processes
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (11 preceding siblings ...)
  2008-12-05 17:31   ` [RFC v11][PATCH 11/13] Track in-kernel when we expect checkpoint/restart to work Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` [RFC v11][PATCH 13/13] Restart " Oren Laadan
                     ` (4 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linus Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Checkpointing of multiple processes works by recording the task tree
structure below a given task (usually the container init).

Starting from that task, do a DFS scan of the task tree and collect the
tasks into an array (keeping a reference to each task). Using DFS
simplifies the recreation of tasks in either user space or kernel space.
For each task collected, test whether it can be checkpointed, and save
its pid, tgid, and ppid.
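
In the diff below, the saved identifiers are emitted in fixed-size
chunks by cr_write_pids(). The following userspace sketch shows one way
to stage such records through a bounded buffer. The demo_* names, the
in-memory sink and the chunk size of 4 are assumptions made for
illustration; only the three record fields mirror struct cr_hdr_pids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CHUNK 4	/* small for illustration; the patch uses 256 */

/* Same three fields as struct cr_hdr_pids in the patch. */
struct demo_pids {
	int32_t vpid;
	int32_t vtgid;
	int32_t vppid;
};

/* Hypothetical sink standing in for the checkpoint image. */
struct demo_sink {
	struct demo_pids out[64];
	size_t used;
	int writes;
};

/* Append n records to the sink; fails if the sink would overflow. */
static int demo_write(struct demo_sink *s, const struct demo_pids *buf,
		      size_t n)
{
	if (s->used + n > sizeof(s->out) / sizeof(s->out[0]))
		return -1;
	memcpy(&s->out[s->used], buf, n * sizeof(*buf));
	s->used += n;
	s->writes++;
	return 0;
}

/* Emit 'total' records through a DEMO_CHUNK-sized staging buffer. */
static int demo_write_pids(struct demo_sink *s, const struct demo_pids *src,
			   size_t total)
{
	struct demo_pids chunk[DEMO_CHUNK];
	size_t pos = 0;		/* index into the full source array */

	assert(s != NULL && (src != NULL || total == 0));

	while (pos < total) {
		size_t n = total - pos;
		size_t i;

		if (n > DEMO_CHUNK)
			n = DEMO_CHUNK;
		/* i indexes the staging buffer, pos + i the source */
		for (i = 0; i < n; i++)
			chunk[i] = src[pos + i];
		if (demo_write(s, chunk, n) < 0)
			return -1;
		pos += n;
	}
	return 0;
}
```

The staging buffer is indexed by the chunk-local offset, so it never
grows beyond DEMO_CHUNK entries regardless of how many tasks there are.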

The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.
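
The count-then-fill scheme can be sketched in userspace as follows. This
is a minimal illustration, not the kernel code: the demo_node structure
with parent, first-child and next-sibling pointers is a hypothetical
stand-in for the task lists, and no locking or reference counting is
shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node: parent, first child and next sibling links. */
struct demo_node {
	int id;
	struct demo_node *parent;
	struct demo_node *first_child;
	struct demo_node *next_sibling;
};

/*
 * Pre-order DFS below root without recursion or an explicit stack.
 * With arr == NULL it only counts; otherwise it also stores up to cap
 * nodes and returns -1 if the tree holds more than cap.
 */
static int demo_tree_walk(struct demo_node *root, struct demo_node **arr,
			  int cap)
{
	struct demo_node *node = root;
	int nr = 0;

	assert(root != NULL);

	for (;;) {
		if (arr) {
			if (nr == cap)
				return -1;	/* tree grew between passes */
			arr[nr] = node;
		}
		nr++;

		if (node->first_child) {	/* descend first */
			node = node->first_child;
			continue;
		}
		/* climb until a sibling exists or we are back at root */
		while (node != root && !node->next_sibling)
			node = node->parent;
		if (node == root)
			break;
		node = node->next_sibling;
	}
	return nr;
}

/* Pass 1 counts, then allocate, then pass 2 fills and re-checks. */
static struct demo_node **demo_build_array(struct demo_node *root, int *count)
{
	struct demo_node **arr;
	int n = demo_tree_walk(root, NULL, 0);

	arr = calloc((size_t)n, sizeof(*arr));
	if (!arr)
		return NULL;
	if (demo_tree_walk(root, arr, n) != n) {
		free(arr);	/* size changed between the passes */
		return NULL;
	}
	*count = n;
	return arr;
}
```

Because the same walker serves both passes, the second pass can detect a
tree that changed size after the count and fail instead of overrunning
the array.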

The logic is suitable for creation of processes during restart either
in userspace or by the kernel.

Currently we ignore threads and zombies, as well as session ids.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/checkpoint.c        |  228 +++++++++++++++++++++++++++++++++++++---
 checkpoint/sys.c               |   16 +++
 include/linux/checkpoint.h     |    3 +
 include/linux/checkpoint_hdr.h |   13 ++-
 4 files changed, 243 insertions(+), 17 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e8e352f..a406feb 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -226,19 +226,6 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
-	/* TODO: verity that the task is frozen (unless self) */
-
-	if (t->state == TASK_DEAD) {
-		pr_warning("c/r: task may not be in state TASK_DEAD\n");
-		return -EAGAIN;
-	}
-
-	if (!atomic_read(&t->may_checkpoint)) {
-		pr_warning("c/r: task %d may not checkpoint\n",
-			   task_pid_vnr(t));
-		return -EBUSY;
-	}
-
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
@@ -261,6 +248,205 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+/* dump all tasks in ctx->tasks_arr[] */
+static int cr_write_all_tasks(struct cr_ctx *ctx)
+{
+	int n, ret = 0;
+
+	for (n = 0; n < ctx->tasks_nr; n++) {
+		cr_debug("dumping task #%d\n", n);
+		ret = cr_write_task(ctx, ctx->tasks_arr[n]);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+
+static int cr_may_checkpoint_task(struct task_struct *t, struct cr_ctx *ctx)
+{
+	cr_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
+		return -EAGAIN;
+	}
+
+	if (!atomic_read(&t->may_checkpoint)) {
+		pr_warning("c/r: task %d uncheckpointable\n", task_pid_vnr(t));
+		return -EBUSY;
+	}
+
+	if (!ptrace_may_access(t, PTRACE_MODE_READ))
+		return -EPERM;
+
+	/* FIXME: verify that the task is frozen (unless self) */
+
+	/* FIXME: change this for nested containers */
+	if (task_nsproxy(t) != ctx->root_nsproxy)
+		return -EPERM;
+
+	return 0;
+}
+
+#define CR_HDR_PIDS_CHUNK	256
+
+static int cr_write_pids(struct cr_ctx *ctx)
+{
+	struct cr_hdr_pids *hh;
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	struct task_struct **tasks_arr;
+	int tasks_nr, n, ret = 0, pos = 0;
+
+	ns = ctx->root_nsproxy->pid_ns;
+	tasks_arr = ctx->tasks_arr;
+	tasks_nr = ctx->tasks_nr;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh) * CR_HDR_PIDS_CHUNK);
+
+	while (tasks_nr > 0) {
+		rcu_read_lock();
+		for (n = min(tasks_nr, CR_HDR_PIDS_CHUNK); n; n--) {
+			task = tasks_arr[pos];
+
+			/* is this task cool ? */
+			ret = cr_may_checkpoint_task(task, ctx);
+			if (ret < 0) {
+				rcu_read_unlock();
+				goto out;
+			}
+			hh[pos].vpid = task_pid_nr_ns(task, ns);
+			hh[pos].vtgid = task_tgid_nr_ns(task, ns);
+			hh[pos].vppid = task_tgid_nr_ns(task->real_parent, ns);
+			cr_debug("task[%d]: vpid %d vtgid %d parent %d\n", pos,
+				 hh[pos].vpid, hh[pos].vtgid, hh[pos].vppid);
+			pos++;
+		}
+		rcu_read_unlock();
+
+		n = min(tasks_nr, CR_HDR_PIDS_CHUNK);
+		ret = cr_kwrite(ctx, hh, n * sizeof(*hh));
+		if (ret < 0)
+			break;
+
+		tasks_nr -= n;
+	}
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* count number of tasks in tree (and optionally fill pid's in array) */
+static int cr_tree_count_tasks(struct cr_ctx *ctx)
+{
+	struct task_struct *root = ctx->root_task;
+	struct task_struct *task = root;
+	struct task_struct *parent = NULL;
+	struct task_struct **tasks_arr = ctx->tasks_arr;
+	int tasks_nr = ctx->tasks_nr;
+	int nr = 0;
+
+	read_lock(&tasklist_lock);
+
+	/* count tasks via DFS scan of the tree */
+	while (1) {
+		if (tasks_arr) {
+			/* unlikely, but ... */
+			if (nr == tasks_nr)
+				return -EBUSY;	/* cleanup in cr_ctx_free() */
+			tasks_arr[nr] = task;
+			get_task_struct(task);
+		}
+
+		nr++;
+
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+
+		if (task == root)
+			break;
+	}
+
+	read_unlock(&tasklist_lock);
+	return nr;
+}
+
+/*
+ * cr_build_tree - scan the tasks tree in DFS order and fill in array
+ * @ctx: checkpoint context
+ *
+ * Using DFS order simplifies the restart logic to re-create the tasks.
+ *
+ * On success, ctx->tasks_arr will be allocated and populated with all
+ * tasks (reference taken), and ctx->tasks_nr will hold the total count.
+ * The array is cleaned up by cr_ctx_free().
+ */
+static int cr_build_tree(struct cr_ctx *ctx)
+{
+	int n, m;
+
+	/* count tasks (no side effects) */
+	n = cr_tree_count_tasks(ctx);
+
+	ctx->tasks_nr = n;
+	ctx->tasks_arr = kzalloc(n * sizeof(*ctx->tasks_arr), GFP_KERNEL);
+	if (!ctx->tasks_arr)
+		return -ENOMEM;
+
+	/* count again (now will fill array) */
+	m = cr_tree_count_tasks(ctx);
+
+	/* unlikely, but ... (cleanup in cr_ctx_free) */
+	if (m < 0)
+		return m;
+	else if (m != n)
+		return -EBUSY;
+
+	return 0;
+}
+
+/* dump the array that describes the tasks tree */
+static int cr_write_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tree *hh;
+	int ret;
+
+	h.type = CR_HDR_TREE;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	hh->tasks_nr = ctx->tasks_nr;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	ret = cr_write_pids(ctx);
+	return ret;
+}
+
 static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task = NULL;
@@ -278,7 +464,7 @@ static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 	if (!task)
 		goto out;
 
-#if 0	/* enable to use containers */
+#if 0	/* enable with containers */
 	if (!is_container_init(task)) {
 		err = -EINVAL;
 		goto out;
@@ -300,7 +486,7 @@ static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 	if (!nsproxy)
 		goto out;
 
-	/* TODO: verify that the container is frozen */
+	/* FIXME: verify that the container is frozen */
 
 	ctx->root_task = task;
 	ctx->root_nsproxy = nsproxy;
@@ -348,12 +534,22 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 	ret = cr_ctx_checkpoint(ctx, pid);
 	if (ret < 0)
 		goto out;
+
+	ret = cr_build_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_write_task(ctx, ctx->root_task);
+	ret = cr_write_tree(ctx);
 	if (ret < 0)
 		goto out;
+
+	ret = cr_write_all_tasks(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_write_tail(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 7083fff..121c979 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -130,6 +130,19 @@ void cr_hbuf_put(struct cr_ctx *ctx, int n)
  * restart operation, and persists until the operation is completed.
  */
 
+static void cr_task_arr_free(struct cr_ctx *ctx)
+{
+	int n;
+
+	for (n = 0; n < ctx->tasks_nr; n++) {
+		if (ctx->tasks_arr[n]) {
+			put_task_struct(ctx->tasks_arr[n]);
+			ctx->tasks_arr[n] = NULL;
+		}
+	}
+	kfree(ctx->tasks_arr);
+}
+
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
 	if (ctx->file)
@@ -142,6 +155,9 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
+	if (ctx->tasks_arr)
+		cr_task_arr_free(ctx);
+
 	if (ctx->root_nsproxy)
 		put_nsproxy(ctx->root_nsproxy);
 	if (ctx->root_task)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index c97f608..2504717 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -34,6 +34,9 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct task_struct **tasks_arr;	/* array of all tasks in container */
+	int tasks_nr;			/* size of tasks array */
+
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 8c3b5b2..ed8b7fb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -36,7 +36,8 @@ enum {
 	CR_HDR_STRING,
 	CR_HDR_FNAME,
 
-	CR_HDR_TASK = 101,
+	CR_HDR_TREE = 101,
+	CR_HDR_TASK,
 	CR_HDR_THREAD,
 	CR_HDR_CPU,
 
@@ -72,6 +73,16 @@ struct cr_hdr_tail {
 	__u64 magic;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_tree {
+	__u32 tasks_nr;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pids {
+	__s32 vpid;
+	__s32 vtgid;
+	__s32 vppid;
+} __attribute__((aligned(8)));
+
 struct cr_hdr_task {
 	__u32 state;
 	__u32 exit_state;
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 12/13] Checkpoint multiple processes
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  Cc: containers, linux-kernel, linux-mm, linux-api, Linus Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Checkpointing of multiple processes works by recording the task tree
structure below a given task (usually the container init).

Starting from that task, do a DFS scan of the task tree and collect the
tasks into an array (keeping a reference to each task). Using DFS
simplifies the recreation of tasks in either user space or kernel space.
For each task collected, test whether it can be checkpointed, and save
its pid, tgid, and ppid.

The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.

The logic is suitable for creation of processes during restart either
in userspace or by the kernel.

Currently we ignore threads and zombies, as well as session ids.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/checkpoint.c        |  228 +++++++++++++++++++++++++++++++++++++---
 checkpoint/sys.c               |   16 +++
 include/linux/checkpoint.h     |    3 +
 include/linux/checkpoint_hdr.h |   13 ++-
 4 files changed, 243 insertions(+), 17 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e8e352f..a406feb 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -226,19 +226,6 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
-	/* TODO: verity that the task is frozen (unless self) */
-
-	if (t->state == TASK_DEAD) {
-		pr_warning("c/r: task may not be in state TASK_DEAD\n");
-		return -EAGAIN;
-	}
-
-	if (!atomic_read(&t->may_checkpoint)) {
-		pr_warning("c/r: task %d may not checkpoint\n",
-			   task_pid_vnr(t));
-		return -EBUSY;
-	}
-
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
@@ -261,6 +248,205 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+/* dump all tasks in ctx->tasks_arr[] */
+static int cr_write_all_tasks(struct cr_ctx *ctx)
+{
+	int n, ret = 0;
+
+	for (n = 0; n < ctx->tasks_nr; n++) {
+		cr_debug("dumping task #%d\n", n);
+		ret = cr_write_task(ctx, ctx->tasks_arr[n]);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+
+static int cr_may_checkpoint_task(struct task_struct *t, struct cr_ctx *ctx)
+{
+	cr_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
+		return -EAGAIN;
+	}
+
+	if (!atomic_read(&t->may_checkpoint)) {
+		pr_warning("c/r: task %d uncheckpointable\n", task_pid_vnr(t));
+		return -EBUSY;
+	}
+
+	if (!ptrace_may_access(t, PTRACE_MODE_READ))
+		return -EPERM;
+
+	/* FIXME: verify that the task is frozen (unless self) */
+
+	/* FIXME: change this for nested containers */
+	if (task_nsproxy(t) != ctx->root_nsproxy)
+		return -EPERM;
+
+	return 0;
+}
+
+#define CR_HDR_PIDS_CHUNK	256
+
+static int cr_write_pids(struct cr_ctx *ctx)
+{
+	struct cr_hdr_pids *hh;
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	struct task_struct **tasks_arr;
+	int tasks_nr, n, ret = 0, pos = 0;
+
+	ns = ctx->root_nsproxy->pid_ns;
+	tasks_arr = ctx->tasks_arr;
+	tasks_nr = ctx->tasks_nr;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh) * CR_HDR_PIDS_CHUNK);
+
+	while (tasks_nr > 0) {
+		rcu_read_lock();
+		for (n = min(tasks_nr, CR_HDR_PIDS_CHUNK); n; n--) {
+			task = tasks_arr[pos];
+
+			/* is this task cool ? */
+			ret = cr_may_checkpoint_task(task, ctx);
+			if (ret < 0) {
+				rcu_read_unlock();
+				goto out;
+			}
+			hh[pos].vpid = task_pid_nr_ns(task, ns);
+			hh[pos].vtgid = task_tgid_nr_ns(task, ns);
+			hh[pos].vppid = task_tgid_nr_ns(task->real_parent, ns);
+			cr_debug("task[%d]: vpid %d vtgid %d parent %d\n", pos,
+				 hh[pos].vpid, hh[pos].vtgid, hh[pos].vppid);
+			pos++;
+		}
+		rcu_read_unlock();
+
+		n = min(tasks_nr, CR_HDR_PIDS_CHUNK);
+		ret = cr_kwrite(ctx, hh, n * sizeof(*hh));
+		if (ret < 0)
+			break;
+
+		tasks_nr -= n;
+	}
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh) * CR_HDR_PIDS_CHUNK);
+	return ret;
+}
+
+/* count number of tasks in tree (and optionally fill pid's in array) */
+static int cr_tree_count_tasks(struct cr_ctx *ctx)
+{
+	struct task_struct *root = ctx->root_task;
+	struct task_struct *task = root;
+	struct task_struct *parent = NULL;
+	struct task_struct **tasks_arr = ctx->tasks_arr;
+	int tasks_nr = ctx->tasks_nr;
+	int nr = 0;
+
+	read_lock(&tasklist_lock);
+
+	/* count tasks via DFS scan of the tree */
+	while (1) {
+		if (tasks_arr) {
+			/* unlikely, but ... */
+			if (nr == tasks_nr)
+				return -EBUSY;	/* cleanup in cr_ctx_free() */
+			tasks_arr[nr] = task;
+			get_task_struct(task);
+		}
+
+		nr++;
+
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+
+		if (task == root)
+			break;
+	}
+
+	read_unlock(&tasklist_lock);
+	return nr;
+}
+
+/*
+ * cr_build_tree - scan the tasks tree in DFS order and fill in array
+ * @ctx: checkpoint context
+ *
+ * Using DFS order simplifies the restart logic to re-create the tasks.
+ *
+ * On success, ctx->tasks_arr will be allocated and populated with all
+ * tasks (reference taken), and ctx->tasks_nr will hold the total count.
+ * The array is cleaned up by cr_ctx_free().
+ */
+static int cr_build_tree(struct cr_ctx *ctx)
+{
+	int n, m;
+
+	/* count tasks (no side effects) */
+	n = cr_tree_count_tasks(ctx);
+
+	ctx->tasks_nr = n;
+	ctx->tasks_arr = kzalloc(n * sizeof(*ctx->tasks_arr), GFP_KERNEL);
+	if (!ctx->tasks_arr)
+		return -ENOMEM;
+
+	/* count again (now will fill array) */
+	m = cr_tree_count_tasks(ctx);
+
+	/* unlikely, but ... (cleanup in cr_ctx_free) */
+	if (m < 0)
+		return m;
+	else if (m != n)
+		return -EBUSY;
+
+	return 0;
+}
+
+/* dump the array that describes the tasks tree */
+static int cr_write_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tree *hh;
+	int ret;
+
+	h.type = CR_HDR_TREE;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	hh->tasks_nr = ctx->tasks_nr;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	ret = cr_write_pids(ctx);
+	return ret;
+}
+
 static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task = NULL;
@@ -278,7 +464,7 @@ static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 	if (!task)
 		goto out;
 
-#if 0	/* enable to use containers */
+#if 0	/* enable with containers */
 	if (!is_container_init(task)) {
 		err = -EINVAL;
 		goto out;
@@ -300,7 +486,7 @@ static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 	if (!nsproxy)
 		goto out;
 
-	/* TODO: verify that the container is frozen */
+	/* FIXME: verify that the container is frozen */
 
 	ctx->root_task = task;
 	ctx->root_nsproxy = nsproxy;
@@ -348,12 +534,22 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 	ret = cr_ctx_checkpoint(ctx, pid);
 	if (ret < 0)
 		goto out;
+
+	ret = cr_build_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_write_task(ctx, ctx->root_task);
+	ret = cr_write_tree(ctx);
 	if (ret < 0)
 		goto out;
+
+	ret = cr_write_all_tasks(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_write_tail(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 7083fff..121c979 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -130,6 +130,19 @@ void cr_hbuf_put(struct cr_ctx *ctx, int n)
  * restart operation, and persists until the operation is completed.
  */
 
+static void cr_task_arr_free(struct cr_ctx *ctx)
+{
+	int n;
+
+	for (n = 0; n < ctx->tasks_nr; n++) {
+		if (ctx->tasks_arr[n]) {
+			put_task_struct(ctx->tasks_arr[n]);
+			ctx->tasks_arr[n] = NULL;
+		}
+	}
+	kfree(ctx->tasks_arr);
+}
+
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
 	if (ctx->file)
@@ -142,6 +155,9 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
+	if (ctx->tasks_arr)
+		cr_task_arr_free(ctx);
+
 	if (ctx->root_nsproxy)
 		put_nsproxy(ctx->root_nsproxy);
 	if (ctx->root_task)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index c97f608..2504717 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -34,6 +34,9 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct task_struct **tasks_arr;	/* array of all tasks in container */
+	int tasks_nr;			/* size of tasks array */
+
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 8c3b5b2..ed8b7fb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -36,7 +36,8 @@ enum {
 	CR_HDR_STRING,
 	CR_HDR_FNAME,
 
-	CR_HDR_TASK = 101,
+	CR_HDR_TREE = 101,
+	CR_HDR_TASK,
 	CR_HDR_THREAD,
 	CR_HDR_CPU,
 
@@ -72,6 +73,16 @@ struct cr_hdr_tail {
 	__u64 magic;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_tree {
+	__u32 tasks_nr;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pids {
+	__s32 vpid;
+	__s32 vtgid;
+	__s32 vppid;
+} __attribute__((aligned(8)));
+
 struct cr_hdr_task {
 	__u32 state;
 	__u32 exit_state;
-- 
1.5.4.3



* [RFC v11][PATCH 12/13] Checkpoint multiple processes
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linus Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy

Checkpointing of multiple processes works by recording the tasks tree
structure below a given task (usually this task is the container init).

For a given task, do a DFS scan of the task tree below it and collect
the tasks into an array (keeping a reference to each task). Using DFS
simplifies the recreation of tasks either in user space or kernel
space. For each task collected, test if it can be checkpointed, and
save its pid, tgid, and ppid.

The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.
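
As a rough userspace illustration of the two-pass scheme (not the
kernel code in this patch), the sketch below walks a tree in DFS
pre-order without recursion, first to count and then to fill an array.
The names struct node, tree_scan() and tree_collect() are hypothetical,
and the locking and task reference counting of the real code are left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for task_struct: only the tree links matter. */
struct node {
	int pid;
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
};

/*
 * Walk the tree below @root in DFS pre-order without recursion.
 * If @arr is NULL only count; otherwise store at most @cap nodes.
 * Returns the number of nodes visited, or -1 if @arr is too small.
 */
static int tree_scan(struct node *root, struct node **arr, int cap)
{
	struct node *n = root;
	int nr = 0;

	while (n) {
		if (arr) {
			if (nr == cap)
				return -1;	/* tree grew between passes */
			arr[nr] = n;
		}
		nr++;

		if (n->first_child) {		/* descend first */
			n = n->first_child;
			continue;
		}
		/* climb until a sibling exists or we are back at root */
		while (n != root && !n->next_sibling)
			n = n->parent;
		if (n == root)
			break;
		n = n->next_sibling;
	}
	return nr;
}

/* Pass 1 counts, then allocate, then pass 2 fills the array. */
static int tree_collect(struct node *root, struct node ***out, int *out_nr)
{
	struct node **arr;
	int n;

	assert(root != NULL);
	n = tree_scan(root, NULL, 0);
	arr = calloc((size_t)n, sizeof(*arr));
	if (!arr)
		return -1;
	if (tree_scan(root, arr, n) != n) {	/* tree changed size */
		free(arr);
		return -1;
	}
	*out = arr;
	*out_nr = n;
	return 0;
}
```

The caller owns the returned array and frees it when done. Comparing
the second count against the first is what catches a tree that changed
between the two scans.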

The logic is suitable for creation of processes during restart either
in userspace or by the kernel.
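
One property that makes DFS order convenient for restart is that a
parent always appears in the array before its children, so tasks can
be created in a single forward pass. The sketch below checks that
property on an array of saved ids. It is only an illustration:
struct pid_rec is a hypothetical mirror of the saved record, and the
check assumes single-threaded processes (so vppid equals the parent's
vpid), in line with threads being ignored for now.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the three ids saved per task at checkpoint. */
struct pid_rec {
	int vpid;
	int vtgid;
	int vppid;
};

/*
 * Return 1 if every record after the first names a parent that appears
 * earlier in the array, 0 otherwise. recs[0] is the root of the tree,
 * so its parent is allowed to be outside the array.
 */
static int parents_precede_children(const struct pid_rec *recs, size_t n)
{
	size_t i, j;

	assert(recs != NULL || n == 0);

	for (i = 1; i < n; i++) {
		for (j = 0; j < i; j++)
			if (recs[j].vpid == recs[i].vppid)
				break;
		if (j == i)
			return 0;	/* parent would not exist yet */
	}
	return 1;
}
```

A userspace tool that recreates the tree could run a check like this
on the array before forking, and refuse input that fails it.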

Currently we ignore threads and zombies, as well as session ids.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/checkpoint.c        |  228 +++++++++++++++++++++++++++++++++++++---
 checkpoint/sys.c               |   16 +++
 include/linux/checkpoint.h     |    3 +
 include/linux/checkpoint_hdr.h |   13 ++-
 4 files changed, 243 insertions(+), 17 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e8e352f..a406feb 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -226,19 +226,6 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
-	/* TODO: verity that the task is frozen (unless self) */
-
-	if (t->state == TASK_DEAD) {
-		pr_warning("c/r: task may not be in state TASK_DEAD\n");
-		return -EAGAIN;
-	}
-
-	if (!atomic_read(&t->may_checkpoint)) {
-		pr_warning("c/r: task %d may not checkpoint\n",
-			   task_pid_vnr(t));
-		return -EBUSY;
-	}
-
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
@@ -261,6 +248,205 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+/* dump all tasks in ctx->tasks_arr[] */
+static int cr_write_all_tasks(struct cr_ctx *ctx)
+{
+	int n, ret = 0;
+
+	for (n = 0; n < ctx->tasks_nr; n++) {
+		cr_debug("dumping task #%d\n", n);
+		ret = cr_write_task(ctx, ctx->tasks_arr[n]);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+
+static int cr_may_checkpoint_task(struct task_struct *t, struct cr_ctx *ctx)
+{
+	cr_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
+		return -EAGAIN;
+	}
+
+	if (!atomic_read(&t->may_checkpoint)) {
+		pr_warning("c/r: task %d uncheckpointable\n", task_pid_vnr(t));
+		return -EBUSY;
+	}
+
+	if (!ptrace_may_access(t, PTRACE_MODE_READ))
+		return -EPERM;
+
+	/* FIXME: verify that the task is frozen (unless self) */
+
+	/* FIXME: change this for nested containers */
+	if (task_nsproxy(t) != ctx->root_nsproxy)
+		return -EPERM;
+
+	return 0;
+}
+
+#define CR_HDR_PIDS_CHUNK	256
+
+static int cr_write_pids(struct cr_ctx *ctx)
+{
+	struct cr_hdr_pids *hh;
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	struct task_struct **tasks_arr;
+	int tasks_nr, n, ret = 0, pos = 0;
+
+	ns = ctx->root_nsproxy->pid_ns;
+	tasks_arr = ctx->tasks_arr;
+	tasks_nr = ctx->tasks_nr;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh) * CR_HDR_PIDS_CHUNK);
+
+	while (tasks_nr > 0) {
+		rcu_read_lock();
+		for (n = min(tasks_nr, CR_HDR_PIDS_CHUNK); n; n--) {
+			task = tasks_arr[pos];
+
+			/* is this task cool ? */
+			ret = cr_may_checkpoint_task(task, ctx);
+			if (ret < 0) {
+				rcu_read_unlock();
+				goto out;
+			}
+			hh[pos].vpid = task_pid_nr_ns(task, ns);
+			hh[pos].vtgid = task_tgid_nr_ns(task, ns);
+			hh[pos].vppid = task_tgid_nr_ns(task->real_parent, ns);
+			cr_debug("task[%d]: vpid %d vtgid %d parent %d\n", pos,
+				 hh[pos].vpid, hh[pos].vtgid, hh[pos].vppid);
+			pos++;
+		}
+		rcu_read_unlock();
+
+		n = min(tasks_nr, CR_HDR_PIDS_CHUNK);
+		ret = cr_kwrite(ctx, hh, n * sizeof(*hh));
+		if (ret < 0)
+			break;
+
+		tasks_nr -= n;
+	}
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh) * CR_HDR_PIDS_CHUNK);
+	return ret;
+}
+
+/* count number of tasks in tree (and optionally fill pid's in array) */
+static int cr_tree_count_tasks(struct cr_ctx *ctx)
+{
+	struct task_struct *root = ctx->root_task;
+	struct task_struct *task = root;
+	struct task_struct *parent = NULL;
+	struct task_struct **tasks_arr = ctx->tasks_arr;
+	int tasks_nr = ctx->tasks_nr;
+	int nr = 0;
+
+	read_lock(&tasklist_lock);
+
+	/* count tasks via DFS scan of the tree */
+	while (1) {
+		if (tasks_arr) {
+			/* unlikely, but ... */
+			if (nr == tasks_nr)
+				return -EBUSY;	/* cleanup in cr_ctx_free() */
+			tasks_arr[nr] = task;
+			get_task_struct(task);
+		}
+
+		nr++;
+
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+
+		if (task == root)
+			break;
+	}
+
+	read_unlock(&tasklist_lock);
+	return nr;
+}
+
+/*
+ * cr_build_tree - scan the tasks tree in DFS order and fill in array
+ * @ctx: checkpoint context
+ *
+ * Using DFS order simplifies the restart logic to re-create the tasks.
+ *
+ * On success, ctx->tasks_arr will be allocated and populated with all
+ * tasks (reference taken), and ctx->tasks_nr will hold the total count.
+ * The array is cleaned up by cr_ctx_free().
+ */
+static int cr_build_tree(struct cr_ctx *ctx)
+{
+	int n, m;
+
+	/* count tasks (no side effects) */
+	n = cr_tree_count_tasks(ctx);
+
+	ctx->tasks_nr = n;
+	ctx->tasks_arr = kzalloc(n * sizeof(*ctx->tasks_arr), GFP_KERNEL);
+	if (!ctx->tasks_arr)
+		return -ENOMEM;
+
+	/* count again (now will fill array) */
+	m = cr_tree_count_tasks(ctx);
+
+	/* unlikely, but ... (cleanup in cr_ctx_free) */
+	if (m < 0)
+		return m;
+	else if (m != n)
+		return -EBUSY;
+
+	return 0;
+}
+
+/* dump the array that describes the tasks tree */
+static int cr_write_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tree *hh;
+	int ret;
+
+	h.type = CR_HDR_TREE;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	hh->tasks_nr = ctx->tasks_nr;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	ret = cr_write_pids(ctx);
+	return ret;
+}
+
 static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task = NULL;
@@ -278,7 +464,7 @@ static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 	if (!task)
 		goto out;
 
-#if 0	/* enable to use containers */
+#if 0	/* enable with containers */
 	if (!is_container_init(task)) {
 		err = -EINVAL;
 		goto out;
@@ -300,7 +486,7 @@ static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 	if (!nsproxy)
 		goto out;
 
-	/* TODO: verify that the container is frozen */
+	/* FIXME: verify that the container is frozen */
 
 	ctx->root_task = task;
 	ctx->root_nsproxy = nsproxy;
@@ -348,12 +534,22 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 	ret = cr_ctx_checkpoint(ctx, pid);
 	if (ret < 0)
 		goto out;
+
+	ret = cr_build_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_write_task(ctx, ctx->root_task);
+	ret = cr_write_tree(ctx);
 	if (ret < 0)
 		goto out;
+
+	ret = cr_write_all_tasks(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_write_tail(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 7083fff..121c979 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -130,6 +130,19 @@ void cr_hbuf_put(struct cr_ctx *ctx, int n)
  * restart operation, and persists until the operation is completed.
  */
 
+static void cr_task_arr_free(struct cr_ctx *ctx)
+{
+	int n;
+
+	for (n = 0; n < ctx->tasks_nr; n++) {
+		if (ctx->tasks_arr[n]) {
+			put_task_struct(ctx->tasks_arr[n]);
+			ctx->tasks_arr[n] = NULL;
+		}
+	}
+	kfree(ctx->tasks_arr);
+}
+
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
 	if (ctx->file)
@@ -142,6 +155,9 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
+	if (ctx->tasks_arr)
+		cr_task_arr_free(ctx);
+
 	if (ctx->root_nsproxy)
 		put_nsproxy(ctx->root_nsproxy);
 	if (ctx->root_task)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index c97f608..2504717 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -34,6 +34,9 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct task_struct **tasks_arr;	/* array of all tasks in container */
+	int tasks_nr;			/* size of tasks array */
+
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 8c3b5b2..ed8b7fb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -36,7 +36,8 @@ enum {
 	CR_HDR_STRING,
 	CR_HDR_FNAME,
 
-	CR_HDR_TASK = 101,
+	CR_HDR_TREE = 101,
+	CR_HDR_TASK,
 	CR_HDR_THREAD,
 	CR_HDR_CPU,
 
@@ -72,6 +73,16 @@ struct cr_hdr_tail {
 	__u64 magic;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_tree {
+	__u32 tasks_nr;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pids {
+	__s32 vpid;
+	__s32 vtgid;
+	__s32 vppid;
+} __attribute__((aligned(8)));
+
 struct cr_hdr_task {
 	__u32 state;
 	__u32 exit_state;
-- 
1.5.4.3



* [RFC v11][PATCH 13/13] Restart multiple processes
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (12 preceding siblings ...)
  2008-12-05 17:31   ` [RFC v11][PATCH 12/13] Checkpoint multiple processes Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-05 17:31   ` Oren Laadan
                     ` (3 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linus Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task restarts itself
in the same order in which the tasks were saved. The internals of the
syscall take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

The init task (*) has a special role: it allocates the restart context
(ctx), and coordinates the operation. In particular, it first waits
until all participating tasks enter the kernel, and provides them the
common restart context. Once everyone is ready, it begins to restart
itself.

In contrast, the other tasks enter the kernel, locate the init task (*)
and grab its restart context, and then wait for their turn to restore.
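
The "wait until everyone has entered" step can be modelled as a
countdown: the coordinator sets a counter to the number of other tasks,
each task decrements it on entry, and the one that brings it to zero
signals the coordinator. The sketch below uses C11 atomics in
userspace. The names struct rendezvous, rendezvous_init() and
rendezvous_checkin() are hypothetical, and the flag stands in for the
completion that the patch signals.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of the shared part of the restart context. */
struct rendezvous {
	atomic_int pending;	/* tasks that have not checked in yet */
	atomic_int all_in;	/* set by whoever checks in last */
};

/* The coordinator expects every task but itself: nr_tasks - 1. */
static void rendezvous_init(struct rendezvous *r, int nr_tasks)
{
	assert(nr_tasks >= 1);
	atomic_init(&r->pending, nr_tasks - 1);
	atomic_init(&r->all_in, nr_tasks == 1);
}

/* Called once by each non-coordinator task; returns 1 for the last. */
static int rendezvous_checkin(struct rendezvous *r)
{
	/* atomic_fetch_sub() returns the value before the decrement */
	if (atomic_fetch_sub(&r->pending, 1) == 1) {
		atomic_store(&r->all_in, 1);
		return 1;
	}
	return 0;
}
```

With a single task there is nobody to wait for, which is why the
sketch marks the rendezvous as already complete in that case.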

When a task (init or not) completes its restart, it hands the control
over to the next in line, by waking that task.

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.
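
The hand-off described above can be modelled, single-threaded, as a
small state machine over the saved pid array. This is a sketch of the
bookkeeping only; the real code sleeps on a wait queue and wakes the
next task, which is not shown. The names struct turn_state,
turn_init(), turn_is_mine() and turn_pass() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the position tracking in the restart context. */
struct turn_state {
	const int *pids;	/* vpids in checkpoint order; pids[0] is init */
	int nr;			/* number of entries in pids[] */
	int pos;		/* index of the currently active task */
	int active;		/* vpid whose turn it is, -1 when all done */
};

static void turn_init(struct turn_state *s, const int *pids, int nr)
{
	assert(pids != NULL && nr >= 1);
	s->pids = pids;
	s->nr = nr;
	s->pos = 0;
	s->active = pids[0];	/* the init task goes first */
}

/* A waiting task proceeds only when this returns nonzero. */
static int turn_is_mine(const struct turn_state *s, int pid)
{
	return s->active == pid;
}

/*
 * Called by the active task once its own restart is complete.
 * Returns the vpid of the task to wake next, or -1 if none remain.
 */
static int turn_pass(struct turn_state *s)
{
	s->pos++;
	if (s->pos >= s->nr) {
		s->active = -1;
		return -1;
	}
	s->active = s->pids[s->pos];
	return s->active;
}
```

Because only the active task ever advances the position, the model
needs no lock of its own; whether that holds in a concurrent setting
depends on the wake-up and wait primitives used around it.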

Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In this case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/restart.c       |  214 +++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/sys.c           |   34 ++++++--
 include/linux/checkpoint.h |   23 ++++-
 include/linux/sched.h      |    1 +
 4 files changed, 258 insertions(+), 14 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index f4f737d..24392ee 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/checkpoint.h>
@@ -276,30 +277,235 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* cr_read_tree - read the tasks tree into the checkpoint context */
+static int cr_read_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tree *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, size, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TREE);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->tasks_nr < 0)
+		goto out;
+
+	ctx->pids_nr = hh->tasks_nr;
+	size = sizeof(*ctx->pids_arr) * ctx->pids_nr;
+	if (size < 0)		/* overflow ? */
+		goto out;
+
+	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+	if (!ctx->pids_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = cr_kread(ctx, ctx->pids_arr, size);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_wait_task(struct cr_ctx *ctx)
+{
+	pid_t pid = task_pid_vnr(current);
+
+	cr_debug("pid %d waiting\n", pid);
+	return wait_event_interruptible(ctx->waitq, ctx->pids_active == pid);
+}
+
+static int cr_next_task(struct cr_ctx *ctx)
+{
+	struct task_struct *tsk;
+
+	ctx->pids_pos++;
+
+	cr_debug("pids_pos %d %d\n", ctx->pids_pos, ctx->pids_nr);
+	if (ctx->pids_pos == ctx->pids_nr) {
+		complete(&ctx->complete);
+		return 0;
+	}
+
+	ctx->pids_active = ctx->pids_arr[ctx->pids_pos].vpid;
+
+	cr_debug("pids_next %d\n", ctx->pids_active);
+
+	rcu_read_lock();
+	tsk = find_task_by_pid_ns(ctx->pids_active, ctx->root_nsproxy->pid_ns);
+	if (tsk)
+		wake_up_process(tsk);
+	rcu_read_unlock();
+
+	if (!tsk) {
+		ctx->pids_err = -ESRCH;
+		complete(&ctx->complete);
+		return -ESRCH;
+	}
+
+	return 0;
+}
+
+/* FIXME: this should be per container */
+DECLARE_WAIT_QUEUE_HEAD(cr_restart_waitq);
+
+static int do_restart_task(struct cr_ctx *ctx, pid_t pid)
+{
+	struct task_struct *root_task;
+	int ret;
+
+	rcu_read_lock();
+	root_task = find_task_by_pid_ns(pid, current->nsproxy->pid_ns);
+	if (root_task)
+		get_task_struct(root_task);
+	rcu_read_unlock();
+
+	if (!root_task)
+		return -EINVAL;
+
+	/*
+	 * wait for container init to initialize the restart context, then
+	 * grab a reference to that context, and if we're the last task to
+	 * do it, notify the container init.
+	 */
+	ret = wait_event_interruptible(cr_restart_waitq,
+				       root_task->checkpoint_ctx);
+	if (ret < 0)
+		goto out;
+
+	task_lock(root_task);
+	ctx = root_task->checkpoint_ctx;
+	if (ctx)
+		cr_ctx_get(ctx);
+	task_unlock(root_task);
+
+	if (!ctx) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	if (atomic_dec_and_test(&ctx->tasks_count))
+		complete(&ctx->complete);
+
+	/* wait for our turn, do the restore, and tell next task in line */
+	ret = cr_wait_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_next_task(ctx);
+
+ out:
+	cr_ctx_put(ctx);
+	put_task_struct(root_task);
+	return ret;
+}
+
+static int cr_wait_all_tasks_start(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+	current->checkpoint_ctx = ctx;
+
+	wake_up_all(&cr_restart_waitq);
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	task_lock(current);
+	current->checkpoint_ctx = NULL;
+	task_unlock(current);
+
+	return 0;
+}
+
+static int cr_wait_all_tasks_finish(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+
+	ret = cr_next_task(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
 /* setup restart-specific parts of ctx */
 static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
+	ctx->root_pid = pid;
+	ctx->root_task = current;
+	ctx->root_nsproxy = current->nsproxy;
+
+	get_task_struct(ctx->root_task);
+	get_nsproxy(ctx->root_nsproxy);
+
+	atomic_set(&ctx->tasks_count, ctx->pids_nr - 1);
+
 	return 0;
 }
 
-int do_restart(struct cr_ctx *ctx, pid_t pid)
+static int do_restart_root(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_head(ctx);
+
+	/* wait for all other tasks to enter do_restart_task() */
+	ret = cr_wait_all_tasks_start(ctx);
 	if (ret < 0)
 		goto out;
+
 	ret = cr_read_task(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_tail(ctx);
+
+	/* wait for all other tasks to complete do_restart_task() */
+	ret = cr_wait_all_tasks_finish(ctx);
 	if (ret < 0)
 		goto out;
 
-	/* on success, adjust the return value if needed [TODO] */
+	ret = cr_read_tail(ctx);
+
  out:
 	return ret;
 }
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	if (ctx)
+		ret = do_restart_root(ctx, pid);
+	else
+		ret = do_restart_task(ctx, pid);
+
+	/* on success, adjust the return value if needed [TODO] */
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 121c979..188dbe3 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -145,6 +145,8 @@ static void cr_task_arr_free(struct cr_ctx *ctx)
 
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
+	BUG_ON(atomic_read(&ctx->refcount));
+
 	if (ctx->file)
 		fput(ctx->file);
 
@@ -163,6 +165,8 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	if (ctx->root_task)
 		put_task_struct(ctx->root_task);
 
+	kfree(ctx->pids_arr);
+
 	kfree(ctx);
 }
 
@@ -177,7 +181,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	atomic_set(&ctx->refcount, 0);
 	INIT_LIST_HEAD(&ctx->pgarr_list);
+	init_waitqueue_head(&ctx->waitq);
 
 	err = -EBADF;
 	ctx->file = fget(fd);
@@ -192,6 +198,7 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (cr_objhash_alloc(ctx) < 0)
 		goto err;
 
+	atomic_inc(&ctx->refcount);
 	return ctx;
 
  err:
@@ -199,6 +206,17 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	return ERR_PTR(err);
 }
 
+void cr_ctx_get(struct cr_ctx *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+void cr_ctx_put(struct cr_ctx *ctx)
+{
+	if (ctx && atomic_dec_and_test(&ctx->refcount))
+		cr_ctx_free(ctx);
+}
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -226,7 +244,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 	if (!ret)
 		ret = ctx->crid;
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
 
@@ -241,7 +259,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	struct cr_ctx *ctx;
+	struct cr_ctx *ctx = NULL;
 	pid_t pid;
 	int ret;
 
@@ -249,15 +267,17 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 	if (flags)
 		return -EINVAL;
 
-	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
-	if (IS_ERR(ctx))
-		return PTR_ERR(ctx);
-
 	/* FIXME: for now, we use 'crid' as a pid */
 	pid = (pid_t) crid;
 
+	if (pid == task_pid_vnr(current))
+		ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
 	ret = do_restart(ctx, pid);
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2504717..b32c0a7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,10 +13,11 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/sched.h>
+#include <asm/atomic.h>
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 
-#define CR_VERSION  2
+#define CR_VERSION  3
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -34,14 +35,27 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
-	struct task_struct **tasks_arr;	/* array of all tasks in container */
-	int tasks_nr;			/* size of tasks array */
+	atomic_t refcount;
 
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 
 	struct path fs_mnt;	/* container root (FIXME) */
+
+	/* [multi-process checkpoint] */
+	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+	int tasks_nr;                   /* size of tasks array */
+
+	/* [multi-process restart] */
+	struct cr_hdr_pids *pids_arr;	/* array of all pids [restart] */
+	int pids_nr;			/* size of pids array */
+	int pids_pos;			/* position pids array */
+	int pids_err;			/* error occured ? */
+	pid_t pids_active;		/* pid of (next) active task */
+	atomic_t tasks_count;		/* sync of restarting tasks */
+	struct completion complete;	/* sync of restarting tasks */
+	wait_queue_head_t waitq;	/* sync of restarting tasks */
 };
 
 /* cr_ctx: flags */
@@ -54,6 +68,9 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+extern void cr_ctx_get(struct cr_ctx *ctx);
+extern void cr_ctx_put(struct cr_ctx *ctx);
+
 /* shared objects handling */
 
 enum {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index faa2ec6..0150e90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,6 +1359,7 @@ struct task_struct {
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 	atomic_t may_checkpoint;
+	struct cr_ctx *checkpoint_ctx;
 #endif
 };
 
-- 
1.5.4.3


* [RFC v11][PATCH 13/13] Restart multiple processes
  2008-12-05 17:31 ` Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  -1 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linus Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task restarts itself
in the same order in which the tasks were saved. The internals of the
syscall take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

The init task (*) has a special role: it allocates the restart context
(ctx) and coordinates the operation. In particular, it first waits
until all participating tasks enter the kernel, and provides them with
the common restart context. Once everyone is ready, it begins to
restart itself.
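
As a rough userspace illustration of the "wait until all participating
tasks enter the kernel" step, the sketch below counts arrivals with an
atomic counter, in the spirit of the tasks_count field this patch adds.
The struct start_gate type and the gate_*() helpers are hypothetical
names made up for this example. The patch itself uses
atomic_dec_and_test() and a struct completion rather than the plain
flag shown here, and a threaded version would need a condition
variable (or similar) so the root can sleep until the flag is set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the counting fields of struct cr_ctx. */
struct start_gate {
	atomic_int tasks_count;	/* tasks that have not checked in yet */
	bool all_in;		/* plays the role of the completion */
};

/* The root task does not count itself, hence pids_nr - 1. */
static void gate_init(struct start_gate *g, int pids_nr)
{
	assert(pids_nr >= 1);
	atomic_init(&g->tasks_count, pids_nr - 1);
	g->all_in = (pids_nr == 1);	/* single task: nothing to wait for */
}

/*
 * Called once by every non-root task. Returns true only for the last
 * arrival, which is the one that would signal the root task.
 */
static bool gate_checkin(struct start_gate *g)
{
	/* atomic_fetch_sub() returns the value held before the decrement */
	bool last = (atomic_fetch_sub(&g->tasks_count, 1) == 1);

	if (last)
		g->all_in = true;
	return last;
}
```

With pids_nr == 4, the first two calls to gate_checkin() return false
and the third returns true; only at that point would the root go on to
restore itself.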

In contrast, the other tasks enter the kernel, locate the init task (*)
and grab its restart context, and then wait for their turn to restore.
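
"Grabbing" the context means taking a reference on it, so that the ctx
stays valid until the last user drops it. The following is a minimal
userspace sketch of that get/put pattern using C11 atomics. The
demo_ctx names are hypothetical and only mirror the shape of the
cr_ctx_get()/cr_ctx_put() helpers in the diff below; this is not the
kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct cr_ctx. */
struct demo_ctx {
	atomic_int refcount;
};

/* Allocate a context holding one reference (the creator's). */
static struct demo_ctx *demo_ctx_alloc(void)
{
	struct demo_ctx *ctx = calloc(1, sizeof(*ctx));

	if (ctx)
		atomic_init(&ctx->refcount, 1);
	return ctx;
}

/* Take an extra reference, as each non-root task would. */
static void demo_ctx_get(struct demo_ctx *ctx)
{
	atomic_fetch_add(&ctx->refcount, 1);
}

/*
 * Drop a reference. Returns 1 if this call released the last
 * reference and freed the context, 0 otherwise (including ctx == NULL).
 */
static int demo_ctx_put(struct demo_ctx *ctx)
{
	int old;

	if (!ctx)
		return 0;
	old = atomic_fetch_sub(&ctx->refcount, 1);
	assert(old > 0);	/* more puts than gets would be a bug */
	if (old != 1)
		return 0;
	free(ctx);
	return 1;
}
```

Accepting NULL in the put helper keeps error paths simple: a task that
fails before it obtained the context can still run the common cleanup
code.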

When a task (init or not) completes its restart, it hands control over
to the next task in line by waking that task.

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.
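
The bookkeeping described above can be modelled in a few lines of
plain C. This is a single-threaded sketch under stated assumptions, not
the kernel implementation: struct turn_ctx and the turn_*() functions
are hypothetical, pids are assumed to be positive and unique, and a
"sleeping" task is modelled as an attempt that returns false and is
retried later, where the patch sleeps and is woken with
wake_up_process().

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 8

/* Hypothetical model of the pids_* fields in the restart context. */
struct turn_ctx {
	int pids[MAX_TASKS];	/* order saved at checkpoint time */
	int pids_nr;
	int pids_pos;		/* index of the active task */
	int pids_active;	/* pid allowed to run, -1 when all are done */
};

static int turn_init(struct turn_ctx *c, const int *pids, int nr)
{
	int i;

	if (nr < 1 || nr > MAX_TASKS)
		return -1;
	for (i = 0; i < nr; i++)
		c->pids[i] = pids[i];
	c->pids_nr = nr;
	c->pids_pos = 0;
	c->pids_active = pids[0];	/* the first entry is the init task */
	return 0;
}

/* One attempt by 'pid': false if it is not this task's turn yet. */
static bool turn_try(struct turn_ctx *c, int pid)
{
	if (c->pids_active != pid)
		return false;
	/* ... this task's state would be restored here ... */
	c->pids_pos++;
	c->pids_active = c->pids_pos < c->pids_nr ?
			 c->pids[c->pids_pos] : -1;
	return true;
}

/*
 * Let tasks retry in 'arrival' order until all have run, recording the
 * order in which they actually ran. Returns -1 if no task can make
 * progress (a saved pid never showed up).
 */
static int turn_run(const int *saved, const int *arrival, int nr, int *order)
{
	struct turn_ctx c;
	int done = 0, i;

	if (turn_init(&c, saved, nr) < 0)
		return -1;
	while (done < nr) {
		bool progressed = false;

		for (i = 0; i < nr; i++) {
			if (turn_try(&c, arrival[i])) {
				order[done++] = arrival[i];
				progressed = true;
			}
		}
		if (!progressed)
			return -1;
	}
	assert(c.pids_active == -1);
	return 0;
}
```

Whatever order the tasks arrive in, they run in the saved order. If a
saved pid never arrives, the model reports an error at the point where
the real syscall would fail to find the next task or keep waiting.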

Restart assumes that userspace provides meaningful data; otherwise
it's garbage-in-garbage-out. With bad input the syscall may block
indefinitely, but it sleeps in TASK_INTERRUPTIBLE, so the user can
ctrl-c or otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/restart.c       |  214 +++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/sys.c           |   34 ++++++--
 include/linux/checkpoint.h |   23 ++++-
 include/linux/sched.h      |    1 +
 4 files changed, 258 insertions(+), 14 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index f4f737d..24392ee 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/checkpoint.h>
@@ -276,30 +277,235 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* cr_read_tree - read the tasks tree into the checkpoint context */
+static int cr_read_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tree *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, size, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TREE);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->tasks_nr < 0)
+		goto out;
+
+	ctx->pids_nr = hh->tasks_nr;
+	size = sizeof(*ctx->pids_arr) * ctx->pids_nr;
+	if (size < 0)		/* overflow ? */
+		goto out;
+
+	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+	if (!ctx->pids_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = cr_kread(ctx, ctx->pids_arr, size);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_wait_task(struct cr_ctx *ctx)
+{
+	pid_t pid = task_pid_vnr(current);
+
+	cr_debug("pid %d waiting\n", pid);
+	return wait_event_interruptible(ctx->waitq, ctx->pids_active == pid);
+}
+
+static int cr_next_task(struct cr_ctx *ctx)
+{
+	struct task_struct *tsk;
+
+	ctx->pids_pos++;
+
+	cr_debug("pids_pos %d %d\n", ctx->pids_pos, ctx->pids_nr);
+	if (ctx->pids_pos == ctx->pids_nr) {
+		complete(&ctx->complete);
+		return 0;
+	}
+
+	ctx->pids_active = ctx->pids_arr[ctx->pids_pos].vpid;
+
+	cr_debug("pids_next %d\n", ctx->pids_active);
+
+	rcu_read_lock();
+	tsk = find_task_by_pid_ns(ctx->pids_active, ctx->root_nsproxy->pid_ns);
+	if (tsk)
+		wake_up_process(tsk);
+	rcu_read_unlock();
+
+	if (!tsk) {
+		ctx->pids_err = -ESRCH;
+		complete(&ctx->complete);
+		return -ESRCH;
+	}
+
+	return 0;
+}
+
+/* FIXME: this should be per container */
+DECLARE_WAIT_QUEUE_HEAD(cr_restart_waitq);
+
+static int do_restart_task(struct cr_ctx *ctx, pid_t pid)
+{
+	struct task_struct *root_task;
+	int ret;
+
+	rcu_read_lock();
+	root_task = find_task_by_pid_ns(pid, current->nsproxy->pid_ns);
+	if (root_task)
+		get_task_struct(root_task);
+	rcu_read_unlock();
+
+	if (!root_task)
+		return -EINVAL;
+
+	/*
+	 * wait for container init to initialize the restart context, then
+	 * grab a reference to that context, and if we're the last task to
+	 * do it, notify the container init.
+	 */
+	ret = wait_event_interruptible(cr_restart_waitq,
+				       root_task->checkpoint_ctx);
+	if (ret < 0)
+		goto out;
+
+	task_lock(root_task);
+	ctx = root_task->checkpoint_ctx;
+	if (ctx)
+		cr_ctx_get(ctx);
+	task_unlock(root_task);
+
+	if (!ctx) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	if (atomic_dec_and_test(&ctx->tasks_count))
+		complete(&ctx->complete);
+
+	/* wait for our turn, do the restore, and tell next task in line */
+	ret = cr_wait_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_next_task(ctx);
+
+ out:
+	cr_ctx_put(ctx);
+	put_task_struct(root_task);
+	return ret;
+}
+
+static int cr_wait_all_tasks_start(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+	current->checkpoint_ctx = ctx;
+
+	wake_up_all(&cr_restart_waitq);
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	task_lock(current);
+	current->checkpoint_ctx = NULL;
+	task_unlock(current);
+
+	return 0;
+}
+
+static int cr_wait_all_tasks_finish(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+
+	ret = cr_next_task(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
 /* setup restart-specific parts of ctx */
 static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
+	ctx->root_pid = pid;
+	ctx->root_task = current;
+	ctx->root_nsproxy = current->nsproxy;
+
+	get_task_struct(ctx->root_task);
+	get_nsproxy(ctx->root_nsproxy);
+
+	atomic_set(&ctx->tasks_count, ctx->pids_nr - 1);
+
 	return 0;
 }
 
-int do_restart(struct cr_ctx *ctx, pid_t pid)
+static int do_restart_root(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_head(ctx);
+
+	/* wait for all other tasks to enter do_restart_task() */
+	ret = cr_wait_all_tasks_start(ctx);
 	if (ret < 0)
 		goto out;
+
 	ret = cr_read_task(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_tail(ctx);
+
+	/* wait for all other tasks to complete do_restart_task() */
+	ret = cr_wait_all_tasks_finish(ctx);
 	if (ret < 0)
 		goto out;
 
-	/* on success, adjust the return value if needed [TODO] */
+	ret = cr_read_tail(ctx);
+
  out:
 	return ret;
 }
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	if (ctx)
+		ret = do_restart_root(ctx, pid);
+	else
+		ret = do_restart_task(ctx, pid);
+
+	/* on success, adjust the return value if needed [TODO] */
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 121c979..188dbe3 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -145,6 +145,8 @@ static void cr_task_arr_free(struct cr_ctx *ctx)
 
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
+	BUG_ON(atomic_read(&ctx->refcount));
+
 	if (ctx->file)
 		fput(ctx->file);
 
@@ -163,6 +165,8 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	if (ctx->root_task)
 		put_task_struct(ctx->root_task);
 
+	kfree(ctx->pids_arr);
+
 	kfree(ctx);
 }
 
@@ -177,7 +181,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	atomic_set(&ctx->refcount, 0);
 	INIT_LIST_HEAD(&ctx->pgarr_list);
+	init_waitqueue_head(&ctx->waitq);
 
 	err = -EBADF;
 	ctx->file = fget(fd);
@@ -192,6 +198,7 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (cr_objhash_alloc(ctx) < 0)
 		goto err;
 
+	atomic_inc(&ctx->refcount);
 	return ctx;
 
  err:
@@ -199,6 +206,17 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	return ERR_PTR(err);
 }
 
+void cr_ctx_get(struct cr_ctx *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+void cr_ctx_put(struct cr_ctx *ctx)
+{
+	if (ctx && atomic_dec_and_test(&ctx->refcount))
+		cr_ctx_free(ctx);
+}
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -226,7 +244,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 	if (!ret)
 		ret = ctx->crid;
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
 
@@ -241,7 +259,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	struct cr_ctx *ctx;
+	struct cr_ctx *ctx = NULL;
 	pid_t pid;
 	int ret;
 
@@ -249,15 +267,17 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 	if (flags)
 		return -EINVAL;
 
-	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
-	if (IS_ERR(ctx))
-		return PTR_ERR(ctx);
-
 	/* FIXME: for now, we use 'crid' as a pid */
 	pid = (pid_t) crid;
 
+	if (pid == task_pid_vnr(current))
+		ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
 	ret = do_restart(ctx, pid);
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2504717..b32c0a7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,10 +13,11 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/sched.h>
+#include <asm/atomic.h>
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 
-#define CR_VERSION  2
+#define CR_VERSION  3
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -34,14 +35,27 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
-	struct task_struct **tasks_arr;	/* array of all tasks in container */
-	int tasks_nr;			/* size of tasks array */
+	atomic_t refcount;
 
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 
 	struct path fs_mnt;	/* container root (FIXME) */
+
+	/* [multi-process checkpoint] */
+	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+	int tasks_nr;                   /* size of tasks array */
+
+	/* [multi-process restart] */
+	struct cr_hdr_pids *pids_arr;	/* array of all pids [restart] */
+	int pids_nr;			/* size of pids array */
+	int pids_pos;			/* position in pids array */
+	int pids_err;			/* error occurred ? */
+	pid_t pids_active;		/* pid of (next) active task */
+	atomic_t tasks_count;		/* sync of restarting tasks */
+	struct completion complete;	/* sync of restarting tasks */
+	wait_queue_head_t waitq;	/* sync of restarting tasks */
 };
 
 /* cr_ctx: flags */
@@ -54,6 +68,9 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+extern void cr_ctx_get(struct cr_ctx *ctx);
+extern void cr_ctx_put(struct cr_ctx *ctx);
+
 /* shared objects handling */
 
 enum {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index faa2ec6..0150e90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,6 +1359,7 @@ struct task_struct {
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 	atomic_t may_checkpoint;
+	struct cr_ctx *checkpoint_ctx;
 #endif
 };
 
-- 
1.5.4.3



* [RFC v11][PATCH 13/13] Restart multiple processes
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (13 preceding siblings ...)
  2008-12-05 17:31   ` [RFC v11][PATCH 13/13] Restart " Oren Laadan
@ 2008-12-05 17:31   ` Oren Laadan
  2008-12-06  0:19   ` [RFC v11][PATCH 00/13] Kernel based checkpoint/restart Serge E. Hallyn
                     ` (2 subsequent siblings)
  17 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd-r2nGTMty4D4,
	jeremy-TSDbQ3PG+2Y, Oren Laadan

Restarting multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task restarts itself
in the same order in which the tasks were saved. The internals of the
syscall take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

The init task (*) has a special role: it allocates the restart context
(ctx) and coordinates the operation. In particular, it first waits
until all participating tasks enter the kernel, and provides them with
the common restart context. Once everyone is ready, it begins to
restart itself.

In contrast, the other tasks enter the kernel, locate the init task (*)
and grab its restart context, and then wait for their turn to restore.

When a task (init or not) completes its restart, it hands control over
to the next task in line by waking that task.

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.

Restart assumes that userspace provides meaningful data; otherwise
it's garbage-in-garbage-out. With bad input the syscall may block
indefinitely, but it sleeps in TASK_INTERRUPTIBLE, so the user can
ctrl-c or otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/restart.c       |  214 +++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/sys.c           |   34 ++++++--
 include/linux/checkpoint.h |   23 ++++-
 include/linux/sched.h      |    1 +
 4 files changed, 258 insertions(+), 14 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index f4f737d..24392ee 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/checkpoint.h>
@@ -276,30 +277,235 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* cr_read_tree - read the tasks tree into the checkpoint context */
+static int cr_read_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tree *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, size, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TREE);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->tasks_nr < 0)
+		goto out;
+
+	ctx->pids_nr = hh->tasks_nr;
+	size = sizeof(*ctx->pids_arr) * ctx->pids_nr;
+	if (size < 0)		/* overflow ? */
+		goto out;
+
+	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+	if (!ctx->pids_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = cr_kread(ctx, ctx->pids_arr, size);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_wait_task(struct cr_ctx *ctx)
+{
+	pid_t pid = task_pid_vnr(current);
+
+	cr_debug("pid %d waiting\n", pid);
+	return wait_event_interruptible(ctx->waitq, ctx->pids_active == pid);
+}
+
+static int cr_next_task(struct cr_ctx *ctx)
+{
+	struct task_struct *tsk;
+
+	ctx->pids_pos++;
+
+	cr_debug("pids_pos %d %d\n", ctx->pids_pos, ctx->pids_nr);
+	if (ctx->pids_pos == ctx->pids_nr) {
+		complete(&ctx->complete);
+		return 0;
+	}
+
+	ctx->pids_active = ctx->pids_arr[ctx->pids_pos].vpid;
+
+	cr_debug("pids_next %d\n", ctx->pids_active);
+
+	rcu_read_lock();
+	tsk = find_task_by_pid_ns(ctx->pids_active, ctx->root_nsproxy->pid_ns);
+	if (tsk)
+		wake_up_process(tsk);
+	rcu_read_unlock();
+
+	if (!tsk) {
+		ctx->pids_err = -ESRCH;
+		complete(&ctx->complete);
+		return -ESRCH;
+	}
+
+	return 0;
+}
+
+/* FIXME: this should be per container */
+DECLARE_WAIT_QUEUE_HEAD(cr_restart_waitq);
+
+static int do_restart_task(struct cr_ctx *ctx, pid_t pid)
+{
+	struct task_struct *root_task;
+	int ret;
+
+	rcu_read_lock();
+	root_task = find_task_by_pid_ns(pid, current->nsproxy->pid_ns);
+	if (root_task)
+		get_task_struct(root_task);
+	rcu_read_unlock();
+
+	if (!root_task)
+		return -EINVAL;
+
+	/*
+	 * wait for container init to initialize the restart context, then
+	 * grab a reference to that context, and if we're the last task to
+	 * do it, notify the container init.
+	 */
+	ret = wait_event_interruptible(cr_restart_waitq,
+				       root_task->checkpoint_ctx);
+	if (ret < 0)
+		goto out;
+
+	task_lock(root_task);
+	ctx = root_task->checkpoint_ctx;
+	if (ctx)
+		cr_ctx_get(ctx);
+	task_unlock(root_task);
+
+	if (!ctx) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	if (atomic_dec_and_test(&ctx->tasks_count))
+		complete(&ctx->complete);
+
+	/* wait for our turn, do the restore, and tell next task in line */
+	ret = cr_wait_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_next_task(ctx);
+
+ out:
+	cr_ctx_put(ctx);
+	put_task_struct(root_task);
+	return ret;
+}
+
+static int cr_wait_all_tasks_start(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+	current->checkpoint_ctx = ctx;
+
+	wake_up_all(&cr_restart_waitq);
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	task_lock(current);
+	current->checkpoint_ctx = NULL;
+	task_unlock(current);
+
+	return 0;
+}
+
+static int cr_wait_all_tasks_finish(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+
+	ret = cr_next_task(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
 /* setup restart-specific parts of ctx */
 static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
+	ctx->root_pid = pid;
+	ctx->root_task = current;
+	ctx->root_nsproxy = current->nsproxy;
+
+	get_task_struct(ctx->root_task);
+	get_nsproxy(ctx->root_nsproxy);
+
+	atomic_set(&ctx->tasks_count, ctx->pids_nr - 1);
+
 	return 0;
 }
 
-int do_restart(struct cr_ctx *ctx, pid_t pid)
+static int do_restart_root(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_head(ctx);
+
+	/* wait for all other tasks to enter do_restart_task() */
+	ret = cr_wait_all_tasks_start(ctx);
 	if (ret < 0)
 		goto out;
+
 	ret = cr_read_task(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_tail(ctx);
+
+	/* wait for all other tasks to complete do_restart_task() */
+	ret = cr_wait_all_tasks_finish(ctx);
 	if (ret < 0)
 		goto out;
 
-	/* on success, adjust the return value if needed [TODO] */
+	ret = cr_read_tail(ctx);
+
  out:
 	return ret;
 }
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	if (ctx)
+		ret = do_restart_root(ctx, pid);
+	else
+		ret = do_restart_task(ctx, pid);
+
+	/* on success, adjust the return value if needed [TODO] */
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 121c979..188dbe3 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -145,6 +145,8 @@ static void cr_task_arr_free(struct cr_ctx *ctx)
 
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
+	BUG_ON(atomic_read(&ctx->refcount));
+
 	if (ctx->file)
 		fput(ctx->file);
 
@@ -163,6 +165,8 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	if (ctx->root_task)
 		put_task_struct(ctx->root_task);
 
+	kfree(ctx->pids_arr);
+
 	kfree(ctx);
 }
 
@@ -177,7 +181,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	atomic_set(&ctx->refcount, 0);
 	INIT_LIST_HEAD(&ctx->pgarr_list);
+	init_waitqueue_head(&ctx->waitq);
 
 	err = -EBADF;
 	ctx->file = fget(fd);
@@ -192,6 +198,7 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (cr_objhash_alloc(ctx) < 0)
 		goto err;
 
+	atomic_inc(&ctx->refcount);
 	return ctx;
 
  err:
@@ -199,6 +206,17 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	return ERR_PTR(err);
 }
 
+void cr_ctx_get(struct cr_ctx *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+void cr_ctx_put(struct cr_ctx *ctx)
+{
+	if (ctx && atomic_dec_and_test(&ctx->refcount))
+		cr_ctx_free(ctx);
+}
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -226,7 +244,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 	if (!ret)
 		ret = ctx->crid;
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
 
@@ -241,7 +259,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	struct cr_ctx *ctx;
+	struct cr_ctx *ctx = NULL;
 	pid_t pid;
 	int ret;
 
@@ -249,15 +267,17 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 	if (flags)
 		return -EINVAL;
 
-	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
-	if (IS_ERR(ctx))
-		return PTR_ERR(ctx);
-
 	/* FIXME: for now, we use 'crid' as a pid */
 	pid = (pid_t) crid;
 
+	if (pid == task_pid_vnr(current))
+		ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
 	ret = do_restart(ctx, pid);
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2504717..b32c0a7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,10 +13,11 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/sched.h>
+#include <asm/atomic.h>
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 
-#define CR_VERSION  2
+#define CR_VERSION  3
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -34,14 +35,27 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
-	struct task_struct **tasks_arr;	/* array of all tasks in container */
-	int tasks_nr;			/* size of tasks array */
+	atomic_t refcount;
 
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 
 	struct path fs_mnt;	/* container root (FIXME) */
+
+	/* [multi-process checkpoint] */
+	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+	int tasks_nr;                   /* size of tasks array */
+
+	/* [multi-process restart] */
+	struct cr_hdr_pids *pids_arr;	/* array of all pids [restart] */
+	int pids_nr;			/* size of pids array */
+	int pids_pos;			/* position in pids array */
+	int pids_err;			/* error occurred ? */
+	pid_t pids_active;		/* pid of (next) active task */
+	atomic_t tasks_count;		/* sync of restarting tasks */
+	struct completion complete;	/* sync of restarting tasks */
+	wait_queue_head_t waitq;	/* sync of restarting tasks */
 };
 
 /* cr_ctx: flags */
@@ -54,6 +68,9 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+extern void cr_ctx_get(struct cr_ctx *ctx);
+extern void cr_ctx_put(struct cr_ctx *ctx);
+
 /* shared objects handling */
 
 enum {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index faa2ec6..0150e90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,6 +1359,7 @@ struct task_struct {
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 	atomic_t may_checkpoint;
+	struct cr_ctx *checkpoint_ctx;
 #endif
 };
 
-- 
1.5.4.3



* [RFC v11][PATCH 13/13] Restart multiple processes
  2008-12-05 17:31 ` Oren Laadan
                   ` (16 preceding siblings ...)
  (?)
@ 2008-12-05 17:31 ` Oren Laadan
  -1 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy,
	Oren Laadan

Restarting multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task restarts itself
in the same order in which the tasks were saved. The internals of the
syscall take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

The init task (*) has a special role: it allocates the restart context
(ctx) and coordinates the operation. In particular, it first waits
until all participating tasks enter the kernel, and provides them with
the common restart context. Once everyone is ready, it begins to
restart itself.

In contrast, the other tasks enter the kernel, locate the init task (*)
and grab its restart context, and then wait for their turn to restore.

When a task (init or not) completes its restart, it hands control over
to the next task in line by waking that task.

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.

Restart assumes that userspace provides meaningful data; otherwise
it's garbage-in-garbage-out. With bad input the syscall may block
indefinitely, but it sleeps in TASK_INTERRUPTIBLE, so the user can
ctrl-c or otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/restart.c       |  214 +++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/sys.c           |   34 ++++++--
 include/linux/checkpoint.h |   23 ++++-
 include/linux/sched.h      |    1 +
 4 files changed, 258 insertions(+), 14 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index f4f737d..24392ee 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/checkpoint.h>
@@ -276,30 +277,235 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* cr_read_tree - read the tasks tree into the checkpoint context */
+static int cr_read_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tree *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, size, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TREE);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->tasks_nr < 0)
+		goto out;
+
+	ctx->pids_nr = hh->tasks_nr;
+	size = sizeof(*ctx->pids_arr) * ctx->pids_nr;
+	if (size < 0)		/* overflow ? */
+		goto out;
+
+	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+	if (!ctx->pids_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = cr_kread(ctx, ctx->pids_arr, size);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_wait_task(struct cr_ctx *ctx)
+{
+	pid_t pid = task_pid_vnr(current);
+
+	cr_debug("pid %d waiting\n", pid);
+	return wait_event_interruptible(ctx->waitq, ctx->pids_active == pid);
+}
+
+static int cr_next_task(struct cr_ctx *ctx)
+{
+	struct task_struct *tsk;
+
+	ctx->pids_pos++;
+
+	cr_debug("pids_pos %d %d\n", ctx->pids_pos, ctx->pids_nr);
+	if (ctx->pids_pos == ctx->pids_nr) {
+		complete(&ctx->complete);
+		return 0;
+	}
+
+	ctx->pids_active = ctx->pids_arr[ctx->pids_pos].vpid;
+
+	cr_debug("pids_next %d\n", ctx->pids_active);
+
+	rcu_read_lock();
+	tsk = find_task_by_pid_ns(ctx->pids_active, ctx->root_nsproxy->pid_ns);
+	if (tsk)
+		wake_up_process(tsk);
+	rcu_read_unlock();
+
+	if (!tsk) {
+		ctx->pids_err = -ESRCH;
+		complete(&ctx->complete);
+		return -ESRCH;
+	}
+
+	return 0;
+}
+
+/* FIXME: this should be per container */
+DECLARE_WAIT_QUEUE_HEAD(cr_restart_waitq);
+
+static int do_restart_task(struct cr_ctx *ctx, pid_t pid)
+{
+	struct task_struct *root_task;
+	int ret;
+
+	rcu_read_lock();
+	root_task = find_task_by_pid_ns(pid, current->nsproxy->pid_ns);
+	if (root_task)
+		get_task_struct(root_task);
+	rcu_read_unlock();
+
+	if (!root_task)
+		return -EINVAL;
+
+	/*
+	 * wait for container init to initialize the restart context, then
+	 * grab a reference to that context, and if we're the last task to
+	 * do it, notify the container init.
+	 */
+	ret = wait_event_interruptible(cr_restart_waitq,
+				       root_task->checkpoint_ctx);
+	if (ret < 0)
+		goto out;
+
+	task_lock(root_task);
+	ctx = root_task->checkpoint_ctx;
+	if (ctx)
+		cr_ctx_get(ctx);
+	task_unlock(root_task);
+
+	if (!ctx) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	if (atomic_dec_and_test(&ctx->tasks_count))
+		complete(&ctx->complete);
+
+	/* wait for our turn, do the restore, and tell next task in line */
+	ret = cr_wait_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_next_task(ctx);
+
+ out:
+	cr_ctx_put(ctx);
+	put_task_struct(root_task);
+	return ret;
+}
+
+static int cr_wait_all_tasks_start(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+	current->checkpoint_ctx = ctx;
+
+	wake_up_all(&cr_restart_waitq);
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	task_lock(current);
+	current->checkpoint_ctx = NULL;
+	task_unlock(current);
+
+	return 0;
+}
+
+static int cr_wait_all_tasks_finish(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+
+	ret = cr_next_task(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
 /* setup restart-specific parts of ctx */
 static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
+	ctx->root_pid = pid;
+	ctx->root_task = current;
+	ctx->root_nsproxy = current->nsproxy;
+
+	get_task_struct(ctx->root_task);
+	get_nsproxy(ctx->root_nsproxy);
+
+	atomic_set(&ctx->tasks_count, ctx->pids_nr - 1);
+
 	return 0;
 }
 
-int do_restart(struct cr_ctx *ctx, pid_t pid)
+static int do_restart_root(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_head(ctx);
+
+	/* wait for all other tasks to enter do_restart_task() */
+	ret = cr_wait_all_tasks_start(ctx);
 	if (ret < 0)
 		goto out;
+
 	ret = cr_read_task(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_tail(ctx);
+
+	/* wait for all other tasks to complete do_restart_task() */
+	ret = cr_wait_all_tasks_finish(ctx);
 	if (ret < 0)
 		goto out;
 
-	/* on success, adjust the return value if needed [TODO] */
+	ret = cr_read_tail(ctx);
+
  out:
 	return ret;
 }
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	if (ctx)
+		ret = do_restart_root(ctx, pid);
+	else
+		ret = do_restart_task(ctx, pid);
+
+	/* on success, adjust the return value if needed [TODO] */
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 121c979..188dbe3 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -145,6 +145,8 @@ static void cr_task_arr_free(struct cr_ctx *ctx)
 
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
+	BUG_ON(atomic_read(&ctx->refcount));
+
 	if (ctx->file)
 		fput(ctx->file);
 
@@ -163,6 +165,8 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	if (ctx->root_task)
 		put_task_struct(ctx->root_task);
 
+	kfree(ctx->pids_arr);
+
 	kfree(ctx);
 }
 
@@ -177,7 +181,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	atomic_set(&ctx->refcount, 0);
 	INIT_LIST_HEAD(&ctx->pgarr_list);
+	init_waitqueue_head(&ctx->waitq);
 
 	err = -EBADF;
 	ctx->file = fget(fd);
@@ -192,6 +198,7 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (cr_objhash_alloc(ctx) < 0)
 		goto err;
 
+	atomic_inc(&ctx->refcount);
 	return ctx;
 
  err:
@@ -199,6 +206,17 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	return ERR_PTR(err);
 }
 
+void cr_ctx_get(struct cr_ctx *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+void cr_ctx_put(struct cr_ctx *ctx)
+{
+	if (ctx && atomic_dec_and_test(&ctx->refcount))
+		cr_ctx_free(ctx);
+}
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -226,7 +244,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 	if (!ret)
 		ret = ctx->crid;
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
 
@@ -241,7 +259,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	struct cr_ctx *ctx;
+	struct cr_ctx *ctx = NULL;
 	pid_t pid;
 	int ret;
 
@@ -249,15 +267,17 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 	if (flags)
 		return -EINVAL;
 
-	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
-	if (IS_ERR(ctx))
-		return PTR_ERR(ctx);
-
 	/* FIXME: for now, we use 'crid' as a pid */
 	pid = (pid_t) crid;
 
+	if (pid == task_pid_vnr(current))
+		ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
 	ret = do_restart(ctx, pid);
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2504717..b32c0a7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,10 +13,11 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/sched.h>
+#include <asm/atomic.h>
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 
-#define CR_VERSION  2
+#define CR_VERSION  3
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -34,14 +35,27 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
-	struct task_struct **tasks_arr;	/* array of all tasks in container */
-	int tasks_nr;			/* size of tasks array */
+	atomic_t refcount;
 
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 
 	struct path fs_mnt;	/* container root (FIXME) */
+
+	/* [multi-process checkpoint] */
+	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+	int tasks_nr;                   /* size of tasks array */
+
+	/* [multi-process restart] */
+	struct cr_hdr_pids *pids_arr;	/* array of all pids [restart] */
+	int pids_nr;			/* size of pids array */
+	int pids_pos;			/* position in pids array */
+	int pids_err;			/* error occurred ? */
+	pid_t pids_active;		/* pid of (next) active task */
+	atomic_t tasks_count;		/* sync of restarting tasks */
+	struct completion complete;	/* sync of restarting tasks */
+	wait_queue_head_t waitq;	/* sync of restarting tasks */
 };
 
 /* cr_ctx: flags */
@@ -54,6 +68,9 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+extern void cr_ctx_get(struct cr_ctx *ctx);
+extern void cr_ctx_put(struct cr_ctx *ctx);
+
 /* shared objects handling */
 
 enum {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index faa2ec6..0150e90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,6 +1359,7 @@ struct task_struct {
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 	atomic_t may_checkpoint;
+	struct cr_ctx *checkpoint_ctx;
 #endif
 };
 
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 13/13] Restart multiple processes
@ 2008-12-05 17:31   ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, linux-kernel, linux-mm, linux-api, Linux Torvalds,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, MinChan Kim, arnd, jeremy

Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task will restart
itself in the same order in which the tasks were saved. The internals of
the syscall will take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

The init task (*) has a special role: it allocates the restart context
(ctx), and coordinates the operation. In particular, it first waits
until all participating tasks enter the kernel, and provides them the
common restart context. Once everyone is ready, it begins to restart
itself.
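
The "wait until all participating tasks enter the kernel" step can be
modeled in userspace as a simple countdown. This is only a sketch under
stated assumptions: struct arrival_gate, gate_init() and gate_arrive()
are hypothetical names, C11 <stdatomic.h> stands in for the kernel's
atomic_t, and the boolean return value stands in for the point where the
patch calls complete(&ctx->complete).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the tasks_count field of cr_ctx. */
struct arrival_gate {
	atomic_int tasks_count;	/* tasks that have not yet entered restart */
};

/* Mirrors atomic_set(&ctx->tasks_count, ctx->pids_nr - 1): init is excluded. */
void gate_init(struct arrival_gate *g, int pids_nr)
{
	assert(pids_nr >= 1);
	atomic_init(&g->tasks_count, pids_nr - 1);
}

/*
 * Called once by each non-init task. Returns true only for the last
 * arrival, which is where the waiting init task would be notified.
 */
bool gate_arrive(struct arrival_gate *g)
{
	/* atomic_fetch_sub() returns the value held before the decrement */
	return atomic_fetch_sub(&g->tasks_count, 1) == 1;
}
```

With pids_nr == 4, the first two calls to gate_arrive() return false and
the third returns true, so the init task is notified exactly once.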

In contrast, the other tasks enter the kernel, locate the init task (*)
and grab its restart context, and then wait for their turn to restore.
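
"Grabbing" the restart context amounts to taking a counted reference, so
that the context outlives whichever task finishes first. A minimal
userspace sketch follows, assuming hypothetical names (struct ctx_model,
ctx_model_get(), ctx_model_put()) and a "freed" flag in place of the real
cr_ctx_free(); it is not the patch's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a reference-counted restart context. */
struct ctx_model {
	atomic_int refcount;
	bool *freed;		/* set when the last reference is dropped */
};

void ctx_model_init(struct ctx_model *ctx, bool *freed)
{
	*freed = false;
	ctx->freed = freed;
	atomic_init(&ctx->refcount, 1);	/* reference held by the allocating task */
}

void ctx_model_get(struct ctx_model *ctx)
{
	/* taking a reference is only valid while someone still holds one */
	assert(atomic_load(&ctx->refcount) > 0);
	atomic_fetch_add(&ctx->refcount, 1);
}

void ctx_model_put(struct ctx_model *ctx)
{
	/* NULL is tolerated, as in cr_ctx_put() in the patch */
	if (ctx && atomic_fetch_sub(&ctx->refcount, 1) == 1)
		*ctx->freed = true;	/* stand-in for cr_ctx_free() */
}
```

The context is released only by the final put, regardless of whether
that comes from the init task or from one of the other tasks.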

When a task (init or not) completes its restart, it hands the control
over to the next in line, by waking that task.

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.
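
The turn-taking described above can be sketched as a small state
machine. This is a single-threaded userspace model, not the kernel code:
struct turn_model and the turn_*() helpers are hypothetical names, the
"done" flag stands in for complete(&ctx->complete), and checking
turn_is_mine() replaces sleeping on the wait queue and being woken with
wake_up_process(). Setting pids_active to the first pid at init time is
a simplification made for the model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the pids_* fields of cr_ctx. */
struct turn_model {
	const int *pids_arr;	/* pids in checkpoint order, init first */
	int pids_nr;		/* size of pids array */
	int pids_pos;		/* index of the currently active task */
	int pids_active;	/* pid of the currently active task */
	bool done;		/* all tasks have restored themselves */
};

void turn_init(struct turn_model *t, const int *pids, int nr)
{
	assert(nr >= 1);
	t->pids_arr = pids;
	t->pids_nr = nr;
	t->pids_pos = 0;
	t->pids_active = pids[0];
	t->done = false;
}

/* A task may restore itself only while it is the active one. */
bool turn_is_mine(const struct turn_model *t, int pid)
{
	return !t->done && t->pids_active == pid;
}

/* Called by the active task once its restore is finished. */
void turn_next(struct turn_model *t)
{
	t->pids_pos++;
	if (t->pids_pos == t->pids_nr) {
		t->done = true;		/* last task: notify the coordinator */
		return;
	}
	t->pids_active = t->pids_arr[t->pids_pos];
}
```

For a pids array of {1, 7, 9}, the active task moves from 1 to 7 to 9,
and the third call to turn_next() marks the whole restart as done.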

Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In this case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/restart.c       |  214 +++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/sys.c           |   34 ++++++--
 include/linux/checkpoint.h |   23 ++++-
 include/linux/sched.h      |    1 +
 4 files changed, 258 insertions(+), 14 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index f4f737d..24392ee 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/checkpoint.h>
@@ -276,30 +277,235 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* cr_read_tree - read the tasks tree into the checkpoint context */
+static int cr_read_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tree *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, size, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TREE);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->tasks_nr < 0)
+		goto out;
+
+	ctx->pids_nr = hh->tasks_nr;
+	size = sizeof(*ctx->pids_arr) * ctx->pids_nr;
+	if (size < 0)		/* overflow ? */
+		goto out;
+
+	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+	if (!ctx->pids_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = cr_kread(ctx, ctx->pids_arr, size);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_wait_task(struct cr_ctx *ctx)
+{
+	pid_t pid = task_pid_vnr(current);
+
+	cr_debug("pid %d waiting\n", pid);
+	return wait_event_interruptible(ctx->waitq, ctx->pids_active == pid);
+}
+
+static int cr_next_task(struct cr_ctx *ctx)
+{
+	struct task_struct *tsk;
+
+	ctx->pids_pos++;
+
+	cr_debug("pids_pos %d %d\n", ctx->pids_pos, ctx->pids_nr);
+	if (ctx->pids_pos == ctx->pids_nr) {
+		complete(&ctx->complete);
+		return 0;
+	}
+
+	ctx->pids_active = ctx->pids_arr[ctx->pids_pos].vpid;
+
+	cr_debug("pids_next %d\n", ctx->pids_active);
+
+	rcu_read_lock();
+	tsk = find_task_by_pid_ns(ctx->pids_active, ctx->root_nsproxy->pid_ns);
+	if (tsk)
+		wake_up_process(tsk);
+	rcu_read_unlock();
+
+	if (!tsk) {
+		ctx->pids_err = -ESRCH;
+		complete(&ctx->complete);
+		return -ESRCH;
+	}
+
+	return 0;
+}
+
+/* FIXME: this should be per container */
+DECLARE_WAIT_QUEUE_HEAD(cr_restart_waitq);
+
+static int do_restart_task(struct cr_ctx *ctx, pid_t pid)
+{
+	struct task_struct *root_task;
+	int ret;
+
+	rcu_read_lock();
+	root_task = find_task_by_pid_ns(pid, current->nsproxy->pid_ns);
+	if (root_task)
+		get_task_struct(root_task);
+	rcu_read_unlock();
+
+	if (!root_task)
+		return -EINVAL;
+
+	/*
+	 * wait for container init to initialize the restart context, then
+	 * grab a reference to that context, and if we're the last task to
+	 * do it, notify the container init.
+	 */
+	ret = wait_event_interruptible(cr_restart_waitq,
+				       root_task->checkpoint_ctx);
+	if (ret < 0)
+		goto out;
+
+	task_lock(root_task);
+	ctx = root_task->checkpoint_ctx;
+	if (ctx)
+		cr_ctx_get(ctx);
+	task_unlock(root_task);
+
+	if (!ctx) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	if (atomic_dec_and_test(&ctx->tasks_count))
+		complete(&ctx->complete);
+
+	/* wait for our turn, do the restore, and tell next task in line */
+	ret = cr_wait_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_next_task(ctx);
+
+ out:
+	cr_ctx_put(ctx);
+	put_task_struct(root_task);
+	return ret;
+}
+
+static int cr_wait_all_tasks_start(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+	current->checkpoint_ctx = ctx;
+
+	wake_up_all(&cr_restart_waitq);
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	task_lock(current);
+	current->checkpoint_ctx = NULL;
+	task_unlock(current);
+
+	return 0;
+}
+
+static int cr_wait_all_tasks_finish(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+
+	ret = cr_next_task(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
 /* setup restart-specific parts of ctx */
 static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
+	ctx->root_pid = pid;
+	ctx->root_task = current;
+	ctx->root_nsproxy = current->nsproxy;
+
+	get_task_struct(ctx->root_task);
+	get_nsproxy(ctx->root_nsproxy);
+
+	atomic_set(&ctx->tasks_count, ctx->pids_nr - 1);
+
 	return 0;
 }
 
-int do_restart(struct cr_ctx *ctx, pid_t pid)
+static int do_restart_root(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_head(ctx);
+
+	/* wait for all other tasks to enter do_restart_task() */
+	ret = cr_wait_all_tasks_start(ctx);
 	if (ret < 0)
 		goto out;
+
 	ret = cr_read_task(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_tail(ctx);
+
+	/* wait for all other tasks to complete do_restart_task() */
+	ret = cr_wait_all_tasks_finish(ctx);
 	if (ret < 0)
 		goto out;
 
-	/* on success, adjust the return value if needed [TODO] */
+	ret = cr_read_tail(ctx);
+
  out:
 	return ret;
 }
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	if (ctx)
+		ret = do_restart_root(ctx, pid);
+	else
+		ret = do_restart_task(ctx, pid);
+
+	/* on success, adjust the return value if needed [TODO] */
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 121c979..188dbe3 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -145,6 +145,8 @@ static void cr_task_arr_free(struct cr_ctx *ctx)
 
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
+	BUG_ON(atomic_read(&ctx->refcount));
+
 	if (ctx->file)
 		fput(ctx->file);
 
@@ -163,6 +165,8 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	if (ctx->root_task)
 		put_task_struct(ctx->root_task);
 
+	kfree(ctx->pids_arr);
+
 	kfree(ctx);
 }
 
@@ -177,7 +181,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	atomic_set(&ctx->refcount, 0);
 	INIT_LIST_HEAD(&ctx->pgarr_list);
+	init_waitqueue_head(&ctx->waitq);
 
 	err = -EBADF;
 	ctx->file = fget(fd);
@@ -192,6 +198,7 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (cr_objhash_alloc(ctx) < 0)
 		goto err;
 
+	atomic_inc(&ctx->refcount);
 	return ctx;
 
  err:
@@ -199,6 +206,17 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	return ERR_PTR(err);
 }
 
+void cr_ctx_get(struct cr_ctx *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+void cr_ctx_put(struct cr_ctx *ctx)
+{
+	if (ctx && atomic_dec_and_test(&ctx->refcount))
+		cr_ctx_free(ctx);
+}
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -226,7 +244,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 	if (!ret)
 		ret = ctx->crid;
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
 
@@ -241,7 +259,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	struct cr_ctx *ctx;
+	struct cr_ctx *ctx = NULL;
 	pid_t pid;
 	int ret;
 
@@ -249,15 +267,17 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 	if (flags)
 		return -EINVAL;
 
-	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
-	if (IS_ERR(ctx))
-		return PTR_ERR(ctx);
-
 	/* FIXME: for now, we use 'crid' as a pid */
 	pid = (pid_t) crid;
 
+	if (pid == task_pid_vnr(current))
+		ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
 	ret = do_restart(ctx, pid);
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2504717..b32c0a7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,10 +13,11 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/sched.h>
+#include <asm/atomic.h>
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 
-#define CR_VERSION  2
+#define CR_VERSION  3
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -34,14 +35,27 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
-	struct task_struct **tasks_arr;	/* array of all tasks in container */
-	int tasks_nr;			/* size of tasks array */
+	atomic_t refcount;
 
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 
 	struct path fs_mnt;	/* container root (FIXME) */
+
+	/* [multi-process checkpoint] */
+	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+	int tasks_nr;                   /* size of tasks array */
+
+	/* [multi-process restart] */
+	struct cr_hdr_pids *pids_arr;	/* array of all pids [restart] */
+	int pids_nr;			/* size of pids array */
+	int pids_pos;			/* position in pids array */
+	int pids_err;			/* error occurred ? */
+	pid_t pids_active;		/* pid of (next) active task */
+	atomic_t tasks_count;		/* sync of restarting tasks */
+	struct completion complete;	/* sync of restarting tasks */
+	wait_queue_head_t waitq;	/* sync of restarting tasks */
 };
 
 /* cr_ctx: flags */
@@ -54,6 +68,9 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+extern void cr_ctx_get(struct cr_ctx *ctx);
+extern void cr_ctx_put(struct cr_ctx *ctx);
+
 /* shared objects handling */
 
 enum {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index faa2ec6..0150e90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,6 +1359,7 @@ struct task_struct {
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 	atomic_t may_checkpoint;
+	struct cr_ctx *checkpoint_ctx;
 #endif
 };
 
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 00/13] Kernel based checkpoint/restart
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (14 preceding siblings ...)
  2008-12-05 17:31   ` Oren Laadan
@ 2008-12-06  0:19   ` Serge E. Hallyn
  2008-12-09 19:42   ` Serge E. Hallyn
  2008-12-16 18:43   ` Dave Hansen
  17 siblings, 0 replies; 133+ messages in thread
From: Serge E. Hallyn @ 2008-12-06  0:19 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> Checkpoint-restart (c/r): fixed races in file handling (comments from
> Al Viro). Updated and tested against v2.6.28-rc7 (feaf384...)
> 
> We'd like these to make it into -mm. This version addresses the
> last of the known bugs. Please pull at least the first 11 patches,
> as they are similar to before.
> 
> Patches 1-11 are stable, providing self- and external- c/r of a
> single process.
> Patches 12 and 13 are newer, adding support for c/r of multiple
> processes.
> 
> The git tree tracking v11, branch 'ckpt-v11' (and older versions):
> 	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git
> 
> Restarting multiple processes requires 'mktree' userspace tool:
> 	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

Thanks Oren, this set is working great for me.  Hours of
(run in container; while (1) { checkpoint; kill; restart in container; })
went fine.  500 simultaneous checkpoints of the same task
went fine.  mktree works great.

thanks,
-serge

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
       [not found]   ` <1228498282-11804-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-12-06  7:26     ` Joe Perches
  2008-12-16 19:04     ` Mike Waychison
  2008-12-16 21:54     ` Mike Waychison
  2 siblings, 0 replies; 133+ messages in thread
From: Joe Perches @ 2008-12-06  7:26 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

On Fri, 2008-12-05 at 12:31 -0500, Oren Laadan wrote:
> diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> new file mode 100644
> index 0000000..63f298f
> --- /dev/null
> +++ b/include/linux/checkpoint.h
[]
> +#define cr_debug(fmt, args...)  \
> +	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
> +

perhaps:

#define pr_fmt(fmt) "[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__

and use pr_debug instead?
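
A userspace sketch of that suggestion, for illustration only: pr_fmt()
is defined in the suggested shape, while pr_debug_buf(),
task_pid_vnr_stub() and format_pids_pos() are hypothetical stand-ins
(snprintf() replaces the kernel's printk path, and the stub replaces
task_pid_vnr(current)). It relies on the GNU ##__VA_ARGS__ extension,
as kernel code does.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for task_pid_vnr(current). */
int task_pid_vnr_stub(void)
{
	return 1;
}

/* Prefix format plus its leading arguments, in the suggested shape. */
#define pr_fmt(fmt) "[%d:c/r:%s] " fmt, task_pid_vnr_stub(), __func__

/* Userspace stand-in for pr_debug(): formats into a caller buffer. */
#define pr_debug_buf(buf, size, fmt, ...) \
	snprintf(buf, size, pr_fmt(fmt), ##__VA_ARGS__)

/* Example call site; the prefix is added without a cr_debug() wrapper. */
int format_pids_pos(char *buf, size_t size, int pos, int nr)
{
	assert(buf != NULL && size > 0);
	return pr_debug_buf(buf, size, "pids_pos %d %d\n", pos, nr);
}
```

Because pr_fmt() expands to the format string followed by the pid and
function name, those two values are passed ahead of the caller's own
arguments, matching the order of the conversions in the prefix.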

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 00/13] Kernel based checkpoint/restart
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (15 preceding siblings ...)
  2008-12-06  0:19   ` [RFC v11][PATCH 00/13] Kernel based checkpoint/restart Serge E. Hallyn
@ 2008-12-09 19:42   ` Serge E. Hallyn
  2008-12-16 18:43   ` Dave Hansen
  17 siblings, 0 replies; 133+ messages in thread
From: Serge E. Hallyn @ 2008-12-09 19:42 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> Checkpoint-restart (c/r): fixed races in file handling (comments from
> Al Viro). Updated and tested against v2.6.28-rc7 (feaf384...)
> 
> We'd like these to make it into -mm. This version addresses the
> last of the known bugs. Please pull at least the first 11 patches,
> as they are similar to before.

So far I'm finding no regressions and checkpoint/restart is working
perfectly for me.

Andrew, any chance of getting this round into -mm for some extra
testing?

thanks,
-serge

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 00/13] Kernel based checkpoint/restart
       [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (16 preceding siblings ...)
  2008-12-09 19:42   ` Serge E. Hallyn
@ 2008-12-16 18:43   ` Dave Hansen
  17 siblings, 0 replies; 133+ messages in thread
From: Dave Hansen @ 2008-12-16 18:43 UTC (permalink / raw)
  To: akpm
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

Andrew,

I just realized that you weren't cc'd on these when they were posted.

Can we give them a run in -mm?  As far as I know, all review comments
have been addressed and there's nothing outstanding.

On Fri, 2008-12-05 at 12:31 -0500, Oren Laadan wrote:
> Checkpoint-restart (c/r): fixed races in file handling (comments from
> Al Viro). Updated and tested against v2.6.28-rc7 (feaf384...)
> 
> We'd like these to make it into -mm. This version addresses the
> last of the known bugs. Please pull at least the first 11 patches,
> as they are similar to before.
> 
> Patches 1-11 are stable, providing self- and external- c/r of a
> single process.
> Patches 12 and 13 are newer, adding support for c/r of multiple
> processes.
> 
> The git tree tracking v11, branch 'ckpt-v11' (and older versions):
> 	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git
> 
> Restarting multiple processes requires 'mktree' userspace tool:
> 	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
> 
> Oren.
> 
> 
> --
> Why do we want it?  It allows containers to be moved between physical
> machines' kernels in the same way that VMWare can move VMs between
> physical machines' hypervisors.  There are currently at least two
> out-of-tree implementations of this in the commercial world (IBM's
> Metacluster and Parallels' OpenVZ/Virtuozzo) and several in the academic
> world like Zap.
> 
> Why do we need it in mainline now?  Because we already have plenty of
> out-of-tree ones, and want to know what an in-tree one will be like.  :)
> What *I* want right now is the extra review and scrutiny that comes with
> a mainline submission to make sure we're not going in a direction
> contrary to the community.
> 
> This only supports pretty simple apps.  But, I trust Ingo when he says:
> >> > > Generally, if something works for simple apps already (in a robust, 
> >> > > compatible and supportable way) and users find it "very cool", then 
> >> > > support for more complex apps is not far in the future.  but if you
> >> > > want to support more complex apps straight away, it takes forever and
> >> > > gets ugly.
> 
> We're *certainly* going to be changing the ABI (which is the format of
> the checkpoint).  I'd like to follow the model that we used for
> ext4-dev, which is to make it very clear that this is a development-only
> feature for now.  Perhaps we do that by making the interface only
> available through debugfs or something similar for now.  Or, reserving
> the syscall numbers but require some runtime switch to be thrown before
> they can be used.  I'm open to suggestions here.
> --
> 
> --
> Todo:
> - Add support for x86-64 and improve ABI
> - Refine or change syscall interface
> - Handle multiple namespaces in a container (e.g. save the filesystem
>   namespaces state with the file descriptors)
> - Security (without CAP_SYS_ADMIN, file restore may fail)
> 
> Changelog:
> 
> [2008-Dec-05] v11:
>   - Use contents of 'init->fs->root' instead of pointing to it
>   - Ignore symlinks (there is no such thing as an open symlink)
>   - cr_scan_fds() retries from scratch if it hits size limits
>   - Add missing test for VM_MAYSHARE when dumping memory
>   - Improve documentation about: behavior when tasks aren't frozen,
>     life span of the object hash, references to objects in the hash
> 
> [2008-Nov-26] v10:
>   - Grab vfs root of container init, rather than current process
>   - Acquire dcache_lock around call to __d_path() in cr_fill_name()
>   - Force end-of-string in cr_read_string() (fix possible DoS)
>   - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
> 
> [2008-Nov-10] v9:
>   - Support multiple processes c/r
>   - Extend checkpoint header with architecture-dependent header
>   - Misc bug fixes (see individual changelogs)
>   - Rebase to v2.6.28-rc3.
> 
> [2008-Oct-29] v8:
>   - Support "external" checkpoint
>   - Include Dave Hansen's 'deny-checkpoint' patch
>   - Split docs in Documentation/checkpoint/..., and improve contents
> 
> [2008-Oct-17] v7:
>   - Fix save/restore state of FPU
>   - Fix argument given to kunmap_atomic() in memory dump/restore
> 
> [2008-Oct-07] v6:
>   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
>     (even though it's not really needed)
>   - Add assumptions and what's-missing to documentation
>   - Misc fixes and cleanups
> 
> [2008-Sep-11] v5:
>   - Config is now 'def_bool n' by default
>   - Improve memory dump/restore code (following Dave Hansen's comments)
>   - Change dump format (and code) to allow chunks of <vaddrs, pages>
>     instead of one long list of each
>   - Fix use of follow_page() to avoid faulting in non-present pages
>   - Memory restore now maps user pages explicitly to copy data into them,
>     instead of reading directly to user space; got rid of mprotect_fixup()
>   - Remove preempt_disable() when restoring debug registers
>   - Rename header files s/ckpt/checkpoint/
>   - Fix misc bugs in files dump/restore
>   - Fixes and cleanups on some error paths
>   - Fix misc coding style
> 
> [2008-Sep-09] v4:
>   - Various fixes and clean-ups
>   - Fix calculation of hash table size
>   - Fix header structure alignment
>   - Use standard list_... for cr_pgarr
> 
> [2008-Aug-29] v3:
>   - Various fixes and clean-ups
>   - Use standard hlist_... for hash table
>   - Better use of standard kmalloc/kfree
> 
> [2008-Aug-20] v2:
>   - Added dump and restore of open files (regular and directories)
>   - Added basic handling of shared objects, and improve handling of
>     'parent tag' concept
>   - Added documentation
>   - Improved ABI, 64bit padding for image data
>   - Improved locking when saving/restoring memory
>   - Added UTS information to header (release, version, machine)
>   - Cleanup extraction of filename from a file pointer
>   - Refactor to allow easier reviewing
>   - Remove requirement for CAP_SYS_ADMIN until we come up with a
>     security policy (this means that file restore may fail)
>   - Other cleanup and response to comments for v1
> 
> [2008-Jul-29] v1:
>   - Initial version: support a single task with address space of only
>     private anonymous or file-mapped VMAs; syscalls ignore pid/crid
>     argument and act on current process.
> 
> --
> At the containers mini-conference before OLS, the consensus among
> all the stakeholders was that doing checkpoint/restart in the kernel
> as much as possible was the best approach.  With this approach, the
> kernel will export a relatively opaque 'blob' of data to userspace
> which can then be handed to the new kernel at restore time.
> 
> This is different than what had been proposed before, which was
> that a userspace application would be responsible for collecting
> all of this data.  We were also planning on adding lots of new,
> little kernel interfaces for all of the things that needed
> checkpointing.  This unites those into a single, grand interface.
> 
> The 'blob' will contain copies of select portions of kernel
> structures such as vmas and mm_structs.  It will also contain
> copies of the actual memory that the process uses.  Any changes
> in this blob's format between kernel revisions can be handled by
> an in-userspace conversion program.
> 
> This is a similar approach to virtually all of the commercial
> checkpoint/restart products out there, as well as the research
> project Zap.
> 
> These patches basically serialize internal kernel state and write
> it out to a file descriptor.  The checkpoint and restore are done
> with two new system calls: sys_checkpoint and sys_restart.
> 
> In this incarnation, they can only checkpoint and restore a
> single task. The task's address space may consist of only private,
> simple vma's - anonymous or file-mapped. The open files may consist
> of only simple files and directories.
> --
-- Dave

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
       [not found]   ` <1228498282-11804-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-12-06  7:26     ` Joe Perches
@ 2008-12-16 19:04     ` Mike Waychison
  2008-12-16 21:54     ` Mike Waychison
  2 siblings, 0 replies; 133+ messages in thread
From: Mike Waychison @ 2008-12-16 19:04 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

Oren Laadan wrote:

> +/*
> + * Helpers to write(read) from(to) kernel space to(from) the checkpoint
> + * image file descriptor (similar to how a core-dump is performed).
> + *
> + *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
> + *   cr_kread() - read from the checkpoint image to a kernel-space buffer
> + */
> +
> +int cr_kwrite(struct cr_ctx *ctx, void *addr, int count)
> +{
> +	struct file *file = ctx->file;
> +	mm_segment_t fs;
> +	ssize_t nwrite;
> +	int nleft;
> +
> +	fs = get_fs();
> +	set_fs(KERNEL_DS);
> +	for (nleft = count; nleft; nleft -= nwrite) {
> +		nwrite = file->f_op->write(file, addr, nleft, &file->f_pos);
> +		if (nwrite < 0) {
> +			if (nwrite == -EAGAIN)
> +				nwrite = 0;
> +			else

set_fs(fs) here; otherwise this error return leaves KERNEL_DS in effect
(the else branch will need braces).

> +				return nwrite;
> +		}
> +		addr += nwrite;
> +	}
> +	set_fs(fs);
> +	ctx->total += count;
> +	return 0;
> +}
> +
> +int cr_kread(struct cr_ctx *ctx, void *addr, int count)
> +{
> +	struct file *file = ctx->file;
> +	mm_segment_t fs;
> +	ssize_t nread;
> +	int nleft;
> +
> +	fs = get_fs();
> +	set_fs(KERNEL_DS);
> +	for (nleft = count; nleft; nleft -= nread) {
> +		nread = file->f_op->read(file, addr, nleft, &file->f_pos);
> +		if (nread <= 0) {
> +			if (nread == -EAGAIN) {
> +				nread = 0;
> +				continue;
> +			} else if (nread == 0)
> +				nread = -EPIPE;		/* unexpected EOF */

set_fs(fs) here as well, before the return.

> +			return nread;
> +		}
> +		addr += nread;
> +	}
> +	set_fs(fs);
> +	ctx->total += count;
> +	return 0;
> +}
> +
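
For illustration only, here is a minimal userspace sketch of the fix
being asked for: the loop breaks out on error instead of returning, so
the saved state is restored on every path. Everything in it is a
hypothetical stand-in (get_seg()/set_seg() model get_fs()/set_fs(),
sink_fn models f_op->write, and the two toy sinks exist only to
exercise it); it is not the code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-ins for the kernel's address-limit state. */
enum seg { SEG_USER, SEG_KERNEL };
static enum seg cur_seg = SEG_USER;

static enum seg get_seg(void) { return cur_seg; }
static void set_seg(enum seg s) { cur_seg = s; }

/* Hypothetical sink: returns bytes consumed or a negative errno. */
typedef ssize_t (*sink_fn)(const void *buf, size_t len);

/* Toy sinks: one accepts at most 3 bytes per call, one always fails. */
static ssize_t sink_short(const void *buf, size_t len)
{
	(void)buf;
	return len > 3 ? 3 : (ssize_t)len;
}

static ssize_t sink_fail(const void *buf, size_t len)
{
	(void)buf;
	(void)len;
	return -EIO;
}

int model_kwrite(sink_fn sink, const void *addr, size_t count)
{
	const char *p = addr;
	enum seg saved = get_seg();
	ssize_t n;
	int ret = 0;

	assert(sink != NULL);
	set_seg(SEG_KERNEL);
	while (count) {
		n = sink(p, count);
		if (n == -EAGAIN)
			continue;	/* transient, retry as the patch does */
		if (n < 0) {
			ret = (int)n;	/* remember the error, do not return */
			break;
		}
		p += n;
		count -= (size_t)n;
	}
	set_seg(saved);			/* runs on success and failure alike */
	return ret;
}
```

With a single exit there is only one place where the restore can be
forgotten, which is the same property the error-path comments above
are trying to recover by hand.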

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
       [not found]     ` <4947FBC8.2000601-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2008-12-16 19:28       ` Linus Torvalds
  0 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2008-12-16 19:28 UTC (permalink / raw)
  To: Mike Waychison
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	H. Peter Anvin, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Alexander Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar



On Tue, 16 Dec 2008, Mike Waychison wrote:
> 
> set_fs(fs) here

Btw, this all is an excellent example of why people should try to aim for 
small functions and use lots of them.

It's often _way_ more readable to do

	static inline int __some_fn(...)
	{
		.. do the real work here ..
	}

	int some_fn(...)
	{
		int retval;

		prepare();
		retval = __some_fn(..);
		finish();

		return retval;
	}

where "prepare/finish" can be about locking, or set_fs(), or allocation 
and de-allocation of temporary buffers, or any number of things like that.

With set_fs() in particular, the wrapper function also tends to be the 
perfect place to change a regular (kernel) pointer into a user pointer. 
IOW, it's the place to make sparse happy, where you can do things like

	uptr = (__force void __user *)ptr;

and comment on the fact that the forced user pointer cast is valid only 
because of the set_fs().

Because it looks like the code isn't sparse-clean.

Btw, I also think that code like this is bogus:

	nwrite = file->f_op->write(file, addr, nleft, &file->f_pos);

because you're not supposed to pass in the raw file->f_pos to that 
function. It's fundamentally thread-unsafe. I realize that maybe you don't 
care, but the thing is, you're supposed to do

	loff_t pos = file->f_pos;
	..
	nwrite = file->f_op->write(file, addr, nleft, &pos);
	..
	file->f_pos = pos;

and in fact preferably use "file_pos_read()" and "file_pos_write()" (but 
we've never exposed them outside of fs/read_write.c, so I guess we should 
do that).

And yes, I realize that some code does take the address of f_pos directly 
(splice, nfsctl, others), and I realize that it works, but it's still bad 
form. Please don't add more of them.

			Linus
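
As a rough userspace sketch of the two suggestions above (a thin
wrapper doing prepare/finish around a worker, and a local copy of the
file position that is written back once), assuming a made-up
struct model_file and an io_mode flag standing in for the set_fs()
state. None of these names come from the kernel or from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical model of an open file: a byte array plus a shared offset. */
struct model_file {
	char data[64];
	off_t f_pos;
};

static int io_mode;	/* hypothetical stand-in for the set_fs() state */

/* Worker: does the real I/O against a caller-owned position. */
static ssize_t do_model_write(struct model_file *f, const void *buf,
			      size_t len, off_t *pos)
{
	if (*pos < 0 || len > sizeof(f->data) ||
	    (size_t)*pos > sizeof(f->data) - len)
		return -ENOSPC;
	memcpy(f->data + *pos, buf, len);
	*pos += (off_t)len;
	return (ssize_t)len;
}

/* Wrapper: prepare, call the worker, finish; one exit path. */
ssize_t model_write(struct model_file *f, const void *buf, size_t len)
{
	off_t pos;
	int saved = io_mode;
	ssize_t ret;

	assert(f != NULL && buf != NULL);
	pos = f->f_pos;		/* work on a local copy of the offset */

	io_mode = 1;		/* prepare() */
	ret = do_model_write(f, buf, len, &pos);
	io_mode = saved;	/* finish() */

	if (ret >= 0)
		f->f_pos = pos;	/* publish the new offset once */
	return ret;
}
```

The worker never sees f->f_pos directly, so a failed call leaves the
shared offset untouched, and the restore of io_mode cannot be skipped
by an early return inside the worker.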

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
       [not found]   ` <1228498282-11804-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-12-06  7:26     ` Joe Perches
  2008-12-16 19:04     ` Mike Waychison
@ 2008-12-16 21:54     ` Mike Waychison
  2 siblings, 0 replies; 133+ messages in thread
From: Mike Waychison @ 2008-12-16 21:54 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

Oren Laadan wrote:
> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> index 375129c..bd14ef9 100644
> --- a/checkpoint/sys.c
> +++ b/checkpoint/sys.c

> +/*
> + * During checkpoint and restart the code writes out/reads in data
> + * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
> + * Because operations can be nested, use cr_hbuf_get() to reserve space
> + * in the buffer, then cr_hbuf_put() when you no longer need that space.
> + */

This seems a bit overkill for buffer management, no?  The only large 
header seems to be cr_hdr_head and the blowup comes from utsinfo string 
data (which could easily be moved out to be in its own CR_HDR_STRING 
blocks).

Wouldn't it be easier to use stack-local storage than balancing the 
cr_hbuf_get/put routines?

> +
> +/*
> + * ctx->hbuf is used to hold headers and data of known (or bound),
> + * static sizes. In some cases, multiple headers may be allocated in
> + * a nested manner. The size should accommodate all headers, nested
> + * or not, on all archs.
> + */
> +#define CR_HBUF_TOTAL  (8 * 4096)
> +
> +/**
> + * cr_hbuf_get - reserve space on the hbuf
> + * @ctx: checkpoint context
> + * @n: number of bytes to reserve
> + *
> + * Returns pointer to reserved space
> + */
> +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
> +{
> +	void *ptr;
> +
> +	/*
> +	 * Since requests depend on logic and static header sizes (not on
> +	 * user data), space should always suffice, unless someone either
> +	 * made a structure bigger or call path deeper than expected.
> +	 */
> +	BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
> +	ptr = ctx->hbuf + ctx->hpos;
> +	ctx->hpos += n;
> +	return ptr;
> +}
> +
> +/**
> + * cr_hbuf_put - unreserve space on the hbuf
> + * @ctx: checkpoint context
> + * @n: number of bytes to unreserve
> + */
> +void cr_hbuf_put(struct cr_ctx *ctx, int n)
> +{
> +	BUG_ON(ctx->hpos < n);
> +	ctx->hpos -= n;
> +}
> +
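
To make the balancing concern concrete, here is a small userspace
model of the get/put discipline in the quoted patch. It is a sketch
under assumptions: struct hbuf_ctx, hbuf_get() and hbuf_put() are
hypothetical names, and the overflow case returns NULL where the patch
uses BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

#define HBUF_TOTAL (8 * 4096)

/* Hypothetical stand-in for the hbuf/hpos fields of struct cr_ctx. */
struct hbuf_ctx {
	char hbuf[HBUF_TOTAL];
	size_t hpos;		/* bytes currently reserved */
};

/* Reserve n bytes at the top of the stack; NULL if it would overflow. */
void *hbuf_get(struct hbuf_ctx *ctx, size_t n)
{
	void *ptr;

	assert(ctx != NULL);
	if (n > HBUF_TOTAL - ctx->hpos)
		return NULL;	/* the quoted patch uses BUG_ON() here */
	ptr = ctx->hbuf + ctx->hpos;
	ctx->hpos += n;
	return ptr;
}

/* Release the n most recently reserved bytes (LIFO order only). */
void hbuf_put(struct hbuf_ctx *ctx, size_t n)
{
	assert(ctx != NULL && ctx->hpos >= n);
	ctx->hpos -= n;
}
```

Every hbuf_get() has to be matched by an hbuf_put() of the same size in
reverse order, which is bookkeeping a stack-local header would get for
free from the compiler.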

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
       [not found]     ` <49482394.10006-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2008-12-16 22:14       ` Dave Hansen
  0 siblings, 0 replies; 133+ messages in thread
From: Dave Hansen @ 2008-12-16 22:14 UTC (permalink / raw)
  To: Mike Waychison
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	H. Peter Anvin, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar

On Tue, 2008-12-16 at 13:54 -0800, Mike Waychison wrote:
> Oren Laadan wrote:
> > diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> > index 375129c..bd14ef9 100644
> > --- a/checkpoint/sys.c
> > +++ b/checkpoint/sys.c
> 
> > +/*
> > + * During checkpoint and restart the code writes out/reads in data
> > + * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
> > + * Because operations can be nested, use cr_hbuf_get() to reserve space
> > + * in the buffer, then cr_hbuf_put() when you no longer need that space.
> > + */
> 
> This seems a bit overkill for buffer management, no?  The only large 
> header seems to be cr_hdr_head and the blowup comes from utsinfo string 
> data (which could easily be moved out to be in its own CR_HDR_STRING 
> blocks).
> 
> Wouldn't it be easier to use stack-local storage than balancing the 
> cr_hbuf_get/put routines?

I've asked the same question, so I'll give you Oren's response that I
remember:

cr_hbuf_get/put() are more of an API that we can use later.  For now,
those buffers really are temporary.  But, in a case where we want to do
a really fast checkpoint (to reduce "downtime" during the checkpoint) we
store the image entirely in kernel memory to be written out later.

In that case, cr_hbuf_put() stops doing anything at all because we keep
the memory around.

cr_hbuf_get() becomes, "I need some memory to write some checkpointy
things into".

cr_hbuf_put() becomes, "I'm done with this for now, only keep it if
someone else needs it."

This might all be a lot clearer if we just kept more explicit
accounting of who is using the objects.  Something like:

struct cr_buf {
	struct kref ref;
	int size;
	char buf[0];
};

/* replaces cr_hbuf_get() */
struct cr_buf *alloc_cr_buf(int size, gfp_t flags)
{
	struct cr_buf *buf;

	buf = kmalloc(sizeof(*buf) + size, flags);
	if (!buf)
		return NULL;
	kref_init(&buf->ref);	/* starts with one reference */
	buf->size = size;
	return buf;
}

int cr_kwrite(struct cr_buf *buf)
{
	if (writing_checkpoint_now) {
		// or whatever this write call was...
		vfs_write(&buf->buf[0], buf->size);
	} else if (deferring_write) {
		/* hold a reference until the deferred write happens */
		kref_get(&buf->ref);
	}
	return 0;
}

-- Dave
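
A userspace sketch of the same idea, for readers who want to see the
reference lifetime play out. It assumes a plain counter in place of
struct kref (so it is not thread-safe, unlike a real kref) and uses
hypothetical names throughout (struct model_buf, model_buf_alloc(),
model_buf_get(), model_buf_put()).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace analogue of the proposed struct cr_buf. */
struct model_buf {
	unsigned int refs;	/* plain counter standing in for struct kref */
	size_t size;
	char data[];		/* payload follows the header */
};

struct model_buf *model_buf_alloc(size_t size)
{
	struct model_buf *b = malloc(sizeof(*b) + size);

	if (!b)
		return NULL;
	b->refs = 1;		/* the caller holds the first reference */
	b->size = size;
	memset(b->data, 0, size);
	return b;
}

/* Take an extra reference, e.g. when a write is being deferred. */
void model_buf_get(struct model_buf *b)
{
	assert(b != NULL && b->refs > 0);
	b->refs++;
}

/* Drop one reference; returns 1 if this call freed the buffer. */
int model_buf_put(struct model_buf *b)
{
	assert(b != NULL && b->refs > 0);
	if (--b->refs)
		return 0;
	free(b);
	return 1;
}
```

In the deferred case the writer takes its own reference, so the
producer's put no longer frees the buffer; the last holder to call
model_buf_put() does.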

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
  2008-12-16 22:14       ` Dave Hansen
  (?)
@ 2008-12-16 22:43       ` Mike Waychison
  -1 siblings, 0 replies; 133+ messages in thread
From: Mike Waychison @ 2008-12-16 22:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	H. Peter Anvin, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar

Dave Hansen wrote:
> On Tue, 2008-12-16 at 13:54 -0800, Mike Waychison wrote:
>> Oren Laadan wrote:
>>> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
>>> index 375129c..bd14ef9 100644
>>> --- a/checkpoint/sys.c
>>> +++ b/checkpoint/sys.c
>>> +/*
>>> + * During checkpoint and restart the code writes out/reads in data
>>> + * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
>>> + * Because operations can be nested, use cr_hbuf_get() to reserve space
>>> + * in the buffer, then cr_hbuf_put() when you no longer need that space.
>>> + */
>> This seems a bit overkill for buffer management, no?  The only large 
>> header seems to be cr_hdr_head and the blowup comes from utsinfo string 
>> data (which could easily be moved out to be in its own CR_HDR_STRING 
>> blocks).
>>
>> Wouldn't it be easier to use stack-local storage than balancing the 
>> cr_hbuf_get/put routines?
> 
> I've asked the same question, so I'll give you Oren's response that I
> remember:
> 
> cr_hbuf_get/put() are more of an API that we can use later.  For now,
> those buffers really are temporary.  But, in a case where we want to do
> a really fast checkpoint (to reduce "downtime" during the checkpoint) we
> store the image entirely in kernel memory to be written out later.
> 

Hmm, if I'm understanding you correctly, adding ref counts explicitly 
(like you suggest below) would be used to let a lower layer defer 
writes.  Seems like this could be done just as easily with explicit 
kmallocs and transferring ownership of the allocated memory to the 
in-kernel representation handling layer below (which in turn queues the 
data structures for writes).

Any such layer would probably need to hold references to objects 
enqueued for write-out, so it will still need a full cleanup path in case 
of success/error/abort (which means that any advantage of creating a 
pool of allocations for O(1) cleanup disappears).

Reference counting these guys doesn't have a clear advantage to me. 
They seem to have a pretty linear lifetime.
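
A minimal userspace sketch of the ownership-transfer alternative
described above, with no reference counts at all. The names
(struct wnode, struct wqueue, wnode_alloc(), wqueue_push(),
wqueue_drain()) are hypothetical, and malloc/free stand in for
kmalloc/kfree.

```c
#include <assert.h>
#include <stdlib.h>

/* One queued record; the payload follows the header. */
struct wnode {
	struct wnode *next;
	size_t size;
	char data[];
};

/* Hypothetical write-out queue owned by the lower layer. */
struct wqueue {
	struct wnode *head;
	struct wnode *tail;
};

struct wnode *wnode_alloc(size_t size)
{
	struct wnode *n = calloc(1, sizeof(*n) + size);

	if (n)
		n->size = size;
	return n;
}

/* Enqueue transfers ownership: the caller must not touch n afterwards. */
void wqueue_push(struct wqueue *q, struct wnode *n)
{
	assert(q != NULL && n != NULL);
	n->next = NULL;
	if (q->tail)
		q->tail->next = n;
	else
		q->head = n;
	q->tail = n;
}

/*
 * Free every queued node, emitting it first when emit is non-NULL.
 * The same walk serves success, error and abort. Returns the number
 * of nodes freed.
 */
size_t wqueue_drain(struct wqueue *q, void (*emit)(const char *, size_t))
{
	struct wnode *n, *next;
	size_t freed = 0;

	assert(q != NULL);
	for (n = q->head; n; n = next) {
		next = n->next;
		if (emit)
			emit(n->data, n->size);
		free(n);
		freed++;
	}
	q->head = q->tail = NULL;
	return freed;
}
```

Each buffer has exactly one owner at a time (the producer until
wqueue_push(), the queue afterwards), which matches the linear
lifetime argued for above.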

> In that case, cr_hbuf_put() stops doing anything at all because we keep
> the memory around.
> 
> cr_hbuf_get() becomes, "I need some memory to write some checkpointy
> things into".
> 
> cr_hbuf_put() becomes, "I'm done with this for now, only keep it if
> someone else needs it."
> 
> This might all be a lot clearer if we just kept more explicit
> accounting of who is using the objects.  Something like:
> 
> struct cr_buf {
> 	struct kref ref;
> 	int size;
> 	char buf[0];
> };
> 
> /* replaces cr_hbuf_get() */
> struct cr_buf *alloc_cr_buf(int size, gfp_t flags)
> {
> 	struct cr_buf *buf;
> 
> 	buf = kmalloc(sizeof(*buf) + size, flags);
> 	if (!buf)
> 		return NULL;
> 	kref_init(&buf->ref);	/* starts with one reference */
> 	buf->size = size;
> 	return buf;
> }
> 
> int cr_kwrite(struct cr_buf *buf)
> {
> 	if (writing_checkpoint_now) {
> 		// or whatever this write call was...
> 		vfs_write(&buf->buf[0], buf->size);
> 	} else if (deferring_write) {
> 		/* hold a reference until the deferred write happens */
> 		kref_get(&buf->ref);
> 	}
> 	return 0;
> }
> 
> -- Dave
> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
  2008-12-16 22:14       ` Dave Hansen
                         ` (3 preceding siblings ...)
  (?)
@ 2008-12-16 23:42       ` Oren Laadan
  -1 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-16 23:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, linux-mm,
	Linux Torvalds, Alexander Viro, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar



Dave Hansen wrote:
> On Tue, 2008-12-16 at 13:54 -0800, Mike Waychison wrote:
>> Oren Laadan wrote:
>>> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
>>> index 375129c..bd14ef9 100644
>>> --- a/checkpoint/sys.c
>>> +++ b/checkpoint/sys.c
>>> +/*
>>> + * During checkpoint and restart the code writes outs/reads in data
>>> + * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
>>> + * Because operations can be nested, use cr_hbuf_get() to reserve space
>>> + * in the buffer, then cr_hbuf_put() when you no longer need that space.
>>> + */
>> This seems a bit over-kill for buffer management no?  The only large 
>> header seems to be cr_hdr_head and the blowup comes from utsinfo string 
>> data (which could easily be moved out to be in it's own CR_HDR_STRING 
>> blocks).
>>
>> Wouldn't it be easier to use stack-local storage than balancing the 
>> cr_hbuf_get/put routines?
> 
> I've asked the same question, so I'll give you Oren's response that I
> remember:
> 
> cr_hbuf_get/put() are more of an API that we can use later.  For now,
> those buffers really are temporary.  But, in a case where we want to do
> a really fast checkpoint (to reduce "downtime" during the checkpoint) we
> store the image entirely in kernel memory to be written out later.

Precisely.

Note that by "store the image entirely" we mean everything that cannot
be saved COW - so memory pages are not duplicated; the rest of the data
tends to take less than 5-10% of the total size.

Buffering the checkpoint image in kernel is useful to reduce downtime
during checkpoint, and also useful for super-fast rollback of a task
(or container) by always keeping everything in memory.

This abstraction is also useful for restart, e.g. to implement read-ahead
of the checkpoint image into the kernel.

Note also that in the future we will have larger headers (e.g. to record
the state of a socket), and there may be some nested calls (e.g. to dump
a connected unix-domain socket we will want to first save the "parent"
listening socket, and also there is nesting in restart).

Instead of using the stack for some headers and memory allocation for
other headers, this abstraction provides a standard interface for all
checkpoint/restart code (and the actual implementation may vary for
different purposes).

> 
> In that case, cr_hbuf_put() stops doing anything at all because we keep
> the memory around.
> 
> cr_hbuf_get() becomes, "I need some memory to write some checkpointy
> things into".
> 
> cr_hbuf_put() becomes, "I'm done with this for now, only keep it if
> someone else needs it."
> 
> This might all be a lot clearer if we just kept some more explicit
> accounting around about who is using the objects.  Something like:
> 
> struct cr_buf {
> 	struct kref ref;
> 	int size;
> 	char buf[0];
> };
> 
> /* replaces cr_hbuf_get() */
> struct cr_buf *alloc_cr_buf(int size, gfp_t flags)
> {
> 	struct cr_buf *buf;
> 
> 	buf = kmalloc(sizeof(cr_buf) + size, flags);
> 	if (!buf)
> 		return NULL;
> 	buf->ref = 1; /* or whatever */
> 	buf->size = size;
> 	return buf;
> }
> 
> int cr_kwrite(struct cr_buf *buf)
> {
> 	if (writing_checkpoint_now) {
> 		// or whatever this write call was...
> 		vfs_write(&buf->buf[0], buf->size);
> 	} else if (deferring_write) {		
> 		kref_get(buf->kref);
> 	}
> }

Yes, something like that, except you can do without the reference count
since the buffer is tied to 'struct cr_ctx'; so this is what I had in
mind:

In non-buffering mode - as it is now - cr_kwrite() will write the data out.

In buffering mode (not implemented yet), cr_kwrite() will either do nothing
(if we borrow from the current code, which uses a temp buffer), or attach
the 'struct cr_buf' to the 'struct cr_ctx' (if we borrow Dave's suggestion
above).

In buffering mode, we'll also need 'cr_writeout()' which will write out
the entire buffer to the 'struct file'.  (This function will do nothing
in non-buffering mode).

Finally, the buffers will be freed when the 'struct cr_ctx' is cleaned up.
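
The two modes above can be sketched in userspace C, as an illustration
only.  Assumptions: malloc()/free() stand in for kmalloc()/kfree(), a
plain char array stands in for the 'struct file', 'struct cr_ctx' is
pared down to the few hypothetical fields the sketch needs, and
cr_ctx_init() and cr_ctx_free_bufs() are made-up helper names.  The
sketch takes the "attach the buffer to the ctx" variant of buffering
mode, and lets cr_kwrite() take ownership of the buffer in both modes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* As in the quoted sketch, but without the kref. */
struct cr_buf {
	struct cr_buf *next;	/* link on ctx->bufs in buffering mode */
	size_t size;
	char buf[];
};

/* Hypothetical, pared-down stand-in for 'struct cr_ctx'. */
struct cr_ctx {
	int buffering;		/* 0: write through, 1: defer */
	struct cr_buf *bufs;	/* deferred buffers, oldest first */
	struct cr_buf **tail;
	char *out;		/* stands in for the 'struct file' */
	size_t out_len;
};

void cr_ctx_init(struct cr_ctx *ctx, int buffering, char *out)
{
	ctx->buffering = buffering;
	ctx->bufs = NULL;
	ctx->tail = &ctx->bufs;
	ctx->out = out;
	ctx->out_len = 0;
}

struct cr_buf *alloc_cr_buf(size_t size)
{
	struct cr_buf *b = malloc(sizeof(*b) + size);

	if (!b)
		return NULL;
	b->next = NULL;
	b->size = size;
	return b;
}

/* Append one buffer to the output; the caller sizes 'out' adequately. */
static void cr_emit(struct cr_ctx *ctx, const struct cr_buf *b)
{
	memcpy(ctx->out + ctx->out_len, b->buf, b->size);
	ctx->out_len += b->size;
}

/* Non-buffering: write now and free.  Buffering: attach to the ctx. */
void cr_kwrite(struct cr_ctx *ctx, struct cr_buf *b)
{
	if (!ctx->buffering) {
		cr_emit(ctx, b);
		free(b);
		return;
	}
	*ctx->tail = b;
	ctx->tail = &b->next;
}

/* Write out everything attached so far; nothing is attached (so this
 * does nothing) in non-buffering mode.  Buffers stay attached. */
void cr_writeout(struct cr_ctx *ctx)
{
	struct cr_buf *b;

	for (b = ctx->bufs; b; b = b->next)
		cr_emit(ctx, b);
}

/* Free whatever is still attached; called when the ctx is cleaned up. */
void cr_ctx_free_bufs(struct cr_ctx *ctx)
{
	struct cr_buf *b = ctx->bufs;

	while (b) {
		struct cr_buf *next = b->next;

		free(b);
		b = next;
	}
	ctx->bufs = NULL;
	ctx->tail = &ctx->bufs;
}
```

Keeping the buffers attached after cr_writeout() is a choice made for
this sketch; it matches the "always keeping everything in memory" use
case, at the price of holding the memory until the ctx goes away.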

Oren.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
       [not found]         ` <49482F14.1040407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2008-12-17  0:13           ` Dave Hansen
  0 siblings, 0 replies; 133+ messages in thread
From: Dave Hansen @ 2008-12-17  0:13 UTC (permalink / raw)
  To: Mike Waychison
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, linux-mm,
	Linux Torvalds, Alexander Viro, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar

On Tue, 2008-12-16 at 14:43 -0800, Mike Waychison wrote:
> Hmm, if I'm understanding you correctly, adding ref counts explicitly 
> (like you suggest below)  would be used to let a lower layer defer 
> writes.  Seems like this could be just as easily done with explicits 
> kmallocs and transferring ownership of the allocated memory to the 
> in-kernel representation handling layer below (which in turn queues the 
> data structures for writes).

Yup, that's true.  We'd effectively shift the burden of freeing those
buffers into the cr_write() (or whatever we call it) function.  

But, I'm just thinking about the sys_checkpoint() side.  I need to go
look at the restart code too.

-- Dave

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
       [not found]         ` <49483D01.1050603-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-12-17  0:42           ` Mike Waychison
  0 siblings, 0 replies; 133+ messages in thread
From: Mike Waychison @ 2008-12-17  0:42 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linux Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Oren Laadan wrote:
> 
> Dave Hansen wrote:
>> On Tue, 2008-12-16 at 13:54 -0800, Mike Waychison wrote:
>>> Oren Laadan wrote:
>>>> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
>>>> index 375129c..bd14ef9 100644
>>>> --- a/checkpoint/sys.c
>>>> +++ b/checkpoint/sys.c
>>>> +/*
>>>> + * During checkpoint and restart the code writes outs/reads in data
>>>> + * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
>>>> + * Because operations can be nested, use cr_hbuf_get() to reserve space
>>>> + * in the buffer, then cr_hbuf_put() when you no longer need that space.
>>>> + */
>>> This seems a bit over-kill for buffer management no?  The only large 
>>> header seems to be cr_hdr_head and the blowup comes from utsinfo string 
>>> data (which could easily be moved out to be in it's own CR_HDR_STRING 
>>> blocks).
>>>
>>> Wouldn't it be easier to use stack-local storage than balancing the 
>>> cr_hbuf_get/put routines?
>> I've asked the same question, so I'll give you Oren's response that I
>> remember:
>>
>> cr_hbuf_get/put() are more of an API that we can use later.  For now,
>> those buffers really are temporary.  But, in a case where we want to do
>> a really fast checkpoint (to reduce "downtime" during the checkpoint) we
>> store the image entirely in kernel memory to be written out later.
> 
> Precisely.
> 
> Note that by "store the image entirely" we mean everything that cannot
> be saved COW - so memory pages are not duplicated; the rest of the data
> tends to take less than 5-10% of the total size.
> 
> Buffering the checkpoint image in kernel is useful to reduce downtime
> during checkpoint, and also useful for super-fast rollback of a task
> (or container) by always keeping everything in memory.

I agree that buffering metadata and references to COWable structures is 
useful, but this doesn't require a new allocator.

> 
> This abstraction is also useful for restart, e.g. to implement read-ahead
> of the checkpoint image into the kernel.

I'm not sure what you mean by checkpoint image read-ahead.  The data is 
in a stream format and not a randomly accessible packet format.

> 
> Note also that in the future we will have larger headers (e.g. to record
> the state of a socket), 

This is a bit off-topic for this patch series, but what are your plans
for socket serialization?

FWIW, looking at the socket problem for our purposes, we expect to throw
away all network sockets (and internal state including queued rx/tx
data) at checkpoint (replacing them with sockets that return ECONNRESET
at restore time) and rely on userland exception handling to re-establish
RPC channels.  This failure mode looks exactly the same as a network
partition/machine restart/application crash which our applications need
to handle already.
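
A minimal userland sketch of the recovery path described above, purely
illustrative: 'struct rpc_chan', rpc_send_retry() and the fake transport
are hypothetical names invented here, not part of any existing RPC
library.  The only assumption carried over from the text is that a
restored socket reports ECONNRESET and the application reacts by
re-establishing the channel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical RPC channel; the callbacks stand in for the
 * application's real transport.  Both return 0 or a negative errno. */
struct rpc_chan {
	int (*send)(void *conn, const void *msg, size_t len);
	int (*reconnect)(void *conn);
	void *conn;
};

/*
 * Send 'msg'.  If the connection was reset (by the peer, a network
 * partition, or a restart from a checkpoint), re-establish the channel
 * and retry, at most 'max_retries' times.
 */
int rpc_send_retry(struct rpc_chan *ch, const void *msg, size_t len,
		   int max_retries)
{
	int err = ch->send(ch->conn, msg, len);

	while (err == -ECONNRESET && max_retries-- > 0) {
		err = ch->reconnect(ch->conn);
		if (err)
			break;
		err = ch->send(ch->conn, msg, len);
	}
	return err;
}

/* Fake transport for demonstration: sends fail with ECONNRESET until
 * the connection has been re-established. */
struct fake_conn {
	int reset;
	int reconnects;
};

int fake_send(void *conn, const void *msg, size_t len)
{
	struct fake_conn *fc = conn;

	(void)msg;
	(void)len;
	return fc->reset ? -ECONNRESET : 0;
}

int fake_reconnect(void *conn)
{
	struct fake_conn *fc = conn;

	fc->reset = 0;
	fc->reconnects++;
	return 0;
}
```

The caller cannot tell a restart apart from any other reset, which is
the point: no checkpoint-specific handling is needed in the application.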

> and there may be some nested calls (e.g. to dump
> a connected unix-domain socket we will want to first save the "parent"
> listening socket, and also there is nesting in restart).
> 

Right, nesting can be a problem, so maybe on-stack storage isn't the
best way to handle the header allocations, but again this doesn't
necessarily mean we need a new object allocation scheme.

> Instead of using the stack for some headers and memory allocation for
> other headers, this abstraction provides a standard interface for all
> checkpoint/restart code (and the actual implementation may vary for
> different purposes).

At a minimum, I would expect that cr_hbuf_put() should take a pointer 
rather than a size argument.
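
To illustrate the pointer-based variant, here is a small userspace
sketch.  It assumes a stack-like reservation buffer, as the quoted
comment about ctx->hbuf implies; 'struct hbuf', hbuf_get(), hbuf_put()
and HBUF_TOTAL are hypothetical stand-ins, not the functions from the
patch.  Passing the pointer back lets the put side check that the
caller is releasing something that is actually reserved, which a bare
size cannot do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HBUF_TOTAL 256		/* hypothetical capacity */

struct hbuf {
	char mem[HBUF_TOTAL];
	size_t pos;		/* first free byte; below it is reserved */
};

/* Reserve 'n' bytes; returns NULL if the buffer would overflow. */
void *hbuf_get(struct hbuf *h, size_t n)
{
	void *p;

	if (n > HBUF_TOTAL - h->pos)
		return NULL;
	p = h->mem + h->pos;
	h->pos += n;
	return p;
}

/*
 * Release by pointer: everything from 'p' upward is returned to the
 * buffer, so nested reservations unwind in LIFO order.  Returns 0 on
 * success, or -1 if 'p' does not lie inside the reserved region (for
 * example a double put).
 */
int hbuf_put(struct hbuf *h, void *p)
{
	uintptr_t addr = (uintptr_t)p;
	uintptr_t base = (uintptr_t)h->mem;

	if (addr < base || addr - base >= h->pos)
		return -1;
	h->pos = (size_t)(addr - base);
	return 0;
}
```

With a size argument, an unbalanced get/put pair silently shifts the
stack position; with a pointer, the second put of the same reservation
in this sketch fails instead.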

> 
>> In that case, cr_hbuf_put() stops doing anything at all because we keep
>> the memory around.
>>
>> cr_hbuf_get() becomes, "I need some memory to write some checkpointy
>> things into".
>>
>> cr_hbuf_put() becomes, "I'm done with this for now, only keep it if
>> someone else needs it."
>>
>> This might all be a lot clearer if we just kept some more explicit
>> accounting around about who is using the objects.  Something like:
>>
>> struct cr_buf {
>> 	struct kref ref;
>> 	int size;
>> 	char buf[0];
>> };
>>
>> /* replaces cr_hbuf_get() */
>> struct cr_buf *alloc_cr_buf(int size, gfp_t flags)
>> {
>> 	struct cr_buf *buf;
>>
>> 	buf = kmalloc(sizeof(cr_buf) + size, flags);
>> 	if (!buf)
>> 		return NULL;
>> 	buf->ref = 1; /* or whatever */
>> 	buf->size = size;
>> 	return buf;
>> }
>>
>> int cr_kwrite(struct cr_buf *buf)
>> {
>> 	if (writing_checkpoint_now) {
>> 		// or whatever this write call was...
>> 		vfs_write(&buf->buf[0], buf->size);
>> 	} else if (deferring_write) {		
>> 		kref_get(buf->kref);
>> 	}
>> }
> 
> Yes, something like that, except you can do without the reference count
> since the buffer is tied to 'struct cr_ctx'; so this is what I had in
> mind:
> 
> In non-buffering mode - as it is now - cr_kwrite() will write the data out.
> 
> In buffering mode (not implemented yet), cr_write() will either do nothing
> (if we borrow from the current code, that uses a temp buffer), or attach
> the 'struct cr_buf' to the 'struct cr_ctx' (if we borrow Dave's suggestion
> above).
> 
> In buffering mode, we'll also need 'cr_writeout()' which will write out
> the entire buffer to the 'struct file'.  (This function will do nothing
> in non-buffering mode).
> 
> Finally, the buffers will be freed when the 'struct cr_ctx' is cleaned.

This smells like a memory pool abstraction and probably shouldn't be 
specific to cr_ctx at all.

As I mentioned earlier though, you still need to walk the entire 
buffered structure in 'buffered' mode to release COWable object 
references.  Adding a kfree in that path makes the code much easier to 
understand and places the exception/cleanup path in-band rather than 
out-of-band with a memory pool.
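
To make the in-band cleanup concrete, here is a minimal user-space sketch
(not code from the patch set). It assumes each buffered record holds one
reference on a COWable object; 'struct cow_obj', 'struct rec', rec_push()
and rec_release_all() are hypothetical names, and malloc()/free() stand in
for kmalloc()/kfree().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical COWable object (e.g. a page) with a plain reference count. */
struct cow_obj {
	int refs;
};

/* One buffered record holding a reference on the object it describes. */
struct rec {
	struct rec *next;
	struct cow_obj *obj;
};

/* Queue a record for @obj and take a reference; 0 on success, -1 on ENOMEM. */
static int rec_push(struct rec **head, struct cow_obj *obj)
{
	struct rec *r = malloc(sizeof(*r));

	if (!r)
		return -1;
	obj->refs++;
	r->obj = obj;
	r->next = *head;
	*head = r;
	return 0;
}

/* Walk the list once: drop each reference and free the record in that pass. */
static int rec_release_all(struct rec *head)
{
	int n = 0;

	while (head) {
		struct rec *next = head->next;

		assert(head->obj->refs > 0);
		head->obj->refs--;	/* release the COW reference */
		free(head);		/* the in-band free (kfree() in the kernel) */
		head = next;
		n++;
	}
	return n;
}
```

The same walk serves both the normal path and the error path, so there is
no separate pool teardown to keep in sync with it.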

Mike Waychison

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
       [not found]           ` <49484AE2.3000007-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2008-12-17  2:08             ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-17  2:08 UTC (permalink / raw)
  To: Mike Waychison
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linux Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar



Mike Waychison wrote:
> Oren Laadan wrote:
>>
>> Dave Hansen wrote:
>>> On Tue, 2008-12-16 at 13:54 -0800, Mike Waychison wrote:
>>>> Oren Laadan wrote:
>>>>> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
>>>>> index 375129c..bd14ef9 100644
>>>>> --- a/checkpoint/sys.c
>>>>> +++ b/checkpoint/sys.c
>>>>> +/*
>>>>> + * During checkpoint and restart the code writes out/reads in data
>>>>> + * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
>>>>> + * Because operations can be nested, use cr_hbuf_get() to reserve space
>>>>> + * in the buffer, then cr_hbuf_put() when you no longer need that space.
>>>>> + */
>>>> This seems a bit overkill for buffer management, no?  The only large
>>>> header seems to be cr_hdr_head and the blowup comes from utsinfo
>>>> string data (which could easily be moved out to be in its own
>>>> CR_HDR_STRING blocks).
>>>>
>>>> Wouldn't it be easier to use stack-local storage than balancing the
>>>> cr_hbuf_get/put routines?
>>> I've asked the same question, so I'll give you Oren's response that I
>>> remember:
>>>
>>> cr_hbuf_get/put() are more of an API that we can use later.  For now,
>>> those buffers really are temporary.  But, in a case where we want to do
>>> a really fast checkpoint (to reduce "downtime" during the checkpoint) we
>>> store the image entirely in kernel memory to be written out later.
>>
>> Precisely.
>>
>> Note that by "store the image entirely" we mean everything that cannot
>> be saved COW - so memory pages are not duplicated; the rest of the data
>> tends to take less than 5-10% of the total size.
>>
>> Buffering the checkpoint image in kernel is useful to reduce downtime
>> during checkpoint, and also useful for super-fast rollback of a task
>> (or container) by always keeping everything in memory.
> 
> I agree that buffering metadata and references to COWable structures is
> useful, but this doesn't require a new allocator.

First, many thanks for the comments!

The idea is to hide the details behind a pair of operations: one to get a
buffer, and the other to "deposit" the data (in non-buffering mode this
means write out and free).
My first goal is to have all code use a standard method (cr_hbuf_get/put),
the specific implementation of which may vary for buffering/non-buffering
cases, and may change.
(In the code, I used a "static" common buffer attached to 'struct cr_ctx'
for simplicity and speed, and because it is similar in idea to how the
buffering mode could be implemented.)
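
A minimal user-space sketch of such a shared buffer, assuming a simple
stack discipline. This is an illustration, not the code from the patch:
'struct hbuf_ctx', hbuf_get(), hbuf_put() and HBUF_TOTAL are hypothetical
stand-ins for the relevant part of 'struct cr_ctx' and cr_hbuf_get/put().

```c
#include <assert.h>
#include <stddef.h>

#define HBUF_TOTAL 4096	/* assumed size of the shared header buffer */

/* Hypothetical stand-in for the relevant part of 'struct cr_ctx'. */
struct hbuf_ctx {
	char hbuf[HBUF_TOTAL];	/* temporary buffer for headers */
	size_t hpos;		/* bytes currently reserved */
};

/* Reserve @n bytes at the top of the buffer; NULL if they do not fit. */
static void *hbuf_get(struct hbuf_ctx *ctx, size_t n)
{
	void *p;

	if (n > HBUF_TOTAL - ctx->hpos)
		return NULL;
	p = ctx->hbuf + ctx->hpos;
	ctx->hpos += n;
	return p;
}

/* Release the most recent @n bytes; calls must mirror hbuf_get() in LIFO order. */
static void hbuf_put(struct hbuf_ctx *ctx, size_t n)
{
	assert(n <= ctx->hpos);
	ctx->hpos -= n;
}
```

Nested users simply call hbuf_get() again before the outer caller has
called hbuf_put(). Because releases happen in reverse order, a size is
enough to undo a reservation and no pointer bookkeeping is needed. The
sketch ignores alignment.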

> 
>>
>> This abstraction is also useful for restart, e.g. to implement read-ahead
>> of the checkpoint image into the kernel.
> 
> I'm not sure what you mean by checkpoint image read-ahead.  The data is
> in a stream format and not a randomly accessible packet format.

I mean that for efficiency, rather than reading many small chunks of data
(header by header), we could read-ahead a large buffer into kernel memory
and then have cr_hbuf_get() provide a pointer to the right position, and
cr_hbuf_put() do nothing.
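
As a rough user-space illustration of that read-ahead variant (an
assumption about how it could look, not existing code): 'struct ra_ctx',
ra_hbuf_get() and ra_hbuf_put() are hypothetical names, and the image is
assumed to be fully read into memory already.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical restart-side context: a large piece of the image in memory. */
struct ra_ctx {
	const char *image;	/* read-ahead data */
	size_t len;		/* valid bytes in @image */
	size_t pos;		/* next unread byte */
};

/* Return a pointer to the next @n bytes of the image, or NULL if short. */
static const void *ra_hbuf_get(struct ra_ctx *ctx, size_t n)
{
	const void *p;

	assert(ctx->pos <= ctx->len);
	if (n > ctx->len - ctx->pos)
		return NULL;	/* a real version would refill the buffer here */
	p = ctx->image + ctx->pos;
	ctx->pos += n;
	return p;
}

/* Nothing to release: the data stays in the read-ahead buffer. */
static void ra_hbuf_put(struct ra_ctx *ctx, size_t n)
{
	(void)ctx;
	(void)n;
}
```

The stream is still consumed strictly in order; the only change is that
one large read replaces many small ones.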

>>
>> Note also that in the future we will have larger headers (e.g. to record
>> the state of a socket), 
> 
> This is a bit off-topic for this patch series, but what are your intents
> on socket serialization?

I should have been more precise - I was referring to unix domain sockets.

> 
> FWIW looking at the socket problem for our purposes, we expect to throw
> away all network sockets (and internal state including queued rx/tx
> data) at checkpoint (replacing them with sockets that return ECONNRESET
> at restore time) and rely on userland exception handling to re-establish
> RPC channels.  This failure mode looks exactly the same as a network
> partition/machine restart/application crash which our applications need
> to handle already.

Yes.

But ... we also have to support non-TCP and general connection-less
sockets too.

Also, there is the case of INET connections that are confined to the
container - connections to localhost (or local IP): since they do not
depend on the outside world, we should restore them as is.

Finally, there is the case of live migration, where you want to migrate the
network connections too, and then there is quite a bit of associated state.
But that is far down the road...

> 
>> and there may be some nested calls (e.g. to dump
>> a connected unix-domain socket we will want to first save the "parent"
>> listening socket, and also there is nesting in restart).
>>
> 
> Right, nesting can be a problem, so maybe on stack isn't the best way to
> handle the header allocations, but again this doesn't necessarily mean
> we need a new object allocation scheme.
> 
>> Instead of using the stack for some headers and memory allocation for
>> other headers, this abstraction provides a standard interface for all
>> checkpoint/restart code (and the actual implementation may vary for
>> different purposes).
> 
> At a minimum, I would expect that cr_hbuf_put() should take a pointer
> rather than a size argument.

I agree that this is nicer in terms of API. However, requiring the
size (oh, well, we have it already) sometimes allows a simpler and more
efficient implementation, since the code need not keep track of pointers.

> 
>>
>>> In that case, cr_hbuf_put() stops doing anything at all because we keep
>>> the memory around.
>>>
>>> cr_hbuf_get() becomes, "I need some memory to write some checkpointy
>>> things into".
>>>
>>> cr_hbuf_put() becomes, "I'm done with this for now, only keep it if
>>> someone else needs it."
>>>
>>> This might all be a lot clearer if we just kept some more explicit
>>> accounting around about who is using the objects.  Something like:
>>>
>>> struct cr_buf {
>>>     struct kref ref;
>>>     int size;
>>>     char buf[0];
>>> };
>>>

[...]

>>
>> Yes, something like that, except you can do without the reference count
>> since the buffer is tied to 'struct cr_ctx'; so this is what I had in
>> mind:
>>
>> In non-buffering mode - as it is now - cr_kwrite() will write the data out.
>>
>> In buffering mode (not implemented yet), cr_kwrite() will either do nothing
>> (if we borrow from the current code, that uses a temp buffer), or attach
>> the 'struct cr_buf' to the 'struct cr_ctx' (if we borrow Dave's suggestion
>> above).
>>
>> In buffering mode, we'll also need 'cr_writeout()' which will write out
>> the entire buffer to the 'struct file'.  (This function will do nothing
>> in non-buffering mode).
>>
>> Finally, the buffers will be freed when the 'struct cr_ctx' is cleaned.
> 
> This smells like a memory pool abstraction and probably shouldn't be
> specific to cr_ctx at all.

Is there something existing that I can use?

In Zap I use a chain of large chunks of memory ("large" can even be
MBs per chunk, allocated with vmalloc). cr_hbuf_get() returns pointers
into these chunks. With successive checkpoints, the initial guess as
to the desired total size is based on the last checkpoint. Cleanup is
very fast. Write-out is also fast - write chunks in bulk, not small
buffers one by one. Eventually this allows a faster-than-once-a-second
checkpoint without degrading applications' response time. (Of course,
there are other optimizations involved, but the improvement was measurable.)
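
A user-space sketch of such a chunk chain, to show the shape of the idea.
It is not the Zap code: all names here are hypothetical, malloc() stands
in for vmalloc(), and fwrite() stands in for the kernel write path.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* One large chunk; malloc() stands in for vmalloc() in this sketch. */
struct chunk {
	struct chunk *next;
	size_t size;	/* capacity of data[] */
	size_t used;	/* bytes handed out so far */
	char data[];
};

struct chunk_pool {
	struct chunk *head, *tail;
	size_t chunk_size;	/* default capacity for new chunks */
};

static struct chunk *chunk_new(size_t size)
{
	struct chunk *c = malloc(sizeof(*c) + size);

	if (!c)
		return NULL;
	c->next = NULL;
	c->size = size;
	c->used = 0;
	return c;
}

/* Hand out @n bytes from the last chunk, chaining a new chunk when full. */
static void *pool_get(struct chunk_pool *pool, size_t n)
{
	struct chunk *c = pool->tail;
	void *p;

	assert(pool->chunk_size > 0);
	if (!c || n > c->size - c->used) {
		c = chunk_new(n > pool->chunk_size ? n : pool->chunk_size);
		if (!c)
			return NULL;
		if (pool->tail)
			pool->tail->next = c;
		else
			pool->head = c;
		pool->tail = c;
	}
	p = c->data + c->used;
	c->used += n;
	return p;
}

/* Write each chunk with a single call; 0 on success, -1 on a short write. */
static int pool_writeout(const struct chunk_pool *pool, FILE *out)
{
	const struct chunk *c;

	for (c = pool->head; c; c = c->next)
		if (fwrite(c->data, 1, c->used, out) != c->used)
			return -1;
	return 0;
}

/* Free whole chunks; returns bytes used, a size hint for the next checkpoint. */
static size_t pool_free(struct chunk_pool *pool)
{
	struct chunk *c = pool->head, *next;
	size_t total = 0;

	for (; c; c = next) {
		next = c->next;
		total += c->used;
		free(c);
	}
	pool->head = pool->tail = NULL;
	return total;
}
```

Cleanup and write-out both cost one operation per chunk rather than one
per header, and the value returned by pool_free() can seed chunk_size for
the next checkpoint.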

> 
> As I mentioned earlier though, you still need to walk the entire
> buffered structure in 'buffered' mode to release COWable object
> references.  Adding a kfree in that path makes the code much easier to
> understand and places the exception/cleanup path in-band rather than
> out-of-band with a memory pool.

For the record, every cr_hbuf_get() is matched by a corresponding
cr_hbuf_put() exactly for this reason - so cleanup is correct even
if we change the implementation of cr_hbuf_... (as suggested by
Dave Hansen).

But actually, why do we need to keep the references to COWable objects
(currently only memory pages) in the buffer?
They are already kept in a separate structure (cr_ctx->pgarr_list).
So when we finally write out the data, we first flush the entire buffer
(no memory contents!), followed by the memory contents. The restart
code will be modified as well.

I deliberately avoided all these optimizations at this early stage... but
tried to ensure that they can be "easily" added later.

Oren.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart
@ 2008-12-17  2:08             ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-17  2:08 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Dave Hansen, jeremy, arnd, linux-api, containers, linux-kernel,
	linux-mm, Linux Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar



Mike Waychison wrote:
> Oren Laadan wrote:
>>
>> Dave Hansen wrote:
>>> On Tue, 2008-12-16 at 13:54 -0800, Mike Waychison wrote:
>>>> Oren Laadan wrote:
>>>>> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
>>>>> index 375129c..bd14ef9 100644
>>>>> --- a/checkpoint/sys.c
>>>>> +++ b/checkpoint/sys.c
>>>>> +/*
>>>>> + * During checkpoint and restart the code writes out/reads in data
>>>>> + * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
>>>>> + * Because operations can be nested, use cr_hbuf_get() to reserve space
>>>>> + * in the buffer, then cr_hbuf_put() when you no longer need that space.
>>>>> + */
>>>> This seems a bit overkill for buffer management, no?  The only large
>>>> header seems to be cr_hdr_head and the blowup comes from utsinfo
>>>> string data (which could easily be moved out to be in its own
>>>> CR_HDR_STRING blocks).
>>>>
>>>> Wouldn't it be easier to use stack-local storage than balancing the
>>>> cr_hbuf_get/put routines?
>>> I've asked the same question, so I'll give you Oren's response that I
>>> remember:
>>>
>>> cr_hbuf_get/put() are more of an API that we can use later.  For now,
>>> those buffers really are temporary.  But, in a case where we want to do
>>> a really fast checkpoint (to reduce "downtime" during the checkpoint) we
>>> store the image entirely in kernel memory to be written out later.
>>
>> Precisely.
>>
>> Note that by "store the image entirely" we mean everything that cannot
>> be saved COW - so memory pages are not duplicated; the rest of the data
>> tends to take less than 5-10% of the total size.
>>
>> Buffering the checkpoint image in kernel is useful to reduce downtime
>> during checkpoint, and also useful for super-fast rollback of a task
>> (or container) by always keeping everything in memory.
> 
> I agree that buffering metadata and references to COWable structures is
> useful, but this doesn't require a new allocator.

First, many thanks for the comments !

The idea is to hide the details behind a pair of operations: one to get a
buffer, and the other to "deposit" the data (in non-buffering mode this
means write out and free).
My first goal is to have all code use a standard method (cr_hbuf_get/put);
the specific implementation of which may vary for buffering/non-buffering
cases, and may change.
(In the code, I used a "static" common buffer attached to 'struct cr_ctx'
for simplicity and speed, and because it is similar in idea to how the
buffering mode could be implemented).
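
To make the get/put pairing concrete, here is a minimal user-space sketch of
a stack-like header buffer. It is only an illustration of the idea, not the
code from the patch: 'struct hbuf_ctx', hbuf_get(), hbuf_put() and HBUF_TOTAL
are hypothetical names, and alignment is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define HBUF_TOTAL 4096		/* assumed size of the shared buffer */

/* Simplified stand-in for the checkpoint context: only the header buffer. */
struct hbuf_ctx {
	char hbuf[HBUF_TOTAL];
	size_t hpos;		/* first free byte in hbuf */
};

/* Reserve n bytes at the top of the buffer; NULL if it would overflow. */
static void *hbuf_get(struct hbuf_ctx *ctx, size_t n)
{
	void *ptr;

	if (n > HBUF_TOTAL - ctx->hpos)
		return NULL;
	ptr = &ctx->hbuf[ctx->hpos];
	ctx->hpos += n;
	return ptr;
}

/* Release the most recent n bytes; calls must mirror hbuf_get() in reverse. */
static void hbuf_put(struct hbuf_ctx *ctx, size_t n)
{
	assert(n <= ctx->hpos);
	ctx->hpos -= n;
}
```

Because nested users release in reverse order, the put side only needs a
size and never has to track individual pointers.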

> 
>>
>> This abstraction is also useful for restart, e.g. to implement read-ahead
>> of the checkpoint image into the kernel.
> 
> I'm not sure what you mean by checkpoint image read-ahead.  The data is
> in a stream format and not a randomly accessible packet format.

I mean that for efficiency, rather than reading many small chunks of data
(header by header), we could read-ahead a large buffer into kernel memory
and then have cr_hbuf_get() provide a pointer to the right position, and
cr_hbuf_put() do nothing.
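
A rough user-space sketch of that read-ahead variant, under the assumption
that the whole image (or a large window of it) already sits in memory. The
names 'struct ra_ctx', ra_hbuf_get() and ra_hbuf_put() are made up for the
illustration and do not exist in the patch set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical restart-side context: the stream was read ahead into 'image'. */
struct ra_ctx {
	const char *image;	/* read-ahead copy of the checkpoint stream */
	size_t len;		/* bytes available in 'image' */
	size_t pos;		/* next unread byte */
};

/* Hand out the next n bytes of the stream in place; NULL if data ran out. */
static const void *ra_hbuf_get(struct ra_ctx *ctx, size_t n)
{
	const void *ptr;

	assert(ctx->pos <= ctx->len);
	if (n > ctx->len - ctx->pos)
		return NULL;
	ptr = ctx->image + ctx->pos;
	ctx->pos += n;
	return ptr;
}

/* Nothing to release: the data stays in the read-ahead buffer. */
static void ra_hbuf_put(struct ra_ctx *ctx, size_t n)
{
	(void)ctx;
	(void)n;
}
```

The callers keep the same get/put pairing as in the non-buffering case; only
the implementation behind the two calls changes.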

>>
>> Note also that in the future we will have larger headers (e.g. to record
>> the state of a socket), 
> 
> This is a bit off-topic for this patch series, but what are your intents
> on socket serialization?

I should have been more precise - I was referring to unix domain sockets.

> 
> FWIW looking at the socket problem for our purposes, we expect to throw
> away all network sockets (and internal state including queued rx/tx
> data) at checkpoint (replacing them with sockets that return ECONNRESET
> at restore time) and rely on userland exception handling to re-establish
> RPC channels.  This failure mode looks exactly the same as a network
> partition/machine restart/application crash which our applications need
> to handle already.

Yes.

But ... we also have to support non-TCP and general connection-less
sockets too.

Also, there is the case of INET connections that are confined to the
container - connections to localhost (or local IP): since they do not
depend on the outside world, we should restore them as is.

Finally, there is the case of live migration, where you want to migrate the
network connections too, and there is quite a bit of state associated with
them. But that is far down the road...

> 
>> and there may be some nested calls (e.g. to dump
>> a connected unix-domain socket we will want to first save the "parent"
>> listening socket, and also there is nesting in restart).
>>
> 
> Right, nesting can be a problem, so maybe on stack isn't the best way to
> handle the header allocations, but again this doesn't necessarily mean
> we need a new object allocation scheme.
> 
>> Instead of using the stack for some headers and memory allocation for
>> other headers, this abstraction provides a standard interface for all
>> checkpoint/restart code (and the actual implementation may vary for
>> different purposes).
> 
> At a minimum, I would expect that cr_hbuf_put() should take a pointer
> rather than a size argument.

I agree that this is nicer in terms of API. However, requiring the
size (oh, well, we have it already) sometimes allows a simpler and more
efficient implementation, since the code need not keep track of pointers.

> 
>>
>>> In that case, cr_hbuf_put() stops doing anything at all because we keep
>>> the memory around.
>>>
>>> cr_hbuf_get() becomes, "I need some memory to write some checkpointy
>>> things into".
>>>
>>> cr_hbuf_put() becomes, "I'm done with this for now, only keep it if
>>> someone else needs it."
>>>
>>> This might all be a lot clearer if we just kept some more explicit
>>> accounting around about who is using the objects.  Something like:
>>>
>>> struct cr_buf {
>>>     struct kref ref;
>>>     int size;
>>>     char buf[0];
>>> };
>>>

[...]

>>
>> Yes, something like that, except you can do without the reference count
>> since the buffer is tied to 'struct cr_ctx'; so this is what I had in
>> mind:
>>
>> In non-buffering mode - as it is now - cr_kwrite() will write the data
>> out.
>>
>> In buffering mode (not implemented yet), cr_write() will either do
>> nothing
>> (if we borrow from the current code, that uses a temp buffer), or attach
>> the 'struct cr_buf' to the 'struct cr_ctx' (if we borrow Dave's
>> suggestion
>> above).
>>
>> In buffering mode, we'll also need 'cr_writeout()' which will write out
>> the entire buffer to the 'struct file'.  (This function will do nothing
>> in non-buffering mode).
>>
>> Finally, the buffers will be freed when the 'struct cr_ctx' is cleaned.
> 
> This smells like a memory pool abstraction and probably shouldn't be
> specific to cr_ctx at all.

Is there something else existing that I can use ?

In Zap I use a chain of large chunks of memory ("large" can even be
MBs each chunk, allocated with vmalloc). cr_hbuf_get() returns pointers
into these chunks. With successive checkpoints, the initial guess as
to the desired total size is based on the last checkpoint. Cleanup is
very fast. Write-out is also fast - write chunks in bulk, not small
buffers one by one. Eventually this allows a faster-than-once-a-second
checkpoint without degrading applications' response time. (Of course,
there are other optimizations involved, but the improvement was measurable.)
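
For illustration only, a chain of chunks along these lines could look like
the user-space sketch below. It uses malloc() where the text mentions
vmalloc, and 'struct chunk', 'struct chunk_pool', pool_get() and
pool_destroy() are hypothetical names, not Zap's actual code.

```c
#include <assert.h>
#include <stdlib.h>

struct chunk {
	struct chunk *next;
	size_t size;		/* usable bytes in data[] */
	size_t used;		/* bytes handed out so far */
	char data[];
};

struct chunk_pool {
	struct chunk *head;	/* most recently added chunk */
	size_t chunk_size;	/* default size for new chunks */
	size_t total;		/* bytes handed out; hint for the next checkpoint */
};

/* Return n bytes from the current chunk, chaining a new chunk when full. */
static void *pool_get(struct chunk_pool *p, size_t n)
{
	struct chunk *c = p->head;
	void *ptr;

	assert(p->chunk_size > 0);
	if (!c || n > c->size - c->used) {
		size_t size = n > p->chunk_size ? n : p->chunk_size;

		c = malloc(sizeof(*c) + size);
		if (!c)
			return NULL;
		c->next = p->head;
		c->size = size;
		c->used = 0;
		p->head = c;
	}
	ptr = c->data + c->used;
	c->used += n;
	p->total += n;
	return ptr;
}

/* Free every chunk in one pass; returns the number of chunks released. */
static int pool_destroy(struct chunk_pool *p)
{
	int freed = 0;

	while (p->head) {
		struct chunk *c = p->head;

		p->head = c->next;
		free(c);
		freed++;
	}
	return freed;
}
```

In this sketch the chain is newest-first, so an in-order write-out would
need a tail pointer or a reversal pass; 'total' is the kind of value that
could seed the size guess for the next checkpoint.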

> 
> As I mentioned earlier though, you still need to walk the entire
> buffered structure in 'buffered' mode to release COWable object
> references.  Adding a kfree in that path makes the code much easier to
> understand and places the exception/cleanup path in-band rather than
> out-of-band with a memory pool.

For the record, every cr_hbuf_get() is matched by a corresponding
cr_hbuf_put() exactly for this reason - so cleanup is correct even
if we change the implementation of cr_hbuf_... (as suggested by
Dave Hansen).

But actually, why do we need to keep the references to COWable objects
(currently - only memory pages) in the buffer ?
They are kept already on a separate structure (cr_ctx->pgarr_list).
So when we finally write out the data, we first flush the entire buffer
(no memory contents!), followed by the memory contents. The restart
code will be modified as well.

I deliberately avoided all these optimizations at this early stage... but
tried to ensure that they can be "easily" added later.

Oren.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 04/13] x86 support for checkpoint/restart
       [not found]   ` <1228498282-11804-5-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-12-17  2:19     ` Mike Waychison
  0 siblings, 0 replies; 133+ messages in thread
From: Mike Waychison @ 2008-12-17  2:19 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

Oren Laadan wrote:
> Add logic to save and restore architecture specific state, including
> thread-specific state, CPU registers and FPU state.
> 
> In addition, architecture capabilities are saved in an architecure
> specific extension of the header (cr_hdr_head_arch); Currently this
> includes only FPU capabilities.
> 
> Currently only x86-32 is supported. Compiling on x86-64 will trigger
> an explicit error.
> 
> Changelog[v9]:
>   - Add arch-specific header that details architecture capabilities;
>     split FPU restore to send capabilities only once.
>   - Test for zero TLS entries in cr_write_thread()
>   - Fix asm/checkpoint_hdr.h so it can be included from user-space
> 
> Changelog[v7]:
>   - Fix save/restore state of FPU
> 
> Changelog[v5]:
>   - Remove preempt_disable() when restoring debug registers
> 
> Changelog[v4]:
>   - Fix header structure alignment
> 
> Changelog[v2]:
>   - Pad header structures to 64 bits to ensure compatibility
>   - Follow Dave Hansen's refactoring of the original post
> 
> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> ---
>  arch/x86/include/asm/checkpoint_hdr.h |   85 ++++++++++++
>  arch/x86/mm/Makefile                  |    2 +
>  arch/x86/mm/checkpoint.c              |  223 +++++++++++++++++++++++++++++++
>  arch/x86/mm/restart.c                 |  232 +++++++++++++++++++++++++++++++++
>  checkpoint/checkpoint.c               |   19 +++-
>  checkpoint/checkpoint_arch.h          |    9 ++
>  checkpoint/restart.c                  |   17 ++-
>  include/linux/checkpoint_hdr.h        |    2 +
>  8 files changed, 583 insertions(+), 6 deletions(-)
>  create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
>  create mode 100644 arch/x86/mm/checkpoint.c
>  create mode 100644 arch/x86/mm/restart.c
>  create mode 100644 checkpoint/checkpoint_arch.h
> 
> diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
> new file mode 100644
> index 0000000..6325062
> --- /dev/null
> +++ b/arch/x86/include/asm/checkpoint_hdr.h
> @@ -0,0 +1,85 @@
> +#ifndef __ASM_X86_CKPT_HDR_H
> +#define __ASM_X86_CKPT_HDR_H
> +/*
> + *  Checkpoint/restart - architecture specific headers x86
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/types.h>
> +
> +/* i387 structure seen from kernel/userspace */
> +#ifdef __KERNEL__
> +#include <asm/processor.h>
> +#else
> +#include <sys/user.h>
> +#endif
> +
> +struct cr_hdr_head_arch {
> +	/* FIXME: add HAVE_HWFP */
> +
> +	__u16 has_fxsr;
> +	__u16 has_xsave;
> +	__u16 xstate_size;
> +	__u16 _pading;
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_thread {
> +	/* FIXME: restart blocks */
> +
> +	__s16 gdt_entry_tls_entries;
> +	__s16 sizeof_tls_array;
> +	__s16 ntls;	/* number of TLS entries to follow */
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_cpu {
> +	/* see struct pt_regs (x86-64) */
> +	__u64 r15;
> +	__u64 r14;
> +	__u64 r13;
> +	__u64 r12;
> +	__u64 bp;
> +	__u64 bx;
> +	__u64 r11;
> +	__u64 r10;
> +	__u64 r9;
> +	__u64 r8;
> +	__u64 ax;
> +	__u64 cx;
> +	__u64 dx;
> +	__u64 si;
> +	__u64 di;
> +	__u64 orig_ax;
> +	__u64 ip;
> +	__u64 cs;
> +	__u64 flags;
> +	__u64 sp;
> +	__u64 ss;
> +
> +	/* segment registers */
> +	__u64 ds;
> +	__u64 es;
> +	__u64 fs;
> +	__u64 gs;
> +
> +	/* debug registers */
> +	__u64 debugreg0;
> +	__u64 debugreg1;
> +	__u64 debugreg2;
> +	__u64 debugreg3;
> +	__u64 debugreg4;
> +	__u64 debugreg5;
> +	__u64 debugreg6;
> +	__u64 debugreg7;
> +
> +	__u32 uses_debug;
> +	__u32 used_math;
> +
> +	/* thread_xstate contents follow (if used_math) */
> +} __attribute__((aligned(8)));
> +
> +#endif /* __ASM_X86_CKPT_HDR__H */
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index fea4565..6527ea2 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -18,3 +18,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
>  obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
>  
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
> +
> +obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
> diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
> new file mode 100644
> index 0000000..8dd6d2d
> --- /dev/null
> +++ b/arch/x86/mm/checkpoint.c
> @@ -0,0 +1,223 @@
> +/*
> + *  Checkpoint/restart - architecture specific support for x86
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <asm/desc.h>
> +#include <asm/i387.h>
> +
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
> +/* dump the thread_struct of a given task */
> +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct thread_struct *thread;
> +	struct desc_struct *desc;
> +	int ntls = 0;
> +	int n, ret;
> +
> +	h.type = CR_HDR_THREAD;
> +	h.len = sizeof(*hh);
> +	h.parent = task_pid_vnr(t);
> +
> +	thread = &t->thread;
> +
> +	/* calculate no. of TLS entries that follow */
> +	desc = thread->tls_array;
> +	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
> +		if (desc->a || desc->b)
> +			ntls++;
> +	}
> +
> +	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
> +	hh->sizeof_tls_array = sizeof(thread->tls_array);
> +	hh->ntls = ntls;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		return ret;
> +
> +	cr_debug("ntls %d\n", ntls);
> +	if (ntls == 0)
> +		return 0;
> +
> +	/* for simplicity dump the entire array, cherry-pick upon restart */
> +	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));

Again, the TLS descriptors in the GDT should be called out and not 
tied to the in-kernel representation.
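
One possible shape for this, as a user-space sketch: decode each descriptor
into explicit fixed-width fields before writing it to the image. The record
layout 'struct tls_image_entry' and the helper tls_decode() are hypothetical
and only loosely modelled on struct user_desc; the bit positions follow the
standard x86 segment descriptor layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical image record for one TLS slot, independent of desc_struct. */
struct tls_image_entry {
	uint32_t entry_number;
	uint32_t base_addr;
	uint32_t limit;
	uint8_t seg_32bit;
	uint8_t limit_in_pages;
	uint8_t seg_not_present;
	uint8_t _padding;
};

/* Decode the two 32-bit words (a = low, b = high) of a segment descriptor. */
static void tls_decode(uint32_t a, uint32_t b, uint32_t entry,
		       struct tls_image_entry *out)
{
	assert(out);
	out->entry_number = entry;
	out->base_addr = (a >> 16) | ((b & 0xffu) << 16) | (b & 0xff000000u);
	out->limit = (a & 0xffffu) | (b & 0x000f0000u);
	out->seg_32bit = (b >> 22) & 1;			/* D/B bit */
	out->limit_in_pages = (b >> 23) & 1;		/* G bit */
	out->seg_not_present = !((b >> 15) & 1);	/* inverse of P bit */
	out->_padding = 0;
}
```

A record like this could be validated field by field on restart instead of
being copied back into thread->tls_array as raw bytes.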

> +
> +	/* IGNORE RESTART BLOCKS FOR NOW ... */
> +
> +	return ret;
> +}
> +
> +#ifdef CONFIG_X86_64
> +
> +#error "CONFIG_X86_64 unsupported yet."
> +
> +#else	/* !CONFIG_X86_64 */
> +
> +static void cr_save_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	struct thread_struct *thread = &t->thread;
> +	struct pt_regs *regs = task_pt_regs(t);
> +
> +	hh->bp = regs->bp;
> +	hh->bx = regs->bx;
> +	hh->ax = regs->ax;
> +	hh->cx = regs->cx;
> +	hh->dx = regs->dx;
> +	hh->si = regs->si;
> +	hh->di = regs->di;
> +	hh->orig_ax = regs->orig_ax;
> +	hh->ip = regs->ip;
> +	hh->cs = regs->cs;
> +	hh->flags = regs->flags;
> +	hh->sp = regs->sp;
> +	hh->ss = regs->ss;
> +
> +	hh->ds = regs->ds;
> +	hh->es = regs->es;
> +
> +	/*
> +	 * for checkpoint in process context (from within a container)
> +	 * the GS and FS registers should be saved from the hardware;
> +	 * otherwise they are already sabed on the thread structure
> +	 */
> +	if (t == current) {
> +		savesegment(gs, hh->gs);
> +		savesegment(fs, hh->fs);
> +	} else {
> +		hh->gs = thread->gs;
> +		hh->fs = thread->fs;
> +	}
> +
> +	/*
> +	 * for checkpoint in process context (from within a container),
> +	 * the actual syscall is taking place at this very moment; so
> +	 * we (optimistically) subtitute the future return value (0) of
> +	 * this syscall into the orig_eax, so that upon restart it will
> +	 * succeed (or it will endlessly retry checkpoint...)
> +	 */
> +	if (t == current) {
> +		BUG_ON(hh->orig_ax < 0);
> +		hh->ax = 0;
> +	}
> +}
> +
> +static void cr_save_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	struct thread_struct *thread = &t->thread;
> +
> +	/* debug regs */
> +
> +	/*
> +	 * for checkpoint in process context (from within a container),
> +	 * get the actual registers; otherwise get the saved values.
> +	 */
> +
> +	if (t == current) {
> +		get_debugreg(hh->debugreg0, 0);
> +		get_debugreg(hh->debugreg1, 1);
> +		get_debugreg(hh->debugreg2, 2);
> +		get_debugreg(hh->debugreg3, 3);
> +		get_debugreg(hh->debugreg6, 6);
> +		get_debugreg(hh->debugreg7, 7);
> +	} else {
> +		hh->debugreg0 = thread->debugreg0;
> +		hh->debugreg1 = thread->debugreg1;
> +		hh->debugreg2 = thread->debugreg2;
> +		hh->debugreg3 = thread->debugreg3;
> +		hh->debugreg6 = thread->debugreg6;
> +		hh->debugreg7 = thread->debugreg7;
> +	}
> +
> +	hh->debugreg4 = 0;
> +	hh->debugreg5 = 0;
> +
> +	hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
> +}
> +
> +static void cr_save_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	hh->used_math = tsk_used_math(t) ? 1 : 0;
> +}
> +
> +static int cr_write_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
> +
> +	/* i387 + MMU + SSE logic */
> +	preempt_disable();	/* needed it (t == current) */
> +
> +	/*
> +	 * normally, no need to unlazy_fpu(), since TS_USEDFPU flag
> +	 * have been cleared when task was context-switched out...
> +	 * except if we are in process context, in which case we do
> +	 */
> +	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
> +		unlazy_fpu(current);
> +
> +	memcpy(xstate_buf, t->thread.xstate, xstate_size);

It is probably better to be very deliberate about which registers we are 
dumping, from a traceability and compatibility point of view.

> +	preempt_enable();	/* needed it (t == current) */
> +
> +	return cr_kwrite(ctx, xstate_buf, xstate_size);

Missed cr_hbuf_put()
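
Roughly the shape I would expect, shown as a self-contained user-space mock
and not as a replacement hunk: the buffer is released on both the success
and the failure path before returning. 'struct mock_ctx', mock_hbuf_get(),
mock_hbuf_put(), mock_kwrite() and write_state() are stand-ins invented for
this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct mock_ctx {
	char hbuf[256];
	size_t hpos;		/* bytes currently reserved */
	int fail_write;		/* set to simulate a failing write */
};

static void *mock_hbuf_get(struct mock_ctx *ctx, size_t n)
{
	void *ptr;

	if (n > sizeof(ctx->hbuf) - ctx->hpos)
		return NULL;
	ptr = &ctx->hbuf[ctx->hpos];
	ctx->hpos += n;
	return ptr;
}

static void mock_hbuf_put(struct mock_ctx *ctx, size_t n)
{
	assert(n <= ctx->hpos);
	ctx->hpos -= n;
}

/* Stand-in for the image write: succeeds unless fail_write is set. */
static int mock_kwrite(struct mock_ctx *ctx, const void *buf, size_t n)
{
	(void)buf;
	(void)n;
	return ctx->fail_write ? -EIO : 0;
}

/* Copy the state into a reserved buffer, write it, and always release it. */
static int write_state(struct mock_ctx *ctx, const void *state, size_t n)
{
	void *buf = mock_hbuf_get(ctx, n);
	int ret;

	if (!buf)
		return -ENOMEM;
	memcpy(buf, state, n);
	ret = mock_kwrite(ctx, buf, n);
	mock_hbuf_put(ctx, n);	/* released on success and on failure */
	return ret;
}
```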

> +}
> +
> +#endif	/* CONFIG_X86_64 */
> +
> +/* dump the cpu state and registers of a given task */
> +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int ret;
> +
> +	h.type = CR_HDR_CPU;
> +	h.len = sizeof(*hh);
> +	h.parent = task_pid_vnr(t);
> +
> +	cr_save_cpu_regs(hh, t);
> +	cr_save_cpu_debug(hh, t);
> +	cr_save_cpu_fpu(hh, t);
> +
> +	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (hh->used_math)
> +		ret = cr_write_cpu_fpu(ctx, t);
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +int cr_write_head_arch(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int ret;
> +
> +	h.type = CR_HDR_HEAD_ARCH;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	/* FPU capabilities */
> +	hh->has_fxsr = cpu_has_fxsr;
> +	hh->has_xsave = cpu_has_xsave;
> +	hh->xstate_size = xstate_size;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +
> +	return ret;
> +}
> diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
> new file mode 100644
> index 0000000..45ad790
> --- /dev/null
> +++ b/arch/x86/mm/restart.c
> @@ -0,0 +1,232 @@
> +/*
> + *  Checkpoint/restart - architecture specific support for x86
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <asm/desc.h>
> +#include <asm/i387.h>
> +
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
> +/* read the thread_struct into the current task */
> +int cr_read_thread(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct task_struct *t = current;
> +	struct thread_struct *thread = &t->thread;
> +	int parent, ret;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
> +	if (parent < 0) {
> +		ret = parent;
> +		goto out;
> +	}
> +
> +	ret = -EINVAL;
> +
> +#if 0	/* activate when containers are used */
> +	if (parent != task_pid_vnr(t))
> +		goto out;
> +#endif
> +	cr_debug("ntls %d\n", hh->ntls);
> +
> +	if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
> +	    hh->sizeof_tls_array != sizeof(thread->tls_array) ||
> +	    hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
> +		goto out;
> +
> +	if (hh->ntls > 0) {
> +		struct desc_struct *desc;
> +		int size, cpu;
> +
> +		/*
> +		 * restore TLS by hand: why convert to struct user_desc if
> +		 * sys_set_thread_entry() will convert it back ?
> +		 */
> +
> +		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
> +		desc = kmalloc(size, GFP_KERNEL);
> +		if (!desc)

cr_hbuf_put() here.

> +			return -ENOMEM;
> +
> +		ret = cr_kread(ctx, desc, size);
> +		if (ret >= 0) {

if (ret == 0)

> +			/*
> +			 * FIX: add sanity checks (eg. that values makes
> +			 * sense, that we don't overwrite old values, etc
> +			 */
> +			cpu = get_cpu();
> +			memcpy(thread->tls_array, desc, size);
> +			load_TLS(thread, cpu);
> +			put_cpu();
> +		}
> +		kfree(desc);
> +	}
> +
> +	ret = 0;
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +#ifdef CONFIG_X86_64
> +
> +#error "CONFIG_X86_64 unsupported yet."
> +
> +#else	/* !CONFIG_X86_64 */
> +
> +static int cr_load_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	struct thread_struct *thread = &t->thread;
> +	struct pt_regs *regs = task_pt_regs(t);
> +
> +	regs->bx = hh->bx;
> +	regs->cx = hh->cx;
> +	regs->dx = hh->dx;
> +	regs->si = hh->si;
> +	regs->di = hh->di;
> +	regs->bp = hh->bp;
> +	regs->ax = hh->ax;
> +	regs->ds = hh->ds;
> +	regs->es = hh->es;
> +	regs->orig_ax = hh->orig_ax;
> +	regs->ip = hh->ip;
> +	regs->cs = hh->cs;
> +	regs->flags = hh->flags;
> +	regs->sp = hh->sp;
> +	regs->ss = hh->ss;
> +
> +	thread->gs = hh->gs;
> +	thread->fs = hh->fs;
> +	loadsegment(gs, hh->gs);
> +	loadsegment(fs, hh->fs);
> +
> +	return 0;
> +}
> +
> +static int cr_load_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	/* debug regs */
> +
> +	if (hh->uses_debug) {
> +		set_debugreg(hh->debugreg0, 0);
> +		set_debugreg(hh->debugreg1, 1);
> +		/* ignore 4, 5 */
> +		set_debugreg(hh->debugreg2, 2);
> +		set_debugreg(hh->debugreg3, 3);
> +		set_debugreg(hh->debugreg6, 6);
> +		set_debugreg(hh->debugreg7, 7);
> +	}
> +
> +	return 0;
> +}
> +
> +static int cr_load_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	preempt_disable();
> +
> +	__clear_fpu(t);		/* in case we used FPU in user mode */
> +
> +	if (!hh->used_math)
> +		clear_used_math();
> +
> +	preempt_enable();
> +	return 0;
> +}
> +
> +static int cr_read_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
> +	int ret;
> +
> +	ret = cr_kread(ctx, xstate_buf, xstate_size);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* i387 + MMU + SSE */
> +	preempt_disable();
> +
> +	/* init_fpu() also calls set_used_math() */
> +	ret = init_fpu(current);
> +	if (ret < 0)
> +		return ret;
> +
> +	memcpy(t->thread.xstate, xstate_buf, xstate_size);
> +	preempt_enable();
> + out:
> +	cr_hbuf_put(ctx, xstate_size);
> +	return 0;
> +}
> +
> +#endif	/* CONFIG_X86_64 */
> +
> +/* read the cpu state and registers for the current task */
> +int cr_read_cpu(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct task_struct *t = current;
> +	int parent, ret;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
> +	if (parent < 0) {
> +		ret = parent;
> +		goto out;
> +	}
> +
> +	ret = -EINVAL;
> +
> +#if 0	/* activate when containers are used */
> +	if (parent != task_pid_vnr(t))
> +		goto out;
> +#endif
> +	/* FIX: sanity check for sensitive registers (eg. eflags) */
> +
> +	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
> +
> +	ret = cr_load_cpu_regs(hh, t);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_load_cpu_debug(hh, t);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_load_cpu_fpu(hh, t);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (hh->used_math)
> +		ret = cr_read_cpu_fpu(ctx, t);
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +int cr_read_head_arch(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int parent, ret = 0;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD_ARCH);
> +	if (parent < 0) {
> +		ret = parent;
> +		goto out;
> +	} else if (parent != 0)
> +		goto out;
> +
> +	/* FIX: verify compatibility of architecture features */
> +
> +	/* verify FPU capabilities */
> +	if (hh->has_fxsr != cpu_has_fxsr ||
> +	    hh->has_xsave != cpu_has_xsave ||
> +	    hh->xstate_size != xstate_size)
> +		ret = -EINVAL;
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +
> +	return ret;
> +}
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> index fccf723..17cc8d2 100644
> --- a/checkpoint/checkpoint.c
> +++ b/checkpoint/checkpoint.c
> @@ -20,6 +20,8 @@
>  #include <linux/checkpoint.h>
>  #include <linux/checkpoint_hdr.h>
>  
> +#include "checkpoint_arch.h"
> +
>  /* unique checkpoint identifier (FIXME: should be per-container ?) */
>  static atomic_t cr_ctx_count = ATOMIC_INIT(0);
>  
> @@ -105,7 +107,10 @@ static int cr_write_head(struct cr_ctx *ctx)
>  
>  	ret = cr_write_obj(ctx, &h, hh);
>  	cr_hbuf_put(ctx, sizeof(*hh));
> -	return ret;
> +	if (ret < 0)
> +		return ret;
> +
> +	return cr_write_head_arch(ctx);
>  }
>  
>  /* write the checkpoint trailer */
> @@ -160,8 +165,16 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>  	int ret;
>  
>  	ret = cr_write_task_struct(ctx, t);
> -	cr_debug("ret %d\n", ret);
> -
> +	cr_debug("task_struct: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_write_thread(ctx, t);
> +	cr_debug("thread: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_write_cpu(ctx, t);
> +	cr_debug("cpu: ret %d\n", ret);
> + out:
>  	return ret;
>  }
>  
> diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
> new file mode 100644
> index 0000000..ada1369
> --- /dev/null
> +++ b/checkpoint/checkpoint_arch.h
> @@ -0,0 +1,9 @@
> +#include <linux/checkpoint.h>
> +
> +extern int cr_write_head_arch(struct cr_ctx *ctx);
> +extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
> +extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
> +
> +extern int cr_read_head_arch(struct cr_ctx *ctx);
> +extern int cr_read_thread(struct cr_ctx *ctx);
> +extern int cr_read_cpu(struct cr_ctx *ctx);
> diff --git a/checkpoint/restart.c b/checkpoint/restart.c
> index a95d2e8..d74d755 100644
> --- a/checkpoint/restart.c
> +++ b/checkpoint/restart.c
> @@ -15,6 +15,8 @@
>  #include <linux/checkpoint.h>
>  #include <linux/checkpoint_hdr.h>
>  
> +#include "checkpoint_arch.h"
> +
>  /**
>   * cr_read_obj - read a whole record (cr_hdr followed by payload)
>   * @ctx: checkpoint context
> @@ -142,9 +144,9 @@ static int cr_read_head(struct cr_ctx *ctx)
>  
>  	ctx->oflags = hh->flags;
>  
> -	/* FIX: verify compatibility of release, version and machine */
> +	/* FIX: verify compatibility of release, version */
>  
> -	ret = 0;
> +	ret = cr_read_head_arch(ctx);
>   out:
>  	cr_hbuf_put(ctx, sizeof(*hh));
>  	return ret;
> @@ -214,8 +216,17 @@ static int cr_read_task(struct cr_ctx *ctx)
>  	int ret;
>  
>  	ret = cr_read_task_struct(ctx);
> -	cr_debug("ret %d\n", ret);
> +	cr_debug("task_struct: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_read_thread(ctx);
> +	cr_debug("thread: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_read_cpu(ctx);
> +	cr_debug("cpu: ret %d\n", ret);
>  
> + out:
>  	return ret;
>  }
>  
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index 257f87f..b74b5f9 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -12,6 +12,7 @@
>  
>  #include <linux/types.h>
>  #include <linux/utsname.h>
> +#include <asm/checkpoint_hdr.h>
>  
>  /*
>   * To maintain compatibility between 32-bit and 64-bit architecture flavors,
> @@ -30,6 +31,7 @@ struct cr_hdr {
>  /* header types */
>  enum {
>  	CR_HDR_HEAD = 1,
> +	CR_HDR_HEAD_ARCH,
>  	CR_HDR_BUFFER,
>  	CR_HDR_STRING,
>  

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 04/13] x86 support for checkpoint/restart
  2008-12-05 17:31   ` Oren Laadan
@ 2008-12-17  2:19     ` Mike Waychison
  -1 siblings, 0 replies; 133+ messages in thread
From: Mike Waychison @ 2008-12-17  2:19 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linux Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Oren Laadan wrote:
> Add logic to save and restore architecture specific state, including
> thread-specific state, CPU registers and FPU state.
> 
> In addition, architecture capabilities are saved in an architecure
> specific extension of the header (cr_hdr_head_arch); Currently this
> includes only FPU capabilities.
> 
> Currently only x86-32 is supported. Compiling on x86-64 will trigger
> an explicit error.
> 
> Changelog[v9]:
>   - Add arch-specific header that details architecture capabilities;
>     split FPU restore to send capabilities only once.
>   - Test for zero TLS entries in cr_write_thread()
>   - Fix asm/checkpoint_hdr.h so it can be included from user-space
> 
> Changelog[v7]:
>   - Fix save/restore state of FPU
> 
> Changelog[v5]:
>   - Remove preempt_disable() when restoring debug registers
> 
> Changelog[v4]:
>   - Fix header structure alignment
> 
> Changelog[v2]:
>   - Pad header structures to 64 bits to ensure compatibility
>   - Follow Dave Hansen's refactoring of the original post
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> Acked-by: Serge Hallyn <serue@us.ibm.com>
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/checkpoint_hdr.h |   85 ++++++++++++
>  arch/x86/mm/Makefile                  |    2 +
>  arch/x86/mm/checkpoint.c              |  223 +++++++++++++++++++++++++++++++
>  arch/x86/mm/restart.c                 |  232 +++++++++++++++++++++++++++++++++
>  checkpoint/checkpoint.c               |   19 +++-
>  checkpoint/checkpoint_arch.h          |    9 ++
>  checkpoint/restart.c                  |   17 ++-
>  include/linux/checkpoint_hdr.h        |    2 +
>  8 files changed, 583 insertions(+), 6 deletions(-)
>  create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
>  create mode 100644 arch/x86/mm/checkpoint.c
>  create mode 100644 arch/x86/mm/restart.c
>  create mode 100644 checkpoint/checkpoint_arch.h
> 
> diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
> new file mode 100644
> index 0000000..6325062
> --- /dev/null
> +++ b/arch/x86/include/asm/checkpoint_hdr.h
> @@ -0,0 +1,85 @@
> +#ifndef __ASM_X86_CKPT_HDR_H
> +#define __ASM_X86_CKPT_HDR_H
> +/*
> + *  Checkpoint/restart - architecture specific headers x86
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/types.h>
> +
> +/* i387 structure seen from kernel/userspace */
> +#ifdef __KERNEL__
> +#include <asm/processor.h>
> +#else
> +#include <sys/user.h>
> +#endif
> +
> +struct cr_hdr_head_arch {
> +	/* FIXME: add HAVE_HWFP */
> +
> +	__u16 has_fxsr;
> +	__u16 has_xsave;
> +	__u16 xstate_size;
> +	__u16 _pading;
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_thread {
> +	/* FIXME: restart blocks */
> +
> +	__s16 gdt_entry_tls_entries;
> +	__s16 sizeof_tls_array;
> +	__s16 ntls;	/* number of TLS entries to follow */
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_cpu {
> +	/* see struct pt_regs (x86-64) */
> +	__u64 r15;
> +	__u64 r14;
> +	__u64 r13;
> +	__u64 r12;
> +	__u64 bp;
> +	__u64 bx;
> +	__u64 r11;
> +	__u64 r10;
> +	__u64 r9;
> +	__u64 r8;
> +	__u64 ax;
> +	__u64 cx;
> +	__u64 dx;
> +	__u64 si;
> +	__u64 di;
> +	__u64 orig_ax;
> +	__u64 ip;
> +	__u64 cs;
> +	__u64 flags;
> +	__u64 sp;
> +	__u64 ss;
> +
> +	/* segment registers */
> +	__u64 ds;
> +	__u64 es;
> +	__u64 fs;
> +	__u64 gs;
> +
> +	/* debug registers */
> +	__u64 debugreg0;
> +	__u64 debugreg1;
> +	__u64 debugreg2;
> +	__u64 debugreg3;
> +	__u64 debugreg4;
> +	__u64 debugreg5;
> +	__u64 debugreg6;
> +	__u64 debugreg7;
> +
> +	__u32 uses_debug;
> +	__u32 used_math;
> +
> +	/* thread_xstate contents follow (if used_math) */
> +} __attribute__((aligned(8)));
> +
> +#endif /* __ASM_X86_CKPT_HDR__H */
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index fea4565..6527ea2 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -18,3 +18,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
>  obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
>  
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
> +
> +obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
> diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
> new file mode 100644
> index 0000000..8dd6d2d
> --- /dev/null
> +++ b/arch/x86/mm/checkpoint.c
> @@ -0,0 +1,223 @@
> +/*
> + *  Checkpoint/restart - architecture specific support for x86
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <asm/desc.h>
> +#include <asm/i387.h>
> +
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
> +/* dump the thread_struct of a given task */
> +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct thread_struct *thread;
> +	struct desc_struct *desc;
> +	int ntls = 0;
> +	int n, ret;
> +
> +	h.type = CR_HDR_THREAD;
> +	h.len = sizeof(*hh);
> +	h.parent = task_pid_vnr(t);
> +
> +	thread = &t->thread;
> +
> +	/* calculate no. of TLS entries that follow */
> +	desc = thread->tls_array;
> +	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
> +		if (desc->a || desc->b)
> +			ntls++;
> +	}
> +
> +	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
> +	hh->sizeof_tls_array = sizeof(thread->tls_array);
> +	hh->ntls = ntls;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		return ret;
> +
> +	cr_debug("ntls %d\n", ntls);
> +	if (ntls == 0)
> +		return 0;
> +
> +	/* for simplicity dump the entire array, cherry-pick upon restart */
> +	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));

Again, the TLS descriptors in the GDT should be called out explicitly and 
not tied to the in-kernel representation.
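
One way to read this comment, as a minimal userspace sketch rather than a proposed patch: give each TLS entry an explicit, fixed-width record in the image and convert to and from the two 32-bit descriptor words at the boundary. The names cr_tls_entry, cr_tls_unpack() and cr_tls_pack() are hypothetical and do not appear in the patch; only the bit positions are taken from the x86 segment descriptor layout.

```c
#include <stdint.h>

/* Hypothetical image-format record for one TLS slot (not part of the patch). */
struct cr_tls_entry {
	uint32_t index;	/* slot number within the TLS area of the GDT */
	uint32_t base;	/* 32-bit segment base */
	uint32_t limit;	/* 20-bit limit: low 16 bits, plus bits 16-19 */
	uint32_t flags;	/* access byte (bits 8-15) and flag nibble (bits 20-23) */
};

/* Split the low word (a) and high word (b) of a descriptor into fields. */
static struct cr_tls_entry cr_tls_unpack(uint32_t index, uint32_t a, uint32_t b)
{
	struct cr_tls_entry e;

	e.index = index;
	e.base  = (a >> 16) | ((b & 0x000000ffu) << 16) | (b & 0xff000000u);
	e.limit = (a & 0x0000ffffu) | (b & 0x000f0000u);
	e.flags = b & 0x00f0ff00u;
	return e;
}

/* Rebuild the two descriptor words from an image record. */
static void cr_tls_pack(const struct cr_tls_entry *e, uint32_t *a, uint32_t *b)
{
	*a = (e->limit & 0x0000ffffu) | ((e->base & 0x0000ffffu) << 16);
	*b = ((e->base >> 16) & 0x000000ffu) | (e->base & 0xff000000u) |
	     (e->limit & 0x000f0000u) | (e->flags & 0x00f0ff00u);
}
```

With a record like this the restart side can check base, limit and flags field by field before anything reaches the GDT, and the image no longer depends on sizeof(thread->tls_array).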

> +
> +	/* IGNORE RESTART BLOCKS FOR NOW ... */
> +
> +	return ret;
> +}
> +
> +#ifdef CONFIG_X86_64
> +
> +#error "CONFIG_X86_64 unsupported yet."
> +
> +#else	/* !CONFIG_X86_64 */
> +
> +static void cr_save_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	struct thread_struct *thread = &t->thread;
> +	struct pt_regs *regs = task_pt_regs(t);
> +
> +	hh->bp = regs->bp;
> +	hh->bx = regs->bx;
> +	hh->ax = regs->ax;
> +	hh->cx = regs->cx;
> +	hh->dx = regs->dx;
> +	hh->si = regs->si;
> +	hh->di = regs->di;
> +	hh->orig_ax = regs->orig_ax;
> +	hh->ip = regs->ip;
> +	hh->cs = regs->cs;
> +	hh->flags = regs->flags;
> +	hh->sp = regs->sp;
> +	hh->ss = regs->ss;
> +
> +	hh->ds = regs->ds;
> +	hh->es = regs->es;
> +
> +	/*
> +	 * for checkpoint in process context (from within a container)
> +	 * the GS and FS registers should be saved from the hardware;
> +	 * otherwise they are already sabed on the thread structure
> +	 */
> +	if (t == current) {
> +		savesegment(gs, hh->gs);
> +		savesegment(fs, hh->fs);
> +	} else {
> +		hh->gs = thread->gs;
> +		hh->fs = thread->fs;
> +	}
> +
> +	/*
> +	 * for checkpoint in process context (from within a container),
> +	 * the actual syscall is taking place at this very moment; so
> +	 * we (optimistically) subtitute the future return value (0) of
> +	 * this syscall into the orig_eax, so that upon restart it will
> +	 * succeed (or it will endlessly retry checkpoint...)
> +	 */
> +	if (t == current) {
> +		BUG_ON(hh->orig_ax < 0);
> +		hh->ax = 0;
> +	}
> +}
> +
> +static void cr_save_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	struct thread_struct *thread = &t->thread;
> +
> +	/* debug regs */
> +
> +	/*
> +	 * for checkpoint in process context (from within a container),
> +	 * get the actual registers; otherwise get the saved values.
> +	 */
> +
> +	if (t == current) {
> +		get_debugreg(hh->debugreg0, 0);
> +		get_debugreg(hh->debugreg1, 1);
> +		get_debugreg(hh->debugreg2, 2);
> +		get_debugreg(hh->debugreg3, 3);
> +		get_debugreg(hh->debugreg6, 6);
> +		get_debugreg(hh->debugreg7, 7);
> +	} else {
> +		hh->debugreg0 = thread->debugreg0;
> +		hh->debugreg1 = thread->debugreg1;
> +		hh->debugreg2 = thread->debugreg2;
> +		hh->debugreg3 = thread->debugreg3;
> +		hh->debugreg6 = thread->debugreg6;
> +		hh->debugreg7 = thread->debugreg7;
> +	}
> +
> +	hh->debugreg4 = 0;
> +	hh->debugreg5 = 0;
> +
> +	hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
> +}
> +
> +static void cr_save_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	hh->used_math = tsk_used_math(t) ? 1 : 0;
> +}
> +
> +static int cr_write_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
> +
> +	/* i387 + MMU + SSE logic */
> +	preempt_disable();	/* needed it (t == current) */
> +
> +	/*
> +	 * normally, no need to unlazy_fpu(), since TS_USEDFPU flag
> +	 * have been cleared when task was context-switched out...
> +	 * except if we are in process context, in which case we do
> +	 */
> +	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
> +		unlazy_fpu(current);
> +
> +	memcpy(xstate_buf, t->thread.xstate, xstate_size);

It would probably be better to be very deliberate about which registers 
we are dumping, from a traceability and compatibility point of view.
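
A minimal sketch of what being deliberate could look like, under the assumption that the payload is tagged with an explicit format instead of being a raw copy of xstate_size bytes. The enum, struct and function names here are hypothetical; the sizes are those of the hardware FSAVE image (108 bytes in 32-bit protected mode) and the FXSAVE image (512 bytes). XSAVE is left out because its size varies by CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical format tags for the FPU payload of a checkpoint image. */
enum cr_fpu_format {
	CR_FPU_NONE   = 0,	/* task never used the FPU, no payload follows */
	CR_FPU_FSAVE  = 1,	/* legacy x87 image as written by FSAVE */
	CR_FPU_FXSAVE = 2,	/* FXSAVE image: x87, MMX and SSE state */
};

/* Hypothetical per-task FPU record header. */
struct cr_fpu_hdr {
	uint32_t format;	/* one of enum cr_fpu_format */
	uint32_t len;		/* payload length in bytes */
};

/* Payload size implied by a format tag, or (size_t)-1 if unknown. */
static size_t cr_fpu_payload_size(uint32_t format)
{
	switch (format) {
	case CR_FPU_NONE:
		return 0;
	case CR_FPU_FSAVE:
		return 108;
	case CR_FPU_FXSAVE:
		return 512;
	default:
		return (size_t)-1;
	}
}

/* Accept a header only if the tag is known and the length matches it. */
static int cr_fpu_hdr_valid(const struct cr_fpu_hdr *h)
{
	size_t want = cr_fpu_payload_size(h->format);

	return want != (size_t)-1 && h->len == want;
}
```

The restart side would then reject an image whose tag or length it does not understand, rather than relying on xstate_size being equal on both machines.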

> +	preempt_enable();	/* needed it (t == current) */
> +
> +	return cr_kwrite(ctx, xstate_buf, xstate_size);

Missed cr_hbuf_put() for xstate_buf.
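
The pairing can be shown with a toy userspace model. It assumes that cr_hbuf_get() and cr_hbuf_put() behave like a push and pop on a scratch stack, which is an assumption about the helpers and not something shown in this patch. struct hbuf, hbuf_get(), hbuf_put(), sink_write() and write_state() are all hypothetical stand-ins.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy model of the header buffer: a fixed scratch area used as a stack. */
struct hbuf {
	size_t pos;			/* bytes currently handed out */
	unsigned char data[1024];
};

/* Hand out n bytes of scratch space, or NULL if it does not fit. */
static void *hbuf_get(struct hbuf *hb, size_t n)
{
	void *p;

	if (n > sizeof(hb->data) - hb->pos)
		return NULL;
	p = hb->data + hb->pos;
	hb->pos += n;
	return p;
}

/* Give back the most recent n bytes. */
static void hbuf_put(struct hbuf *hb, size_t n)
{
	hb->pos -= n;
}

/* Stand-in for cr_kwrite(): fails with -EIO when asked to. */
static int sink_write(const void *buf, size_t n, int fail)
{
	(void)buf;
	(void)n;
	return fail ? -EIO : 0;
}

/* Copy state into scratch space, write it out, always release the space. */
static int write_state(struct hbuf *hb, const void *state, size_t n, int fail)
{
	void *buf = hbuf_get(hb, n);
	int ret;

	if (!buf)
		return -ENOMEM;
	memcpy(buf, state, n);
	ret = sink_write(buf, n, fail);
	hbuf_put(hb, n);	/* released on success and on failure alike */
	return ret;
}
```

In this model, returning straight from the write call, as the quoted function does, would leave pos advanced by n after every call.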

> +}
> +
> +#endif	/* CONFIG_X86_64 */
> +
> +/* dump the cpu state and registers of a given task */
> +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int ret;
> +
> +	h.type = CR_HDR_CPU;
> +	h.len = sizeof(*hh);
> +	h.parent = task_pid_vnr(t);
> +
> +	cr_save_cpu_regs(hh, t);
> +	cr_save_cpu_debug(hh, t);
> +	cr_save_cpu_fpu(hh, t);
> +
> +	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (hh->used_math)
> +		ret = cr_write_cpu_fpu(ctx, t);
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +int cr_write_head_arch(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int ret;
> +
> +	h.type = CR_HDR_HEAD_ARCH;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	/* FPU capabilities */
> +	hh->has_fxsr = cpu_has_fxsr;
> +	hh->has_xsave = cpu_has_xsave;
> +	hh->xstate_size = xstate_size;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +
> +	return ret;
> +}
> diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
> new file mode 100644
> index 0000000..45ad790
> --- /dev/null
> +++ b/arch/x86/mm/restart.c
> @@ -0,0 +1,232 @@
> +/*
> + *  Checkpoint/restart - architecture specific support for x86
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <asm/desc.h>
> +#include <asm/i387.h>
> +
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
> +/* read the thread_struct into the current task */
> +int cr_read_thread(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct task_struct *t = current;
> +	struct thread_struct *thread = &t->thread;
> +	int parent, ret;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
> +	if (parent < 0) {
> +		ret = parent;
> +		goto out;
> +	}
> +
> +	ret = -EINVAL;
> +
> +#if 0	/* activate when containers are used */
> +	if (parent != task_pid_vnr(t))
> +		goto out;
> +#endif
> +	cr_debug("ntls %d\n", hh->ntls);
> +
> +	if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
> +	    hh->sizeof_tls_array != sizeof(thread->tls_array) ||
> +	    hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
> +		goto out;
> +
> +	if (hh->ntls > 0) {
> +		struct desc_struct *desc;
> +		int size, cpu;
> +
> +		/*
> +		 * restore TLS by hand: why convert to struct user_desc if
> +		 * sys_set_thread_entry() will convert it back ?
> +		 */
> +
> +		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
> +		desc = kmalloc(size, GFP_KERNEL);
> +		if (!desc)

cr_hbuf_put() here.

> +			return -ENOMEM;
> +
> +		ret = cr_kread(ctx, desc, size);
> +		if (ret >= 0) {

if (ret == 0)
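
Taken together, the last two comments suggest a single exit path. The following is a userspace sketch, not the patch itself: struct rctx, fake_read() and restore_tls() are hypothetical, NTLS stands in for GDT_ENTRY_TLS_ENTRIES, and the read helper is assumed to return 0 on success and a negative errno on failure. The sketch also hands the read error back to the caller, which goes slightly beyond what the two comments ask for.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define NTLS 3	/* stand-in for GDT_ENTRY_TLS_ENTRIES */

struct desc {
	unsigned int a, b;
};

/* Toy restart context with knobs to force each failure. */
struct rctx {
	int hbuf_held;	/* outstanding header-buffer reservations */
	int fail_alloc;	/* nonzero: pretend the allocation failed */
	int read_ret;	/* value the fake read returns */
};

/* Stand-in for cr_kread(): fills buf with zeroes on success only. */
static int fake_read(struct rctx *c, void *buf, size_t n)
{
	if (c->read_ret == 0)
		memset(buf, 0, n);
	return c->read_ret;
}

static int restore_tls(struct rctx *c, struct desc *tls)
{
	size_t size = sizeof(struct desc) * NTLS;
	struct desc *tmp;
	int ret;

	c->hbuf_held++;			/* models cr_hbuf_get() */

	tmp = c->fail_alloc ? NULL : malloc(size);
	if (!tmp) {
		ret = -ENOMEM;
		goto out;		/* still releases the header buffer */
	}

	ret = fake_read(c, tmp, size);
	if (ret == 0)			/* install only on a clean read */
		memcpy(tls, tmp, size);
	free(tmp);
 out:
	c->hbuf_held--;			/* models cr_hbuf_put() */
	return ret;
}
```

Every path leaves hbuf_held balanced, and a failed read leaves the existing TLS array untouched.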

> +			/*
> +			 * FIX: add sanity checks (eg. that values makes
> +			 * sense, that we don't overwrite old values, etc
> +			 */
> +			cpu = get_cpu();
> +			memcpy(thread->tls_array, desc, size);
> +			load_TLS(thread, cpu);
> +			put_cpu();
> +		}
> +		kfree(desc);
> +	}
> +
> +	ret = 0;
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +#ifdef CONFIG_X86_64
> +
> +#error "CONFIG_X86_64 unsupported yet."
> +
> +#else	/* !CONFIG_X86_64 */
> +
> +static int cr_load_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	struct thread_struct *thread = &t->thread;
> +	struct pt_regs *regs = task_pt_regs(t);
> +
> +	regs->bx = hh->bx;
> +	regs->cx = hh->cx;
> +	regs->dx = hh->dx;
> +	regs->si = hh->si;
> +	regs->di = hh->di;
> +	regs->bp = hh->bp;
> +	regs->ax = hh->ax;
> +	regs->ds = hh->ds;
> +	regs->es = hh->es;
> +	regs->orig_ax = hh->orig_ax;
> +	regs->ip = hh->ip;
> +	regs->cs = hh->cs;
> +	regs->flags = hh->flags;
> +	regs->sp = hh->sp;
> +	regs->ss = hh->ss;
> +
> +	thread->gs = hh->gs;
> +	thread->fs = hh->fs;
> +	loadsegment(gs, hh->gs);
> +	loadsegment(fs, hh->fs);
> +
> +	return 0;
> +}
> +
> +static int cr_load_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	/* debug regs */
> +
> +	if (hh->uses_debug) {
> +		set_debugreg(hh->debugreg0, 0);
> +		set_debugreg(hh->debugreg1, 1);
> +		/* ignore 4, 5 */
> +		set_debugreg(hh->debugreg2, 2);
> +		set_debugreg(hh->debugreg3, 3);
> +		set_debugreg(hh->debugreg6, 6);
> +		set_debugreg(hh->debugreg7, 7);
> +	}
> +
> +	return 0;
> +}
> +
> +static int cr_load_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	preempt_disable();
> +
> +	__clear_fpu(t);		/* in case we used FPU in user mode */
> +
> +	if (!hh->used_math)
> +		clear_used_math();
> +
> +	preempt_enable();
> +	return 0;
> +}
> +
> +static int cr_read_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
> +	int ret;
> +
> +	ret = cr_kread(ctx, xstate_buf, xstate_size);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* i387 + MMU + SSE */
> +	preempt_disable();
> +
> +	/* init_fpu() also calls set_used_math() */
> +	ret = init_fpu(current);
> +	if (ret < 0)
> +		return ret;
> +
> +	memcpy(t->thread.xstate, xstate_buf, xstate_size);
> +	preempt_enable();
> + out:
> +	cr_hbuf_put(ctx, xstate_size);
> +	return 0;
> +}
> +
> +#endif	/* CONFIG_X86_64 */
> +
> +/* read the cpu state and registers for the current task */
> +int cr_read_cpu(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct task_struct *t = current;
> +	int parent, ret;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
> +	if (parent < 0) {
> +		ret = parent;
> +		goto out;
> +	}
> +
> +	ret = -EINVAL;
> +
> +#if 0	/* activate when containers are used */
> +	if (parent != task_pid_vnr(t))
> +		goto out;
> +#endif
> +	/* FIX: sanity check for sensitive registers (eg. eflags) */
> +
> +	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
> +
> +	ret = cr_load_cpu_regs(hh, t);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_load_cpu_debug(hh, t);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_load_cpu_fpu(hh, t);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (hh->used_math)
> +		ret = cr_read_cpu_fpu(ctx, t);
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +int cr_read_head_arch(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int parent, ret = 0;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD_ARCH);
> +	if (parent < 0) {
> +		ret = parent;
> +		goto out;
> +	} else if (parent != 0)
> +		goto out;
> +
> +	/* FIX: verify compatibility of architecture features */
> +
> +	/* verify FPU capabilities */
> +	if (hh->has_fxsr != cpu_has_fxsr ||
> +	    hh->has_xsave != cpu_has_xsave ||
> +	    hh->xstate_size != xstate_size)
> +		ret = -EINVAL;
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +
> +	return ret;
> +}
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> index fccf723..17cc8d2 100644
> --- a/checkpoint/checkpoint.c
> +++ b/checkpoint/checkpoint.c
> @@ -20,6 +20,8 @@
>  #include <linux/checkpoint.h>
>  #include <linux/checkpoint_hdr.h>
>  
> +#include "checkpoint_arch.h"
> +
>  /* unique checkpoint identifier (FIXME: should be per-container ?) */
>  static atomic_t cr_ctx_count = ATOMIC_INIT(0);
>  
> @@ -105,7 +107,10 @@ static int cr_write_head(struct cr_ctx *ctx)
>  
>  	ret = cr_write_obj(ctx, &h, hh);
>  	cr_hbuf_put(ctx, sizeof(*hh));
> -	return ret;
> +	if (ret < 0)
> +		return ret;
> +
> +	return cr_write_head_arch(ctx);
>  }
>  
>  /* write the checkpoint trailer */
> @@ -160,8 +165,16 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>  	int ret;
>  
>  	ret = cr_write_task_struct(ctx, t);
> -	cr_debug("ret %d\n", ret);
> -
> +	cr_debug("task_struct: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_write_thread(ctx, t);
> +	cr_debug("thread: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_write_cpu(ctx, t);
> +	cr_debug("cpu: ret %d\n", ret);
> + out:
>  	return ret;
>  }
>  
> diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
> new file mode 100644
> index 0000000..ada1369
> --- /dev/null
> +++ b/checkpoint/checkpoint_arch.h
> @@ -0,0 +1,9 @@
> +#include <linux/checkpoint.h>
> +
> +extern int cr_write_head_arch(struct cr_ctx *ctx);
> +extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
> +extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
> +
> +extern int cr_read_head_arch(struct cr_ctx *ctx);
> +extern int cr_read_thread(struct cr_ctx *ctx);
> +extern int cr_read_cpu(struct cr_ctx *ctx);
> diff --git a/checkpoint/restart.c b/checkpoint/restart.c
> index a95d2e8..d74d755 100644
> --- a/checkpoint/restart.c
> +++ b/checkpoint/restart.c
> @@ -15,6 +15,8 @@
>  #include <linux/checkpoint.h>
>  #include <linux/checkpoint_hdr.h>
>  
> +#include "checkpoint_arch.h"
> +
>  /**
>   * cr_read_obj - read a whole record (cr_hdr followed by payload)
>   * @ctx: checkpoint context
> @@ -142,9 +144,9 @@ static int cr_read_head(struct cr_ctx *ctx)
>  
>  	ctx->oflags = hh->flags;
>  
> -	/* FIX: verify compatibility of release, version and machine */
> +	/* FIX: verify compatibility of release, version */
>  
> -	ret = 0;
> +	ret = cr_read_head_arch(ctx);
>   out:
>  	cr_hbuf_put(ctx, sizeof(*hh));
>  	return ret;
> @@ -214,8 +216,17 @@ static int cr_read_task(struct cr_ctx *ctx)
>  	int ret;
>  
>  	ret = cr_read_task_struct(ctx);
> -	cr_debug("ret %d\n", ret);
> +	cr_debug("task_struct: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_read_thread(ctx);
> +	cr_debug("thread: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_read_cpu(ctx);
> +	cr_debug("cpu: ret %d\n", ret);
>  
> + out:
>  	return ret;
>  }
>  
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index 257f87f..b74b5f9 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -12,6 +12,7 @@
>  
>  #include <linux/types.h>
>  #include <linux/utsname.h>
> +#include <asm/checkpoint_hdr.h>
>  
>  /*
>   * To maintain compatibility between 32-bit and 64-bit architecture flavors,
> @@ -30,6 +31,7 @@ struct cr_hdr {
>  /* header types */
>  enum {
>  	CR_HDR_HEAD = 1,
> +	CR_HDR_HEAD_ARCH,
>  	CR_HDR_BUFFER,
>  	CR_HDR_STRING,
>  


* Re: [RFC v11][PATCH 04/13] x86 support for checkpoint/restart
@ 2008-12-17  2:19     ` Mike Waychison
  0 siblings, 0 replies; 133+ messages in thread
From: Mike Waychison @ 2008-12-17  2:19 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linux Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Oren Laadan wrote:
> Add logic to save and restore architecture specific state, including
> thread-specific state, CPU registers and FPU state.
> 
> In addition, architecture capabilities are saved in an architecure
> specific extension of the header (cr_hdr_head_arch); Currently this
> includes only FPU capabilities.
> 
> Currently only x86-32 is supported. Compiling on x86-64 will trigger
> an explicit error.
> 
> Changelog[v9]:
>   - Add arch-specific header that details architecture capabilities;
>     split FPU restore to send capabilities only once.
>   - Test for zero TLS entries in cr_write_thread()
>   - Fix asm/checkpoint_hdr.h so it can be included from user-space
> 
> Changelog[v7]:
>   - Fix save/restore state of FPU
> 
> Changelog[v5]:
>   - Remove preempt_disable() when restoring debug registers
> 
> Changelog[v4]:
>   - Fix header structure alignment
> 
> Changelog[v2]:
>   - Pad header structures to 64 bits to ensure compatibility
>   - Follow Dave Hansen's refactoring of the original post
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> Acked-by: Serge Hallyn <serue@us.ibm.com>
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/checkpoint_hdr.h |   85 ++++++++++++
>  arch/x86/mm/Makefile                  |    2 +
>  arch/x86/mm/checkpoint.c              |  223 +++++++++++++++++++++++++++++++
>  arch/x86/mm/restart.c                 |  232 +++++++++++++++++++++++++++++++++
>  checkpoint/checkpoint.c               |   19 +++-
>  checkpoint/checkpoint_arch.h          |    9 ++
>  checkpoint/restart.c                  |   17 ++-
>  include/linux/checkpoint_hdr.h        |    2 +
>  8 files changed, 583 insertions(+), 6 deletions(-)
>  create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
>  create mode 100644 arch/x86/mm/checkpoint.c
>  create mode 100644 arch/x86/mm/restart.c
>  create mode 100644 checkpoint/checkpoint_arch.h
> 
> diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
> new file mode 100644
> index 0000000..6325062
> --- /dev/null
> +++ b/arch/x86/include/asm/checkpoint_hdr.h
> @@ -0,0 +1,85 @@
> +#ifndef __ASM_X86_CKPT_HDR_H
> +#define __ASM_X86_CKPT_HDR_H
> +/*
> + *  Checkpoint/restart - architecture specific headers x86
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/types.h>
> +
> +/* i387 structure seen from kernel/userspace */
> +#ifdef __KERNEL__
> +#include <asm/processor.h>
> +#else
> +#include <sys/user.h>
> +#endif
> +
> +struct cr_hdr_head_arch {
> +	/* FIXME: add HAVE_HWFP */
> +
> +	__u16 has_fxsr;
> +	__u16 has_xsave;
> +	__u16 xstate_size;
> +	__u16 _pading;
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_thread {
> +	/* FIXME: restart blocks */
> +
> +	__s16 gdt_entry_tls_entries;
> +	__s16 sizeof_tls_array;
> +	__s16 ntls;	/* number of TLS entries to follow */
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_cpu {
> +	/* see struct pt_regs (x86-64) */
> +	__u64 r15;
> +	__u64 r14;
> +	__u64 r13;
> +	__u64 r12;
> +	__u64 bp;
> +	__u64 bx;
> +	__u64 r11;
> +	__u64 r10;
> +	__u64 r9;
> +	__u64 r8;
> +	__u64 ax;
> +	__u64 cx;
> +	__u64 dx;
> +	__u64 si;
> +	__u64 di;
> +	__u64 orig_ax;
> +	__u64 ip;
> +	__u64 cs;
> +	__u64 flags;
> +	__u64 sp;
> +	__u64 ss;
> +
> +	/* segment registers */
> +	__u64 ds;
> +	__u64 es;
> +	__u64 fs;
> +	__u64 gs;
> +
> +	/* debug registers */
> +	__u64 debugreg0;
> +	__u64 debugreg1;
> +	__u64 debugreg2;
> +	__u64 debugreg3;
> +	__u64 debugreg4;
> +	__u64 debugreg5;
> +	__u64 debugreg6;
> +	__u64 debugreg7;
> +
> +	__u32 uses_debug;
> +	__u32 used_math;
> +
> +	/* thread_xstate contents follow (if used_math) */
> +} __attribute__((aligned(8)));
> +
> +#endif /* __ASM_X86_CKPT_HDR__H */
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index fea4565..6527ea2 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -18,3 +18,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
>  obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
>  
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
> +
> +obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
> diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
> new file mode 100644
> index 0000000..8dd6d2d
> --- /dev/null
> +++ b/arch/x86/mm/checkpoint.c
> @@ -0,0 +1,223 @@
> +/*
> + *  Checkpoint/restart - architecture specific support for x86
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <asm/desc.h>
> +#include <asm/i387.h>
> +
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
> +/* dump the thread_struct of a given task */
> +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct thread_struct *thread;
> +	struct desc_struct *desc;
> +	int ntls = 0;
> +	int n, ret;
> +
> +	h.type = CR_HDR_THREAD;
> +	h.len = sizeof(*hh);
> +	h.parent = task_pid_vnr(t);
> +
> +	thread = &t->thread;
> +
> +	/* calculate no. of TLS entries that follow */
> +	desc = thread->tls_array;
> +	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
> +		if (desc->a || desc->b)
> +			ntls++;
> +	}
> +
> +	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
> +	hh->sizeof_tls_array = sizeof(thread->tls_array);
> +	hh->ntls = ntls;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		return ret;
> +
> +	cr_debug("ntls %d\n", ntls);
> +	if (ntls == 0)
> +		return 0;
> +
> +	/* for simplicity dump the entire array, cherry-pick upon restart */
> +	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));

Again, the TLS descriptors in the GDT should be called out explicitly and 
not tied to the in-kernel representation.

> +
> +	/* IGNORE RESTART BLOCKS FOR NOW ... */
> +
> +	return ret;
> +}
> +
> +#ifdef CONFIG_X86_64
> +
> +#error "CONFIG_X86_64 unsupported yet."
> +
> +#else	/* !CONFIG_X86_64 */
> +
> +static void cr_save_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	struct thread_struct *thread = &t->thread;
> +	struct pt_regs *regs = task_pt_regs(t);
> +
> +	hh->bp = regs->bp;
> +	hh->bx = regs->bx;
> +	hh->ax = regs->ax;
> +	hh->cx = regs->cx;
> +	hh->dx = regs->dx;
> +	hh->si = regs->si;
> +	hh->di = regs->di;
> +	hh->orig_ax = regs->orig_ax;
> +	hh->ip = regs->ip;
> +	hh->cs = regs->cs;
> +	hh->flags = regs->flags;
> +	hh->sp = regs->sp;
> +	hh->ss = regs->ss;
> +
> +	hh->ds = regs->ds;
> +	hh->es = regs->es;
> +
> +	/*
> +	 * for checkpoint in process context (from within a container)
> +	 * the GS and FS registers should be saved from the hardware;
> +	 * otherwise they are already sabed on the thread structure
> +	 */
> +	if (t == current) {
> +		savesegment(gs, hh->gs);
> +		savesegment(fs, hh->fs);
> +	} else {
> +		hh->gs = thread->gs;
> +		hh->fs = thread->fs;
> +	}
> +
> +	/*
> +	 * for checkpoint in process context (from within a container),
> +	 * the actual syscall is taking place at this very moment; so
> +	 * we (optimistically) subtitute the future return value (0) of
> +	 * this syscall into the orig_eax, so that upon restart it will
> +	 * succeed (or it will endlessly retry checkpoint...)
> +	 */
> +	if (t == current) {
> +		BUG_ON(hh->orig_ax < 0);
> +		hh->ax = 0;
> +	}
> +}
> +
> +static void cr_save_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	struct thread_struct *thread = &t->thread;
> +
> +	/* debug regs */
> +
> +	/*
> +	 * for checkpoint in process context (from within a container),
> +	 * get the actual registers; otherwise get the saved values.
> +	 */
> +
> +	if (t == current) {
> +		get_debugreg(hh->debugreg0, 0);
> +		get_debugreg(hh->debugreg1, 1);
> +		get_debugreg(hh->debugreg2, 2);
> +		get_debugreg(hh->debugreg3, 3);
> +		get_debugreg(hh->debugreg6, 6);
> +		get_debugreg(hh->debugreg7, 7);
> +	} else {
> +		hh->debugreg0 = thread->debugreg0;
> +		hh->debugreg1 = thread->debugreg1;
> +		hh->debugreg2 = thread->debugreg2;
> +		hh->debugreg3 = thread->debugreg3;
> +		hh->debugreg6 = thread->debugreg6;
> +		hh->debugreg7 = thread->debugreg7;
> +	}
> +
> +	hh->debugreg4 = 0;
> +	hh->debugreg5 = 0;
> +
> +	hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
> +}
> +
> +static void cr_save_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	hh->used_math = tsk_used_math(t) ? 1 : 0;
> +}
> +
> +static int cr_write_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
> +
> +	/* i387 + MMU + SSE logic */
> +	preempt_disable();	/* needed it (t == current) */
> +
> +	/*
> +	 * normally, no need to unlazy_fpu(), since TS_USEDFPU flag
> +	 * have been cleared when task was context-switched out...
> +	 * except if we are in process context, in which case we do
> +	 */
> +	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
> +		unlazy_fpu(current);
> +
> +	memcpy(xstate_buf, t->thread.xstate, xstate_size);

It would probably be better to be very deliberate about which registers 
we are dumping, from a traceability and compatibility point of view.

> +	preempt_enable();	/* needed it (t == current) */
> +
> +	return cr_kwrite(ctx, xstate_buf, xstate_size);

Missed cr_hbuf_put() for xstate_buf.

> +}
> +
> +#endif	/* CONFIG_X86_64 */
> +
> +/* dump the cpu state and registers of a given task */
> +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int ret;
> +
> +	h.type = CR_HDR_CPU;
> +	h.len = sizeof(*hh);
> +	h.parent = task_pid_vnr(t);
> +
> +	cr_save_cpu_regs(hh, t);
> +	cr_save_cpu_debug(hh, t);
> +	cr_save_cpu_fpu(hh, t);
> +
> +	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (hh->used_math)
> +		ret = cr_write_cpu_fpu(ctx, t);
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +int cr_write_head_arch(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int ret;
> +
> +	h.type = CR_HDR_HEAD_ARCH;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	/* FPU capabilities */
> +	hh->has_fxsr = cpu_has_fxsr;
> +	hh->has_xsave = cpu_has_xsave;
> +	hh->xstate_size = xstate_size;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +
> +	return ret;
> +}
> diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
> new file mode 100644
> index 0000000..45ad790
> --- /dev/null
> +++ b/arch/x86/mm/restart.c
> @@ -0,0 +1,232 @@
> +/*
> + *  Checkpoint/restart - architecture specific support for x86
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <asm/desc.h>
> +#include <asm/i387.h>
> +
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
> +/* read the thread_struct into the current task */
> +int cr_read_thread(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct task_struct *t = current;
> +	struct thread_struct *thread = &t->thread;
> +	int parent, ret;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
> +	if (parent < 0) {
> +		ret = parent;
> +		goto out;
> +	}
> +
> +	ret = -EINVAL;
> +
> +#if 0	/* activate when containers are used */
> +	if (parent != task_pid_vnr(t))
> +		goto out;
> +#endif
> +	cr_debug("ntls %d\n", hh->ntls);
> +
> +	if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
> +	    hh->sizeof_tls_array != sizeof(thread->tls_array) ||
> +	    hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
> +		goto out;
> +
> +	if (hh->ntls > 0) {
> +		struct desc_struct *desc;
> +		int size, cpu;
> +
> +		/*
> +		 * restore TLS by hand: why convert to struct user_desc if
> +		 * sys_set_thread_entry() will convert it back ?
> +		 */
> +
> +		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
> +		desc = kmalloc(size, GFP_KERNEL);
> +		if (!desc)

cr_hbuf_put() here.

> +			return -ENOMEM;
> +
> +		ret = cr_kread(ctx, desc, size);
> +		if (ret >= 0) {

if (ret == 0)

> +			/*
> +			 * FIX: add sanity checks (eg. that values makes
> +			 * sense, that we don't overwrite old values, etc
> +			 */
> +			cpu = get_cpu();
> +			memcpy(thread->tls_array, desc, size);
> +			load_TLS(thread, cpu);
> +			put_cpu();
> +		}
> +		kfree(desc);
> +	}
> +
> +	ret = 0;
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +#ifdef CONFIG_X86_64
> +
> +#error "CONFIG_X86_64 unsupported yet."
> +
> +#else	/* !CONFIG_X86_64 */
> +
> +static int cr_load_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	struct thread_struct *thread = &t->thread;
> +	struct pt_regs *regs = task_pt_regs(t);
> +
> +	regs->bx = hh->bx;
> +	regs->cx = hh->cx;
> +	regs->dx = hh->dx;
> +	regs->si = hh->si;
> +	regs->di = hh->di;
> +	regs->bp = hh->bp;
> +	regs->ax = hh->ax;
> +	regs->ds = hh->ds;
> +	regs->es = hh->es;
> +	regs->orig_ax = hh->orig_ax;
> +	regs->ip = hh->ip;
> +	regs->cs = hh->cs;
> +	regs->flags = hh->flags;
> +	regs->sp = hh->sp;
> +	regs->ss = hh->ss;
> +
> +	thread->gs = hh->gs;
> +	thread->fs = hh->fs;
> +	loadsegment(gs, hh->gs);
> +	loadsegment(fs, hh->fs);
> +
> +	return 0;
> +}
> +
> +static int cr_load_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	/* debug regs */
> +
> +	if (hh->uses_debug) {
> +		set_debugreg(hh->debugreg0, 0);
> +		set_debugreg(hh->debugreg1, 1);
> +		/* ignore 4, 5 */
> +		set_debugreg(hh->debugreg2, 2);
> +		set_debugreg(hh->debugreg3, 3);
> +		set_debugreg(hh->debugreg6, 6);
> +		set_debugreg(hh->debugreg7, 7);
> +	}
> +
> +	return 0;
> +}
> +
> +static int cr_load_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
> +{
> +	preempt_disable();
> +
> +	__clear_fpu(t);		/* in case we used FPU in user mode */
> +
> +	if (!hh->used_math)
> +		clear_used_math();
> +
> +	preempt_enable();
> +	return 0;
> +}
> +
> +static int cr_read_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
> +	int ret;
> +
> +	ret = cr_kread(ctx, xstate_buf, xstate_size);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* i387 + MMU + SSE */
> +	preempt_disable();
> +
> +	/* init_fpu() also calls set_used_math() */
> +	ret = init_fpu(current);
> +	if (ret < 0)
> +		return ret;
> +
> +	memcpy(t->thread.xstate, xstate_buf, xstate_size);
> +	preempt_enable();
> + out:
> +	cr_hbuf_put(ctx, xstate_size);
> +	return 0;
> +}
> +
> +#endif	/* CONFIG_X86_64 */
> +
> +/* read the cpu state and registers for the current task */
> +int cr_read_cpu(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct task_struct *t = current;
> +	int parent, ret;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
> +	if (parent < 0) {
> +		ret = parent;
> +		goto out;
> +	}
> +
> +	ret = -EINVAL;
> +
> +#if 0	/* activate when containers are used */
> +	if (parent != task_pid_vnr(t))
> +		goto out;
> +#endif
> +	/* FIX: sanity check for sensitive registers (eg. eflags) */
> +
> +	cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
> +
> +	ret = cr_load_cpu_regs(hh, t);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_load_cpu_debug(hh, t);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_load_cpu_fpu(hh, t);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (hh->used_math)
> +		ret = cr_read_cpu_fpu(ctx, t);
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	return ret;
> +}
> +
> +int cr_read_head_arch(struct cr_ctx *ctx)
> +{
> +	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int parent, ret = 0;
> +
> +	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD_ARCH);
> +	if (parent < 0) {
> +		ret = parent;
> +		goto out;
> +	} else if (parent != 0)
> +		goto out;
> +
> +	/* FIX: verify compatibility of architecture features */
> +
> +	/* verify FPU capabilities */
> +	if (hh->has_fxsr != cpu_has_fxsr ||
> +	    hh->has_xsave != cpu_has_xsave ||
> +	    hh->xstate_size != xstate_size)
> +		ret = -EINVAL;
> + out:
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +
> +	return ret;
> +}
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> index fccf723..17cc8d2 100644
> --- a/checkpoint/checkpoint.c
> +++ b/checkpoint/checkpoint.c
> @@ -20,6 +20,8 @@
>  #include <linux/checkpoint.h>
>  #include <linux/checkpoint_hdr.h>
>  
> +#include "checkpoint_arch.h"
> +
>  /* unique checkpoint identifier (FIXME: should be per-container ?) */
>  static atomic_t cr_ctx_count = ATOMIC_INIT(0);
>  
> @@ -105,7 +107,10 @@ static int cr_write_head(struct cr_ctx *ctx)
>  
>  	ret = cr_write_obj(ctx, &h, hh);
>  	cr_hbuf_put(ctx, sizeof(*hh));
> -	return ret;
> +	if (ret < 0)
> +		return ret;
> +
> +	return cr_write_head_arch(ctx);
>  }
>  
>  /* write the checkpoint trailer */
> @@ -160,8 +165,16 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>  	int ret;
>  
>  	ret = cr_write_task_struct(ctx, t);
> -	cr_debug("ret %d\n", ret);
> -
> +	cr_debug("task_struct: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_write_thread(ctx, t);
> +	cr_debug("thread: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_write_cpu(ctx, t);
> +	cr_debug("cpu: ret %d\n", ret);
> + out:
>  	return ret;
>  }
>  
> diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
> new file mode 100644
> index 0000000..ada1369
> --- /dev/null
> +++ b/checkpoint/checkpoint_arch.h
> @@ -0,0 +1,9 @@
> +#include <linux/checkpoint.h>
> +
> +extern int cr_write_head_arch(struct cr_ctx *ctx);
> +extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
> +extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
> +
> +extern int cr_read_head_arch(struct cr_ctx *ctx);
> +extern int cr_read_thread(struct cr_ctx *ctx);
> +extern int cr_read_cpu(struct cr_ctx *ctx);
> diff --git a/checkpoint/restart.c b/checkpoint/restart.c
> index a95d2e8..d74d755 100644
> --- a/checkpoint/restart.c
> +++ b/checkpoint/restart.c
> @@ -15,6 +15,8 @@
>  #include <linux/checkpoint.h>
>  #include <linux/checkpoint_hdr.h>
>  
> +#include "checkpoint_arch.h"
> +
>  /**
>   * cr_read_obj - read a whole record (cr_hdr followed by payload)
>   * @ctx: checkpoint context
> @@ -142,9 +144,9 @@ static int cr_read_head(struct cr_ctx *ctx)
>  
>  	ctx->oflags = hh->flags;
>  
> -	/* FIX: verify compatibility of release, version and machine */
> +	/* FIX: verify compatibility of release, version */
>  
> -	ret = 0;
> +	ret = cr_read_head_arch(ctx);
>   out:
>  	cr_hbuf_put(ctx, sizeof(*hh));
>  	return ret;
> @@ -214,8 +216,17 @@ static int cr_read_task(struct cr_ctx *ctx)
>  	int ret;
>  
>  	ret = cr_read_task_struct(ctx);
> -	cr_debug("ret %d\n", ret);
> +	cr_debug("task_struct: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_read_thread(ctx);
> +	cr_debug("thread: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
> +	ret = cr_read_cpu(ctx);
> +	cr_debug("cpu: ret %d\n", ret);
>  
> + out:
>  	return ret;
>  }
>  
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index 257f87f..b74b5f9 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -12,6 +12,7 @@
>  
>  #include <linux/types.h>
>  #include <linux/utsname.h>
> +#include <asm/checkpoint_hdr.h>
>  
>  /*
>   * To maintain compatibility between 32-bit and 64-bit architecture flavors,
> @@ -30,6 +31,7 @@ struct cr_hdr {
>  /* header types */
>  enum {
>  	CR_HDR_HEAD = 1,
> +	CR_HDR_HEAD_ARCH,
>  	CR_HDR_BUFFER,
>  	CR_HDR_STRING,
>  

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 04/13] x86 support for checkpoint/restart
       [not found]     ` <494861CA.8000403-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2008-12-17 15:23       ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-17 15:23 UTC (permalink / raw)
  To: Mike Waychison
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linus Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar



Mike Waychison wrote:
> Oren Laadan wrote:
>> Add logic to save and restore architecture specific state, including
>> thread-specific state, CPU registers and FPU state.
>>
>> In addition, architecture capabilities are saved in an architecture
>> specific extension of the header (cr_hdr_head_arch); currently this
>> includes only FPU capabilities.
>>
>> Currently only x86-32 is supported. Compiling on x86-64 will trigger
>> an explicit error.
>>

[...]

>> +
>> +    hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
>> +    hh->sizeof_tls_array = sizeof(thread->tls_array);
>> +    hh->ntls = ntls;
>> +
>> +    ret = cr_write_obj(ctx, &h, hh);
>> +    cr_hbuf_put(ctx, sizeof(*hh));
>> +    if (ret < 0)
>> +        return ret;
>> +
>> +    cr_debug("ntls %d\n", ntls);
>> +    if (ntls == 0)
>> +        return 0;
>> +
>> +    /* for simplicity dump the entire array, cherry-pick upon restart */
>> +    ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
> 
> Again, the TLS descriptors in the GDT should be called out and not
> tied to the in-kernel representation.

True. I'll add a 'FIXME' comment.

However, I have yet to see a case where this breaks between x86_32 kernels,
and I'm not enough of an expert in that area to tell whether it could.
(Moving from x86_32 to x86_64 is another story, and will require some
compatibility layer anyway.)
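
For illustration only, one way to decouple the image from the in-kernel
layout is to unpack each 8-byte descriptor into explicitly named,
fixed-width fields before writing it out. The sketch below is userspace C,
not code from the patch: 'struct cr_tls_desc' and both helpers are
hypothetical names, and the only thing assumed is the standard x86 segment
descriptor bit layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical image record for one TLS descriptor (not from the patch). */
struct cr_tls_desc {
	uint32_t base;      /* segment base address */
	uint32_t limit;     /* 20-bit limit, unit selected by 'gran' */
	uint8_t  type;      /* 4-bit segment type */
	uint8_t  s;         /* 1 = code/data segment, 0 = system */
	uint8_t  dpl;       /* descriptor privilege level (0..3) */
	uint8_t  present;
	uint8_t  avl;
	uint8_t  long_mode; /* L bit, always 0 on x86-32 */
	uint8_t  seg_32bit; /* D/B bit */
	uint8_t  gran;      /* 1 = limit counted in 4 KiB units */
};

/* Split a raw 8-byte GDT descriptor into named fields. */
static void cr_tls_unpack(uint64_t raw, struct cr_tls_desc *d)
{
	d->limit = (uint32_t)(raw & 0xffff) |
		   ((uint32_t)((raw >> 48) & 0xf) << 16);
	d->base = (uint32_t)((raw >> 16) & 0xffffff) |
		  ((uint32_t)((raw >> 56) & 0xff) << 24);
	d->type = (raw >> 40) & 0xf;
	d->s = (raw >> 44) & 1;
	d->dpl = (raw >> 45) & 3;
	d->present = (raw >> 47) & 1;
	d->avl = (raw >> 52) & 1;
	d->long_mode = (raw >> 53) & 1;
	d->seg_32bit = (raw >> 54) & 1;
	d->gran = (raw >> 55) & 1;
}

/* Rebuild the raw descriptor from the named fields (restart side). */
static uint64_t cr_tls_pack(const struct cr_tls_desc *d)
{
	uint64_t raw = 0;

	assert(d->limit <= 0xfffff);    /* limit is only 20 bits wide */
	raw |= (uint64_t)(d->limit & 0xffff);
	raw |= (uint64_t)(d->base & 0xffffff) << 16;
	raw |= (uint64_t)(d->type & 0xf) << 40;
	raw |= (uint64_t)(d->s & 1) << 44;
	raw |= (uint64_t)(d->dpl & 3) << 45;
	raw |= (uint64_t)(d->present & 1) << 47;
	raw |= (uint64_t)((d->limit >> 16) & 0xf) << 48;
	raw |= (uint64_t)(d->avl & 1) << 52;
	raw |= (uint64_t)(d->long_mode & 1) << 53;
	raw |= (uint64_t)(d->seg_32bit & 1) << 54;
	raw |= (uint64_t)(d->gran & 1) << 55;
	raw |= (uint64_t)(d->base >> 24) << 56;
	return raw;
}
```

With named fields the restart side can also sanity-check each value (for
example, refuse a descriptor whose dpl is not 3) before loading it.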

> 
>> +
>> +    /* IGNORE RESTART BLOCKS FOR NOW ... */

[...]

>> +static int cr_write_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
>> +{
>> +    void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
>> +
>> +    /* i387 + MMU + SSE logic */
>> +    preempt_disable();    /* needed it (t == current) */
>> +
>> +    /*
>> +     * normally, no need to unlazy_fpu(), since TS_USEDFPU flag
>> +     * have been cleared when task was context-switched out...
>> +     * except if we are in process context, in which case we do
>> +     */
>> +    if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
>> +        unlazy_fpu(current);
>> +
>> +    memcpy(xstate_buf, t->thread.xstate, xstate_size);
> 
> This is probably better off being very deliberate about what registers
> we are dumping from a traceability and compatibility point of view?

Same here.

> 
>> +    preempt_enable();    /* needed it (t == current) */
>> +
>> +    return cr_kwrite(ctx, xstate_buf, xstate_size);
> 
> Missed cr_hbuf_put()

Ooops ... will fix.
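
A minimal userspace sketch of the balanced pattern, for reference. The
cr_hbuf_get()/cr_hbuf_put() bodies below are a toy bump allocator written
for this example (an assumption, not the patch's implementation), and
fake_kwrite() is a hypothetical stand-in for cr_kwrite(). The point is only
that the buffer is released on both the success and the error path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy header buffer: space must be returned in LIFO order. */
struct cr_ctx {
	char hbuf[256];
	size_t hpos;
};

static void *cr_hbuf_get(struct cr_ctx *ctx, size_t n)
{
	void *p;

	assert(ctx->hpos + n <= sizeof(ctx->hbuf));
	p = ctx->hbuf + ctx->hpos;
	ctx->hpos += n;
	return p;
}

static void cr_hbuf_put(struct cr_ctx *ctx, size_t n)
{
	assert(n <= ctx->hpos);         /* catches an unbalanced put */
	ctx->hpos -= n;
}

/* Hypothetical writer; fail != 0 simulates an I/O error. */
static int fake_kwrite(const void *buf, size_t n, int fail)
{
	(void)buf;
	(void)n;
	return fail ? -5 : 0;           /* -EIO */
}

static int write_fpu_state(struct cr_ctx *ctx, size_t xstate_size, int fail)
{
	void *buf = cr_hbuf_get(ctx, xstate_size);
	int ret;

	memset(buf, 0, xstate_size);    /* stands in for copying xstate */
	ret = fake_kwrite(buf, xstate_size, fail);
	cr_hbuf_put(ctx, xstate_size);  /* released whether or not we failed */
	return ret;
}
```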

> 
>> +}
>> +
>> +#endif    /* CONFIG_X86_64 */

[...]

>> +        /*
>> +         * restore TLS by hand: why convert to struct user_desc if
>> +         * sys_set_thread_entry() will convert it back ?
>> +         */
>> +
>> +        size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
>> +        desc = kmalloc(size, GFP_KERNEL);
>> +        if (!desc)
> 
> cr_hbuf_put() here.

Will fix.

> 
>> +            return -ENOMEM;
>> +
>> +        ret = cr_kread(ctx, desc, size);
>> +        if (ret >= 0) {
> 
> if (ret == 0)

Right.

> 
>> +            /*
>> +             * FIX: add sanity checks (eg. that values makes
>> +             * sense, that we don't overwrite old values, etc
>> +             */
>> +            cpu = get_cpu();
>> +            memcpy(thread->tls_array, desc, size);
>> +            load_TLS(thread, cpu);
>> +            put_cpu();
>> +        }
>> +        kfree(desc);
>> +    }
>> +
>> +    ret = 0;
>> + out:
>> +    cr_hbuf_put(ctx, sizeof(*hh));
>> +    return ret;
>> +}

[...]

Thanks for the review !

Oren.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 05/13] Dump memory address space
       [not found]   ` <1228498282-11804-6-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-12-18  2:26     ` Mike Waychison
  0 siblings, 0 replies; 133+ messages in thread
From: Mike Waychison @ 2008-12-18  2:26 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linus Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Comments below.

Oren Laadan wrote:
> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
> it will be followed by the file name. Then comes the actual contents,
> in one or more chunk: each chunk begins with a header that specifies
> how many pages it holds, then the virtual addresses of all the dumped
> pages in that chunk, followed by the actual contents of all dumped
> pages. A header with zero number of pages marks the end of the contents.
> Then comes the next VMA and so on.
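
The chunk layout described above can be sketched in userspace C as
follows. This is an illustration under stated assumptions, not the patch's
format: the real image wraps everything in 'struct cr_hdr' records, while
here the header is a bare 32-bit page count, vaddrs are 64-bit, and
DEMO_PAGE_SIZE is a tiny made-up page size. All function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16       /* tiny "page" to keep the example small */

struct stream {
	uint8_t buf[1024];
	size_t len;             /* bytes written so far */
	size_t pos;             /* read cursor */
};

static void put(struct stream *s, const void *p, size_t n)
{
	assert(s->len + n <= sizeof(s->buf));
	memcpy(s->buf + s->len, p, n);
	s->len += n;
}

static void get(struct stream *s, void *p, size_t n)
{
	assert(s->pos + n <= s->len);
	memcpy(p, s->buf + s->pos, n);
	s->pos += n;
}

/* One chunk: page count, then all vaddrs, then all page contents. */
static void write_chunk(struct stream *s, const uint64_t *vaddrs,
			const uint8_t *pages, uint32_t nr)
{
	put(s, &nr, sizeof(nr));
	put(s, vaddrs, nr * sizeof(*vaddrs));
	put(s, pages, (size_t)nr * DEMO_PAGE_SIZE);
}

/* A header with zero pages marks the end of this VMA's contents. */
static void write_end(struct stream *s)
{
	uint32_t zero = 0;

	put(s, &zero, sizeof(zero));
}

/* Consume chunks up to the terminator; returns the number of pages read
 * and stores the last vaddr seen in *last. */
static uint32_t read_all(struct stream *s, uint64_t *last)
{
	uint64_t vaddrs[8];
	uint8_t pages[8 * DEMO_PAGE_SIZE];
	uint32_t total = 0, nr;

	for (;;) {
		get(s, &nr, sizeof(nr));
		if (nr == 0)
			break;
		assert(nr <= 8);
		get(s, vaddrs, nr * sizeof(vaddrs[0]));
		get(s, pages, (size_t)nr * DEMO_PAGE_SIZE);
		*last = vaddrs[nr - 1];
		total += nr;
	}
	return total;
}
```

Keeping the vaddrs together ahead of the page data lets the reader fetch
all addresses of a chunk in one read before touching any contents.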
> 
> Changelog[v11]:
>   - Copy contents of 'init->fs->root' instead of pointing to them.
>   - Add missing test for VM_MAYSHARE when dumping memory
> 
> Changelog[v10]:
>   - Acquire dcache_lock around call to __d_path() in cr_fill_name()
> 
> Changelog[v9]:
>   - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup
>   - Test if __d_path() changes mnt/dentry (when crossing filesystem
>     namespace boundary). for now cr_fill_fname() fails the checkpoint.
> 
> Changelog[v7]:
>   - Fix argument given to kunmap_atomic() in memory dump/restore
> 
> Changelog[v6]:
>   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
>     (even though it's not really needed)
> 
> Changelog[v5]:
>   - Improve memory dump code (following Dave Hansen's comments)
>   - Change dump format (and code) to allow chunks of <vaddrs, pages>
>     instead of one long list of each
>   - Fix use of follow_page() to avoid faulting in non-present pages
> 
> Changelog[v4]:
>   - Use standard list_... for cr_pgarr
> 
> Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> ---
>  arch/x86/include/asm/checkpoint_hdr.h |    5 +
>  arch/x86/mm/checkpoint.c              |   31 ++
>  checkpoint/Makefile                   |    3 +-
>  checkpoint/checkpoint.c               |   88 ++++++
>  checkpoint/checkpoint_arch.h          |    2 +
>  checkpoint/checkpoint_mem.h           |   41 +++
>  checkpoint/ckpt_mem.c                 |  503 +++++++++++++++++++++++++++++++++
>  checkpoint/sys.c                      |   10 +
>  include/linux/checkpoint.h            |   12 +
>  include/linux/checkpoint_hdr.h        |   32 ++
>  10 files changed, 726 insertions(+), 1 deletions(-)
>  create mode 100644 checkpoint/checkpoint_mem.h
>  create mode 100644 checkpoint/ckpt_mem.c
> 
> diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
> index 6325062..33f4c70 100644
> --- a/arch/x86/include/asm/checkpoint_hdr.h
> +++ b/arch/x86/include/asm/checkpoint_hdr.h
> @@ -82,4 +82,9 @@ struct cr_hdr_cpu {
>  	/* thread_xstate contents follow (if used_math) */
>  } __attribute__((aligned(8)));
>  
> +struct cr_hdr_mm_context {
> +	__s16 ldt_entry_size;
> +	__s16 nldt;
> +} __attribute__((aligned(8)));
> +
>  #endif /* __ASM_X86_CKPT_HDR__H */
> diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
> index 8dd6d2d..757936e 100644
> --- a/arch/x86/mm/checkpoint.c
> +++ b/arch/x86/mm/checkpoint.c
> @@ -221,3 +221,34 @@ int cr_write_head_arch(struct cr_ctx *ctx)
>  
>  	return ret;
>  }
> +
> +/* dump the mm->context state */
> +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int ret;
> +
> +	h.type = CR_HDR_MM_CONTEXT;
> +	h.len = sizeof(*hh);
> +	h.parent = parent;
> +
> +	mutex_lock(&mm->context.lock);
> +
> +	hh->ldt_entry_size = LDT_ENTRY_SIZE;
> +	hh->nldt = mm->context.size;
> +
> +	cr_debug("nldt %d\n", hh->nldt);
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = cr_kwrite(ctx, mm->context.ldt,
> +			mm->context.size * LDT_ENTRY_SIZE);

Do we really want to emit anything under lock?  I realize that this 
patch goes and does a ton of writes with mmap_sem held for read -- is 
this ok?
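
To make the alternative concrete, here is a rough userspace sketch of
"snapshot under the lock, write after dropping it". Everything in it is an
assumption made for illustration: the 'locked' flag stands in for
mm->context.lock, fake_kwrite() stands in for cr_kwrite(), and the
function name is made up. The cost is an extra copy of the LDT.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LDT_ENTRY_SIZE 8

/* Toy model of mm->context; 'locked' stands in for the mutex. */
struct mm_context {
	int locked;
	unsigned char *ldt;
	int size;                       /* number of LDT entries */
};

static int writes_under_lock;           /* writes issued while locked */

/* Hypothetical writer; the real one may block on I/O. */
static int fake_kwrite(const struct mm_context *c, const void *buf, size_t n)
{
	(void)buf;
	(void)n;
	if (c->locked)
		writes_under_lock++;
	return 0;
}

static int write_ldt_snapshot(struct mm_context *c)
{
	unsigned char *copy = NULL;
	size_t bytes;
	int ret;

	assert(c->size >= 0);

	c->locked = 1;                  /* mutex_lock() */
	bytes = (size_t)c->size * LDT_ENTRY_SIZE;
	if (bytes) {
		copy = malloc(bytes);
		if (copy)
			memcpy(copy, c->ldt, bytes);
	}
	c->locked = 0;                  /* mutex_unlock() */

	if (bytes && !copy)
		return -12;             /* -ENOMEM */

	ret = fake_kwrite(c, copy, bytes);  /* I/O happens unlocked */
	free(copy);
	return ret;
}
```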

> +
> + out:
> +	mutex_unlock(&mm->context.lock);
> +	return ret;
> +}
> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> index d2df68c..3a0df6d 100644
> --- a/checkpoint/Makefile
> +++ b/checkpoint/Makefile
> @@ -2,4 +2,5 @@
>  # Makefile for linux checkpoint/restart.
>  #
>  
> -obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
> +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
> +		ckpt_mem.o
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> index 17cc8d2..56d0ec2 100644
> --- a/checkpoint/checkpoint.c
> +++ b/checkpoint/checkpoint.c
> @@ -13,6 +13,7 @@
>  #include <linux/time.h>
>  #include <linux/fs.h>
>  #include <linux/file.h>
> +#include <linux/fdtable.h>
>  #include <linux/dcache.h>
>  #include <linux/mount.h>
>  #include <linux/utsname.h>
> @@ -75,6 +76,66 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
>  	return cr_write_obj(ctx, &h, str);
>  }
>  
> +/**
> + * cr_fill_fname - return pathname of a given file
> + * @path: path name
> + * @root: relative root
> + * @buf: buffer for pathname
> + * @n: buffer length (in) and pathname length (out)
> + */
> +static char *
> +cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
> +{
> +	struct path tmp = *root;
> +	char *fname;
> +
> +	BUG_ON(!buf);
> +	spin_lock(&dcache_lock);
> +	fname = __d_path(path, &tmp, buf, *n);
> +	spin_unlock(&dcache_lock);
> +	if (!IS_ERR(fname))
> +		*n = (buf + (*n) - fname);
> +	/*
> +	 * FIXME: if __d_path() changed these, it must have stepped out of
> +	 * init's namespace. Since currently we require a unified namespace
> +	 * within the container: simply fail.
> +	 */
> +	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
> +		fname = ERR_PTR(-EBADF);
> +
> +	return fname;
> +}
> +
> +/**
> + * cr_write_fname - write a file name
> + * @ctx: checkpoint context
> + * @path: path name
> + * @root: relative root
> + */
> +int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
> +{
> +	struct cr_hdr h;
> +	char *buf, *fname;
> +	int ret, flen;
> +
> +	flen = PATH_MAX;
> +	buf = kmalloc(flen, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	fname = cr_fill_fname(path, root, buf, &flen);
> +	if (!IS_ERR(fname)) {
> +		h.type = CR_HDR_FNAME;
> +		h.len = flen;
> +		h.parent = 0;
> +		ret = cr_write_obj(ctx, &h, fname);
> +	} else
> +		ret = PTR_ERR(fname);
> +
> +	kfree(buf);
> +	return ret;
> +}
> +
>  /* write the checkpoint header */
>  static int cr_write_head(struct cr_ctx *ctx)
>  {
> @@ -168,6 +229,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>  	cr_debug("task_struct: ret %d\n", ret);
>  	if (ret < 0)
>  		goto out;
> +	ret = cr_write_mm(ctx, t);
> +	cr_debug("memory: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
>  	ret = cr_write_thread(ctx, t);
>  	cr_debug("thread: ret %d\n", ret);
>  	if (ret < 0)
> @@ -178,10 +243,33 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>  	return ret;
>  }
>  
> +static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
> +{
> +	struct fs_struct *fs;
> +
> +	ctx->root_pid = pid;
> +
> +	/*
> +	 * assume checkpointer is in container's root vfs
> +	 * FIXME: this works for now, but will change with real containers
> +	 */
> +
> +	fs = current->fs;
> +	read_lock(&fs->lock);
> +	ctx->fs_mnt = fs->root;
> +	path_get(&ctx->fs_mnt);
> +	read_unlock(&fs->lock);
> +
> +	return 0;

Spurious return value?

> +}
> +
>  int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
>  {
>  	int ret;
>  
> +	ret = cr_ctx_checkpoint(ctx, pid);
> +	if (ret < 0)
> +		goto out;
>  	ret = cr_write_head(ctx);
>  	if (ret < 0)
>  		goto out;
> diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
> index ada1369..f06c7eb 100644
> --- a/checkpoint/checkpoint_arch.h
> +++ b/checkpoint/checkpoint_arch.h
> @@ -3,6 +3,8 @@
>  extern int cr_write_head_arch(struct cr_ctx *ctx);
>  extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
>  extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
> +extern int cr_write_mm_context(struct cr_ctx *ctx,
> +			       struct mm_struct *mm, int parent);
>  
>  extern int cr_read_head_arch(struct cr_ctx *ctx);
>  extern int cr_read_thread(struct cr_ctx *ctx);
> diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
> new file mode 100644
> index 0000000..85546f4
> --- /dev/null
> +++ b/checkpoint/checkpoint_mem.h
> @@ -0,0 +1,41 @@
> +#ifndef _CHECKPOINT_CKPT_MEM_H_
> +#define _CHECKPOINT_CKPT_MEM_H_
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/mm_types.h>
> +
> +/*
> + * page-array chains: each cr_pgarr describes a set of <strcut page *,vaddr>

struct

> + * tuples (where vaddr is the virtual address of a page in a particular mm).
> + * Specifically, we use separate arrays so that all vaddrs can be written
> + * and read at once.
> + */
> +
> +struct cr_pgarr {
> +	unsigned long *vaddrs;
> +	struct page **pages;
> +	unsigned int nr_used;
> +	struct list_head list;
> +};
> +
> +#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
> +#define CR_PGARR_CHUNK  (4 * CR_PGARR_TOTAL)
> +
> +extern void cr_pgarr_free(struct cr_ctx *ctx);
> +extern struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx);
> +extern void cr_pgarr_reset_all(struct cr_ctx *ctx);
> +
> +static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
> +{
> +	return (pgarr->nr_used == CR_PGARR_TOTAL);
> +}
> +
> +#endif /* _CHECKPOINT_CKPT_MEM_H_ */
> diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
> new file mode 100644
> index 0000000..a2fcdbf
> --- /dev/null
> +++ b/checkpoint/ckpt_mem.c
> @@ -0,0 +1,503 @@
> +/*
> + *  Checkpoint memory contents
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/pagemap.h>
> +#include <linux/mm_types.h>
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
> +#include "checkpoint_arch.h"
> +#include "checkpoint_mem.h"
> +
> +/*
> + * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
> + * (common to ckpt_mem.c and rstr_mem.c).
> + *
> + * The checkpoint context structure has two members for page-arrays:
> + *   ctx->pgarr_list: list head of the page-array chain

What's the second member?

> + *
> + * During checkpoint (and restart) the chain tracks the dirty pages (page
> + * pointer and virtual address) of each MM. For a particular MM, these are
> + * always added to the head of the page-array chain (ctx->pgarr_list).
> + * This "current" page-array advances as necessary, and new page-array
> + * descriptors are allocated on-demand. Before the next chunk of pages,
> + * the chain is reset but not freed (that is, dereference page pointers).
> + */
> +
> +/* return first page-array in the chain */
> +static inline struct cr_pgarr *cr_pgarr_first(struct cr_ctx *ctx)
> +{
> +	if (list_empty(&ctx->pgarr_list))
> +		return NULL;
> +	return list_first_entry(&ctx->pgarr_list, struct cr_pgarr, list);
> +}
> +
> +/* release pages referenced by a page-array */
> +static void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
> +{
> +	int i;
> +
> +	cr_debug("nr_used %d\n", pgarr->nr_used);
> +	/*
> +	 * although both checkpoint and restart use 'nr_used', we only
> +	 * collect pages during checkpoint; in restart we simply return
> +	 */
> +	if (!pgarr->pages)
> +		return;
> +	for (i = pgarr->nr_used; i--; /**/)
> +		page_cache_release(pgarr->pages[i]);

This is sorta hard to read (and non-intuitive).  Is it easier to do:

	for (i = 0; i < pgarr->nr_used; i++)
		page_cache_release(pgarr->pages[i]);

It shouldn't matter what order you release the pages in.

> +}
> +
> +/* free a single page-array object */
> +static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
> +{
> +	cr_pgarr_release_pages(pgarr);
> +	kfree(pgarr->pages);
> +	kfree(pgarr->vaddrs);
> +	kfree(pgarr);
> +}
> +
> +/* free a chain of page-arrays */
> +void cr_pgarr_free(struct cr_ctx *ctx)
> +{
> +	struct cr_pgarr *pgarr, *tmp;
> +
> +	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
> +		list_del(&pgarr->list);
> +		cr_pgarr_free_one(pgarr);
> +	}
> +}
> +
> +/* allocate a single page-array object */
> +static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
> +{
> +	struct cr_pgarr *pgarr;
> +
> +	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
> +	if (!pgarr)
> +		return NULL;
> +
> +	pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),

You used PAGE_SIZE / sizeof(void *) above.  Why not __get_free_page()?

> +				GFP_KERNEL);
> +	if (!pgarr->vaddrs)
> +		goto nomem;
> +
> +	/* pgarr->pages is needed only for checkpoint */
> +	if (flags & CR_CTX_CKPT) {
> +		pgarr->pages = kmalloc(CR_PGARR_TOTAL * sizeof(struct page *),
> +				       GFP_KERNEL);
> +		if (!pgarr->pages)
> +			goto nomem;
> +	}
> +
> +	return pgarr;
> +
> + nomem:
> +	cr_pgarr_free_one(pgarr);
> +	return NULL;
> +}
> +
> +/* cr_pgarr_current - return the next available page-array in the chain
> + * @ctx: checkpoint context
> + *
> + * Returns the first page-array in the list that has space. Extends the
> + * list if none has space.
> + */
> +struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx)
> +{
> +	struct cr_pgarr *pgarr;
> +
> +	pgarr = cr_pgarr_first(ctx);
> +	if (pgarr && !cr_pgarr_is_full(pgarr))
> +		goto out;
> +	pgarr = cr_pgarr_alloc_one(ctx->flags);
> +	if (!pgarr)
> +		goto out;
> +	list_add(&pgarr->list, &ctx->pgarr_list);
> + out:
> +	return pgarr;
> +}
> +
> +/* reset the page-array chain (dropping page references if necessary) */
> +void cr_pgarr_reset_all(struct cr_ctx *ctx)
> +{
> +	struct cr_pgarr *pgarr;
> +
> +	list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
> +		cr_pgarr_release_pages(pgarr);
> +		pgarr->nr_used = 0;
> +	}

This doesn't look right.  cr_pgarr_current only ever looks at the head 
of the list, so resetting a list with > 1 pgarr on it will mean the 
non-head elements in the list will go to waste.
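
To illustrate the concern, here is a small user-space model (not kernel
code; every demo_* name and the tiny DEMO_PGARR_TOTAL are made up for
this sketch).  It contrasts a head-only lookup, as cr_pgarr_current()
does today, with one possible alternative that scans the chain for any
array with room:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PGARR_TOTAL 4	/* slots per page-array (tiny for the demo) */

struct demo_pgarr {
	unsigned int nr_used;
	struct demo_pgarr *next;
};

/* As in the patch: only the head of the chain is ever considered. */
static struct demo_pgarr *demo_current_head_only(struct demo_pgarr *head)
{
	if (head && head->nr_used < DEMO_PGARR_TOTAL)
		return head;
	return NULL;	/* the caller would allocate a new array here */
}

/* Hypothetical alternative: reuse any array that still has room. */
static struct demo_pgarr *demo_current_scan(struct demo_pgarr *head)
{
	struct demo_pgarr *p;

	for (p = head; p; p = p->next)
		if (p->nr_used < DEMO_PGARR_TOTAL)
			return p;
	return NULL;	/* every array is full: the caller allocates */
}

/* Mirror of cr_pgarr_reset_all(): mark every array in the chain empty. */
static void demo_reset_all(struct demo_pgarr *head)
{
	struct demo_pgarr *p;

	for (p = head; p; p = p->next)
		p->nr_used = 0;
}
```

With two arrays on the chain, a reset followed by refilling the head
leaves the second array empty, yet the head-only lookup still asks for
a fresh allocation.  The scan is linear in the chain length; keeping a
cursor to the array currently being filled would avoid that, at the
cost of one more field in the ctx.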

> +}
> +
> +/*
> + * Checkpoint is outside the context of the checkpointee, so one cannot
> + * simply read pages from user-space. Instead, we scan the address space
> + * of the target to cherry-pick pages of interest. Selected pages are
> + * enlisted in a page-array chain (attached to the checkpoint context).
> + * To save their contents, each page is mapped to kernel memory and then
> + * dumped to the file descriptor.
> + */
> +
> +
> +/**
> + * cr_private_follow_page - return page pointer for dirty pages
> + * @vma - target vma
> + * @addr - page address
> + *
> + * Looks up the page that correspond to the address in the vma, and
> + * returns the page if it was modified (and grabs a reference to it),
> + * or otherwise returns NULL (or error).
> + *
> + * This function should _only_ called for private vma's.
> + */
> +static struct page *
> +cr_private_follow_page(struct vm_area_struct *vma, unsigned long addr)

s/cr_private_follow_page/cr_follow_page_private/ ?

Maybe even cr_dump_private_page?  The fact that it's following the page
tables down to the page is an implementation artifact and isn't really
relevant to the semantics you want to express.

> +{
> +	struct page *page;
> +
> +	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
> +

This BUG_ON shouldn't be needed if it's already done in 
cr_private_vma_fill_pgarr.

> +	/*
> +	 * simplified version of get_user_pages(): already have vma,
> +	 * only need FOLL_ANON, and (for now) ignore fault stats.
> +	 *
> +	 * follow_page() will return NULL if the page is not present
> +	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
> +	 * the actual page pointer otherwise.
> +	 *
> +	 * FIXME: consolidate with get_user_pages()
> +	 */
> +
> +	cond_resched();
> +	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
> +		int ret;
> +
> +		/* the page is swapped out - bring it in (optimize ?) */
> +		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
> +		if (ret & VM_FAULT_ERROR) {
> +			if (ret & VM_FAULT_OOM)
> +				return ERR_PTR(-ENOMEM);
> +			else if (ret & VM_FAULT_SIGBUS)
> +				return ERR_PTR(-EFAULT);
> +			else
> +				BUG();
> +			break;
> +		}
> +		cond_resched();
> +	}
> +
> +	if (IS_ERR(page))
> +		return page;
> +
> +	/*
> +	 * We only care about dirty pages: either non-zero page, or
> +	 * file-backed (copy-on-write) that were touched. For the latter,
> +	 * the page_mapping() will be unset because it will no longer be
> +	 * mapped to the original file  after having been modified.
> +	 */
> +	if (page == ZERO_PAGE(0)) {
> +		/* this is the zero page: ignore */
> +		page_cache_release(page);
> +		page = NULL;
> +	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
> +		/* file backed clean cow: ignore */

Probably better to describe 'why' it can be ignored here.


> +		page_cache_release(page);
> +		page = NULL;
> +	}
> +
> +	return page;
> +}
> +
> +/**
> + * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
> + * @ctx - checkpoint context
> + * @pgarr - page-array to fill
> + * @vma - vma to scan
> + * @start - start address (updated)
> + *
> + * Returns the number of pages collected
> + */
> +static int
> +cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
> +			  struct vm_area_struct *vma, unsigned long *start)

This is sorta nasty because you shouldn't need to call into this routine 
with a container.  It should be able to enqueue the (vaddr, page) tuple 
directly on the ctx.  Doing so would also abstract out the pgarr 
management at this level and make the code a lot simpler.

> +{
> +	unsigned long end = vma->vm_end;
> +	unsigned long addr = *start;
> +	int orig_used = pgarr->nr_used;
> +
> +	/* this function is only for private memory (anon or file-mapped) */
> +	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
> +
> +	while (addr < end) {
> +		struct page *page;
> +
> +		page = cr_private_follow_page(vma, addr);
> +		if (IS_ERR(page))
> +			return PTR_ERR(page);
> +
> +		if (page) {
> +			pgarr->pages[pgarr->nr_used] = page;
> +			pgarr->vaddrs[pgarr->nr_used] = addr;
> +			pgarr->nr_used++;

Should be something like:

ret = cr_ctx_append_page(ctx, addr, page);
if (ret < 0)
   goto out;
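
A rough user-space sketch of what such a helper might look like, only
to show the shape of the interface.  cr_ctx_append_page() does not
exist in the patch; the demo_* types below, the fixed-size arrays and
the calloc()-based allocation are all assumptions of this model:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DEMO_PGARR_TOTAL 2	/* slots per page-array (tiny for the demo) */

struct demo_pgarr {
	unsigned long vaddrs[DEMO_PGARR_TOTAL];
	void *pages[DEMO_PGARR_TOTAL];
	unsigned int nr_used;
	struct demo_pgarr *next;
};

struct demo_ctx {
	struct demo_pgarr *pgarr_head;	/* newest array first */
	unsigned long nr_pages;		/* total tuples queued so far */
};

/* Queue one (vaddr, page) tuple; grow the chain when the head is full. */
static int demo_ctx_append_page(struct demo_ctx *ctx, unsigned long vaddr,
				void *page)
{
	struct demo_pgarr *pgarr = ctx->pgarr_head;

	if (!pgarr || pgarr->nr_used == DEMO_PGARR_TOTAL) {
		pgarr = calloc(1, sizeof(*pgarr));
		if (!pgarr)
			return -ENOMEM;
		pgarr->next = ctx->pgarr_head;
		ctx->pgarr_head = pgarr;
	}
	pgarr->vaddrs[pgarr->nr_used] = vaddr;
	pgarr->pages[pgarr->nr_used] = page;
	pgarr->nr_used++;
	ctx->nr_pages++;
	return 0;
}

/* Drop the whole chain. */
static void demo_ctx_free(struct demo_ctx *ctx)
{
	while (ctx->pgarr_head) {
		struct demo_pgarr *next = ctx->pgarr_head->next;

		free(ctx->pgarr_head);
		ctx->pgarr_head = next;
	}
	ctx->nr_pages = 0;
}
```

The scanning loop then only needs the ctx and never sees a pgarr, and
the "is the array full" check moves out of the caller.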

> +		}
> +
> +		addr += PAGE_SIZE;
> +
> +		if (cr_pgarr_is_full(pgarr))
> +			break;
> +	}
> +
> +	*start = addr;
> +	return pgarr->nr_used - orig_used;
> +}
> +
> +/* dump contents of a pages: use kmap_atomic() to avoid TLB flush */
> +static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
> +{
> +	void *ptr;
> +
> +	ptr = kmap_atomic(page, KM_USER1);
> +	memcpy(buf, ptr, PAGE_SIZE);
> +	kunmap_atomic(ptr, KM_USER1);
> +
> +	return cr_kwrite(ctx, buf, PAGE_SIZE);
> +}
> +
> +/**
> + * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
> + * @ctx - checkpoint context
> + * @total - total number of pages
> + *
> + * First dump all virtual addresses, followed by the contents of all pages
> + */
> +static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
> +{
> +	struct cr_pgarr *pgarr;
> +	char *buf;
> +	int i, ret = 0;
> +
> +	if (!total)
> +		return 0;
> +
> +	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
> +		ret = cr_kwrite(ctx, pgarr->vaddrs,
> +				pgarr->nr_used * sizeof(*pgarr->vaddrs));
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);

__get_free_page()

> +	if (!buf)
> +		return -ENOMEM;
> +
> +	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
> +		for (i = 0; i < pgarr->nr_used; i++) {
> +			ret = cr_page_write(ctx, pgarr->pages[i], buf);
> +			if (ret < 0)
> +				goto out;
> +		}
> +	}
> +
> + out:
> +	kfree(buf);
> +	return ret;
> +}
> +
> +/**
> + * cr_write_private_vma_contents - dump contents of a VMA with private memory
> + * @ctx - checkpoint context
> + * @vma - vma to scan
> + *
> + * Collect lists of pages that needs to be dumped, and corresponding
> + * virtual addresses into ctx->pgarr_list page-array chain. Then dump
> + * the addresses, followed by the page contents.
> + */
> +static int
> +cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_pgarr *hh;
> +	unsigned long addr = vma->vm_start;
> +	struct cr_pgarr *pgarr;
> +	unsigned long cnt = 0;
> +	int ret;
> +
> +	/*
> +	 * Work iteratively, collecting and dumping at most CR_PGARR_CHUNK
> +	 * in each round. Each iterations is divided into two steps:
> +	 *
> +	 * (1) scan: scan through the PTEs of the vma to collect the pages
> +	 * to dump (later we'll also make them COW), while keeping a list
> +	 * of pages and their corresponding addresses on ctx->pgarr_list.
> +	 *
> +	 * (2) dump: write out a header specifying how many pages, followed
> +	 * by the addresses of all pages in ctx->pgarr_list, followed by
> +	 * the actual contents of all pages. (Then, release the references
> +	 * to the pages and reset the page-array chain).
> +	 *
> +	 * (This split makes the logic simpler by first counting the pages
> +	 * that need saving. More importantly, it allows for a future
> +	 * optimization that will reduce application downtime by deferring
> +	 * the actual write-out of the data to after the application is
> +	 * allowed to resume execution).
> +	 *
> +	 * After dumpting the entire contents, conclude with a header that
> +	 * specifies 0 pages to mark the end of the contents.
> +	 */
> +
> +	h.type = CR_HDR_PGARR;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	while (addr < vma->vm_end) {
> +		pgarr = cr_pgarr_current(ctx);
> +		if (!pgarr)
> +			return -ENOMEM;
> +		ret = cr_private_vma_fill_pgarr(ctx, pgarr, vma, &addr);
> +		if (ret < 0)
> +			return ret;
> +		cnt += ret;
> +
> +		/* did we complete a chunk, or is this the last chunk ? */
> +		if (cnt >= CR_PGARR_CHUNK || (cnt && addr == vma->vm_end)) {
> +			hh = cr_hbuf_get(ctx, sizeof(*hh));
> +			hh->nr_pages = cnt;
> +			ret = cr_write_obj(ctx, &h, hh);
> +			cr_hbuf_put(ctx, sizeof(*hh));
> +			if (ret < 0)
> +				return ret;
> +
> +			ret = cr_vma_dump_pages(ctx, cnt);
> +			if (ret < 0)
> +				return ret;
> +
> +			cr_pgarr_reset_all(ctx);
> +		}
> +	}
> +
> +	/* mark end of contents with header saying "0" pages */
> +	hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	hh->nr_pages = 0;
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +
> +	return ret;
> +}
> +
> +static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int vma_type, ret;
> +
> +	h.type = CR_HDR_VMA;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	hh->vm_start = vma->vm_start;
> +	hh->vm_end = vma->vm_end;
> +	hh->vm_page_prot = vma->vm_page_prot.pgprot;
> +	hh->vm_flags = vma->vm_flags;
> +	hh->vm_pgoff = vma->vm_pgoff;
> +
> +#define CR_BAD_VM_FLAGS  \
> +	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
> +
> +	if (vma->vm_flags & CR_BAD_VM_FLAGS) {
> +		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
> +		cr_hbuf_put(ctx, sizeof(*hh));
> +		return -ENOSYS;
> +	}
> +

The following code should be broken into its own function?  Handling of
other types of memory will follow and will clutter this guy up.

> +	/* by default assume anon memory */
> +	vma_type = CR_VMA_ANON;
> +
> +	/*
> +	 * if there is a backing file, assume private-mapped

Shouldn't need to assume anything as you checked for VM_MAYSHARE and 
VM_SHARED above.

> +	 * (FIXME: check if the file is unlinked)
> +	 */
> +	if (vma->vm_file)
> +		vma_type = CR_VMA_FILE;
> +
> +	hh->vma_type = vma_type;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		return ret;
> +
> +	/* save the file name, if relevant */

s/, if relevant//

> +	if (vma->vm_file) {
> +		ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);

Why is this using a filename, rather than a reference to a file? 
Shouldn't this use the logic in patch 8/13?

> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	return cr_write_private_vma_contents(ctx, vma);
> +}
> +
> +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	int objref, ret;
> +
> +	h.type = CR_HDR_MM;
> +	h.len = sizeof(*hh);
> +	h.parent = task_pid_vnr(t);
> +
> +	mm = get_task_mm(t);
> +
> +	objref = 0;	/* will be meaningful with multiple processes */
> +	hh->objref = objref;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	hh->start_code = mm->start_code;
> +	hh->end_code = mm->end_code;
> +	hh->start_data = mm->start_data;
> +	hh->end_data = mm->end_data;
> +	hh->start_brk = mm->start_brk;
> +	hh->brk = mm->brk;
> +	hh->start_stack = mm->start_stack;
> +	hh->arg_start = mm->arg_start;
> +	hh->arg_end = mm->arg_end;
> +	hh->env_start = mm->env_start;
> +	hh->env_end = mm->env_end;
> +
> +	hh->map_count = mm->map_count;
> +
> +	/* FIX: need also mm->flags */
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		goto out;
> +
> +	/* write the vma's */
> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +		ret = cr_write_vma(ctx, vma);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +	ret = cr_write_mm_context(ctx, mm, objref);
> +
> + out:
> +	up_read(&mm->mmap_sem);
> +	mmput(mm);
> +	return ret;
> +}
> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> index bd14ef9..c547a1c 100644
> --- a/checkpoint/sys.c
> +++ b/checkpoint/sys.c
> @@ -16,6 +16,8 @@
>  #include <linux/capability.h>
>  #include <linux/checkpoint.h>
>  
> +#include "checkpoint_mem.h"
> +
>  /*
>   * Helpers to write(read) from(to) kernel space to(from) the checkpoint
>   * image file descriptor (similar to how a core-dump is performed).
> @@ -131,7 +133,13 @@ static void cr_ctx_free(struct cr_ctx *ctx)
>  {
>  	if (ctx->file)
>  		fput(ctx->file);
> +
>  	kfree(ctx->hbuf);
> +
> +	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
> +
> +	cr_pgarr_free(ctx);
> +
>  	kfree(ctx);
>  }
>  
> @@ -146,6 +154,8 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
>  
>  	ctx->flags = flags;
>  
> +	INIT_LIST_HEAD(&ctx->pgarr_list);
> +
>  	err = -EBADF;
>  	ctx->file = fget(fd);
>  	if (!ctx->file)
> diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> index 63f298f..4e97f9f 100644
> --- a/include/linux/checkpoint.h
> +++ b/include/linux/checkpoint.h
> @@ -10,6 +10,9 @@
>   *  distribution for more details.
>   */
>  
> +#include <linux/path.h>
> +#include <linux/fs.h>
> +
>  #define CR_VERSION  1
>  
>  struct cr_ctx {
> @@ -25,6 +28,10 @@ struct cr_ctx {
>  
>  	void *hbuf;		/* temporary buffer for headers */
>  	int hpos;		/* position in headers buffer */
> +
> +	struct list_head pgarr_list;	/* page array to dump VMA contents */
> +
> +	struct path fs_mnt;	/* container root (FIXME) */
>  };
>  
>  /* cr_ctx: flags */
> @@ -42,6 +49,8 @@ struct cr_hdr;
>  extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
>  extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
>  extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
> +extern int cr_write_fname(struct cr_ctx *ctx,
> +			  struct path *path, struct path *root);
>  
>  extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
>  extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
> @@ -50,7 +59,10 @@ extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
>  extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
>  
>  extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
> +extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
> +
>  extern int do_restart(struct cr_ctx *ctx, pid_t pid);
> +extern int cr_read_mm(struct cr_ctx *ctx);
>  
>  #define cr_debug(fmt, args...)  \
>  	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index b74b5f9..d78f0f1 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -34,6 +34,7 @@ enum {
>  	CR_HDR_HEAD_ARCH,
>  	CR_HDR_BUFFER,
>  	CR_HDR_STRING,
> +	CR_HDR_FNAME,
>  
>  	CR_HDR_TASK = 101,
>  	CR_HDR_THREAD,
> @@ -41,6 +42,7 @@ enum {
>  
>  	CR_HDR_MM = 201,
>  	CR_HDR_VMA,
> +	CR_HDR_PGARR,
>  	CR_HDR_MM_CONTEXT,
>  
>  	CR_HDR_TAIL = 5001
> @@ -75,4 +77,34 @@ struct cr_hdr_task {
>  	__s32 task_comm_len;
>  } __attribute__((aligned(8)));
>  
> +struct cr_hdr_mm {
> +	__u32 objref;		/* identifier for shared objects */
> +	__u32 map_count;
> +
> +	__u64 start_code, end_code, start_data, end_data;
> +	__u64 start_brk, brk, start_stack;
> +	__u64 arg_start, arg_end, env_start, env_end;
> +} __attribute__((aligned(8)));
> +
> +/* vma subtypes */
> +enum vm_type {
> +	CR_VMA_ANON = 1,
> +	CR_VMA_FILE

We need to figure out what MAP_SHARED | MAP_ANONYMOUS should be exposed
as in this setup (much in the same way we need to start defining what
shm mappings look like).  Internally, they are 'file-backed', but to
userland, they aren't.

Thoughts?

> +};
> +
> +struct cr_hdr_vma {
> +	__u32 vma_type;
> +	__u32 _padding;

Why padding?
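
One plausible answer, which is a guess and not something the patch
states: inside a struct, a 64-bit member is 4-byte aligned on i386 but
8-byte aligned on x86_64, so without the explicit pad word the offset
of vm_start would depend on the ABI.  The user-space sketch below shows
the two layouts; the demo_* names are invented here and only the first
few fields of cr_hdr_vma are mirrored:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the first fields of cr_hdr_vma, with the explicit pad word. */
struct demo_hdr_vma_padded {
	uint32_t vma_type;
	uint32_t _padding;	/* fills the hole the compiler might add */
	uint64_t vm_start;
	uint64_t vm_end;
};

/* Same fields without the pad: the hole is left to the ABI. */
struct demo_hdr_vma_unpadded {
	uint32_t vma_type;
	uint64_t vm_start;	/* offset 4 on i386, 8 on x86_64 */
	uint64_t vm_end;
};

/* Offset of vm_start when the pad word is spelled out. */
static size_t demo_padded_vm_start_offset(void)
{
	return offsetof(struct demo_hdr_vma_padded, vm_start);
}

/* Offset of vm_start when the compiler decides. */
static size_t demo_unpadded_vm_start_offset(void)
{
	return offsetof(struct demo_hdr_vma_unpadded, vm_start);
}
```

Note that __attribute__((aligned(8))) on the struct raises the
alignment of the struct as a whole; it does not move the members
inside it.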

> +
> +	__u64 vm_start;
> +	__u64 vm_end;
> +	__u64 vm_page_prot;
> +	__u64 vm_flags;
> +	__u64 vm_pgoff;
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_pgarr {
> +	__u64 nr_pages;		/* number of pages to saved */
> +} __attribute__((aligned(8)));
> +
>  #endif /* _CHECKPOINT_CKPT_HDR_H_ */


* Re: [RFC v11][PATCH 05/13] Dump memory address space
  2008-12-05 17:31   ` Oren Laadan
@ 2008-12-18  2:26     ` Mike Waychison
  -1 siblings, 0 replies; 133+ messages in thread
From: Mike Waychison @ 2008-12-18  2:26 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linux Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Comments below.

Oren Laadan wrote:
> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
> it will be followed by the file name. Then comes the actual contents,
> in one or more chunk: each chunk begins with a header that specifies
> how many pages it holds, then the virtual addresses of all the dumped
> pages in that chunk, followed by the actual contents of all dumped
> pages. A header with zero number of pages marks the end of the contents.
> Then comes the next VMA and so on.
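
To make the per-VMA layout described above concrete, here is a minimal
user-space sketch that writes chunks into a memory buffer and walks
them back.  It is a simplified model: the demo_* names and the 16-byte
"page" are invented for the example, and the generic cr_hdr that the
patch writes in front of each cr_hdr_pgarr is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16	/* tiny "page" so the example stays small */

/* Append one chunk: page count, then all addresses, then all contents. */
static size_t demo_put_chunk(uint8_t *out, const uint64_t *vaddrs,
			     const uint8_t *pages, uint64_t nr_pages)
{
	size_t off = 0;

	memcpy(out + off, &nr_pages, sizeof(nr_pages));
	off += sizeof(nr_pages);
	memcpy(out + off, vaddrs, nr_pages * sizeof(*vaddrs));
	off += nr_pages * sizeof(*vaddrs);
	memcpy(out + off, pages, nr_pages * DEMO_PAGE_SIZE);
	off += nr_pages * DEMO_PAGE_SIZE;
	return off;
}

/* Walk the chunks of one VMA; return pages seen, report bytes consumed. */
static uint64_t demo_count_pages(const uint8_t *in, size_t *consumed)
{
	uint64_t total = 0, nr;
	size_t off = 0;

	for (;;) {
		memcpy(&nr, in + off, sizeof(nr));
		off += sizeof(nr);
		if (nr == 0)
			break;	/* a zero-page header ends the contents */
		off += nr * (sizeof(uint64_t) + DEMO_PAGE_SIZE);
		total += nr;
	}
	*consumed = off;
	return total;
}
```

Keeping the addresses of a chunk together, ahead of the page data, is
what lets the reader fetch them in a single read before it touches any
page contents.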
> 
> Changelog[v11]:
>   - Copy contents of 'init->fs->root' instead of pointing to them.
>   - Add missing test for VM_MAYSHARE when dumping memory
> 
> Changelog[v10]:
>   - Acquire dcache_lock around call to __d_path() in cr_fill_name()
> 
> Changelog[v9]:
>   - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup
>   - Test if __d_path() changes mnt/dentry (when crossing filesystem
>     namespace boundary). for now cr_fill_fname() fails the checkpoint.
> 
> Changelog[v7]:
>   - Fix argument given to kunmap_atomic() in memory dump/restore
> 
> Changelog[v6]:
>   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
>     (even though it's not really needed)
> 
> Changelog[v5]:
>   - Improve memory dump code (following Dave Hansen's comments)
>   - Change dump format (and code) to allow chunks of <vaddrs, pages>
>     instead of one long list of each
>   - Fix use of follow_page() to avoid faulting in non-present pages
> 
> Changelog[v4]:
>   - Use standard list_... for cr_pgarr
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> Acked-by: Serge Hallyn <serue@us.ibm.com>
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/checkpoint_hdr.h |    5 +
>  arch/x86/mm/checkpoint.c              |   31 ++
>  checkpoint/Makefile                   |    3 +-
>  checkpoint/checkpoint.c               |   88 ++++++
>  checkpoint/checkpoint_arch.h          |    2 +
>  checkpoint/checkpoint_mem.h           |   41 +++
>  checkpoint/ckpt_mem.c                 |  503 +++++++++++++++++++++++++++++++++
>  checkpoint/sys.c                      |   10 +
>  include/linux/checkpoint.h            |   12 +
>  include/linux/checkpoint_hdr.h        |   32 ++
>  10 files changed, 726 insertions(+), 1 deletions(-)
>  create mode 100644 checkpoint/checkpoint_mem.h
>  create mode 100644 checkpoint/ckpt_mem.c
> 
> diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
> index 6325062..33f4c70 100644
> --- a/arch/x86/include/asm/checkpoint_hdr.h
> +++ b/arch/x86/include/asm/checkpoint_hdr.h
> @@ -82,4 +82,9 @@ struct cr_hdr_cpu {
>  	/* thread_xstate contents follow (if used_math) */
>  } __attribute__((aligned(8)));
>  
> +struct cr_hdr_mm_context {
> +	__s16 ldt_entry_size;
> +	__s16 nldt;
> +} __attribute__((aligned(8)));
> +
>  #endif /* __ASM_X86_CKPT_HDR__H */
> diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
> index 8dd6d2d..757936e 100644
> --- a/arch/x86/mm/checkpoint.c
> +++ b/arch/x86/mm/checkpoint.c
> @@ -221,3 +221,34 @@ int cr_write_head_arch(struct cr_ctx *ctx)
>  
>  	return ret;
>  }
> +
> +/* dump the mm->context state */
> +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int ret;
> +
> +	h.type = CR_HDR_MM_CONTEXT;
> +	h.len = sizeof(*hh);
> +	h.parent = parent;
> +
> +	mutex_lock(&mm->context.lock);
> +
> +	hh->ldt_entry_size = LDT_ENTRY_SIZE;
> +	hh->nldt = mm->context.size;
> +
> +	cr_debug("nldt %d\n", hh->nldt);
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = cr_kwrite(ctx, mm->context.ldt,
> +			mm->context.size * LDT_ENTRY_SIZE);

Do we really want to emit anything under lock?  I realize that this 
patch goes and does a ton of writes with mmap_sem held for read -- is 
this ok?
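
If holding the lock across the write turns out to be a problem, one
option is to snapshot the data under the lock and emit it afterwards.
The sketch below is a user-space model of that pattern only: the int
flag stands in for mm->context.lock, demo_write() stands in for
cr_kwrite(), and all demo_* names are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct demo_mm {
	int locked;		/* stands in for mm->context.lock */
	unsigned char *ldt;
	size_t size;		/* LDT size in bytes */
};

static unsigned char demo_sink[64];	/* stands in for the image file */
static size_t demo_sink_len;
static int demo_lock_held_during_write = -1;

/* Stand-in for cr_kwrite(): records whether the lock was held. */
static int demo_write(const struct demo_mm *mm, const void *buf, size_t len)
{
	if (len > sizeof(demo_sink))
		return -EINVAL;
	demo_lock_held_during_write = mm->locked;
	memcpy(demo_sink, buf, len);
	demo_sink_len = len;
	return 0;
}

/* Copy the LDT while locked, then write the copy with the lock dropped. */
static int demo_dump_ldt(struct demo_mm *mm)
{
	unsigned char *copy;
	size_t len;
	int ret;

	mm->locked = 1;			/* mutex_lock() in the kernel */
	len = mm->size;
	copy = malloc(len ? len : 1);
	if (copy && len)
		memcpy(copy, mm->ldt, len);
	mm->locked = 0;			/* mutex_unlock() */
	if (!copy)
		return -ENOMEM;

	ret = demo_write(mm, copy, len);	/* I/O with the lock dropped */
	free(copy);
	return ret;
}
```

The cost is an extra allocation and copy per dump, which is cheap for
an LDT but would need more thought for the page contents written under
mmap_sem.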

> +
> + out:
> +	mutex_unlock(&mm->context.lock);
> +	return ret;
> +}
> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> index d2df68c..3a0df6d 100644
> --- a/checkpoint/Makefile
> +++ b/checkpoint/Makefile
> @@ -2,4 +2,5 @@
>  # Makefile for linux checkpoint/restart.
>  #
>  
> -obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
> +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
> +		ckpt_mem.o
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> index 17cc8d2..56d0ec2 100644
> --- a/checkpoint/checkpoint.c
> +++ b/checkpoint/checkpoint.c
> @@ -13,6 +13,7 @@
>  #include <linux/time.h>
>  #include <linux/fs.h>
>  #include <linux/file.h>
> +#include <linux/fdtable.h>
>  #include <linux/dcache.h>
>  #include <linux/mount.h>
>  #include <linux/utsname.h>
> @@ -75,6 +76,66 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
>  	return cr_write_obj(ctx, &h, str);
>  }
>  
> +/**
> + * cr_fill_fname - return pathname of a given file
> + * @path: path name
> + * @root: relative root
> + * @buf: buffer for pathname
> + * @n: buffer length (in) and pathname length (out)
> + */
> +static char *
> +cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
> +{
> +	struct path tmp = *root;
> +	char *fname;
> +
> +	BUG_ON(!buf);
> +	spin_lock(&dcache_lock);
> +	fname = __d_path(path, &tmp, buf, *n);
> +	spin_unlock(&dcache_lock);
> +	if (!IS_ERR(fname))
> +		*n = (buf + (*n) - fname);
> +	/*
> +	 * FIXME: if __d_path() changed these, it must have stepped out of
> +	 * init's namespace. Since currently we require a unified namespace
> +	 * within the container: simply fail.
> +	 */
> +	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
> +		fname = ERR_PTR(-EBADF);
> +
> +	return fname;
> +}
> +
> +/**
> + * cr_write_fname - write a file name
> + * @ctx: checkpoint context
> + * @path: path name
> + * @root: relative root
> + */
> +int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
> +{
> +	struct cr_hdr h;
> +	char *buf, *fname;
> +	int ret, flen;
> +
> +	flen = PATH_MAX;
> +	buf = kmalloc(flen, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	fname = cr_fill_fname(path, root, buf, &flen);
> +	if (!IS_ERR(fname)) {
> +		h.type = CR_HDR_FNAME;
> +		h.len = flen;
> +		h.parent = 0;
> +		ret = cr_write_obj(ctx, &h, fname);
> +	} else
> +		ret = PTR_ERR(fname);
> +
> +	kfree(buf);
> +	return ret;
> +}
> +
>  /* write the checkpoint header */
>  static int cr_write_head(struct cr_ctx *ctx)
>  {
> @@ -168,6 +229,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>  	cr_debug("task_struct: ret %d\n", ret);
>  	if (ret < 0)
>  		goto out;
> +	ret = cr_write_mm(ctx, t);
> +	cr_debug("memory: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
>  	ret = cr_write_thread(ctx, t);
>  	cr_debug("thread: ret %d\n", ret);
>  	if (ret < 0)
> @@ -178,10 +243,33 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>  	return ret;
>  }
>  
> +static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
> +{
> +	struct fs_struct *fs;
> +
> +	ctx->root_pid = pid;
> +
> +	/*
> +	 * assume checkpointer is in container's root vfs
> +	 * FIXME: this works for now, but will change with real containers
> +	 */
> +
> +	fs = current->fs;
> +	read_lock(&fs->lock);
> +	ctx->fs_mnt = fs->root;
> +	path_get(&ctx->fs_mnt);
> +	read_unlock(&fs->lock);
> +
> +	return 0;

Spurious return value?

> +}
> +
>  int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
>  {
>  	int ret;
>  
> +	ret = cr_ctx_checkpoint(ctx, pid);
> +	if (ret < 0)
> +		goto out;
>  	ret = cr_write_head(ctx);
>  	if (ret < 0)
>  		goto out;
> diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
> index ada1369..f06c7eb 100644
> --- a/checkpoint/checkpoint_arch.h
> +++ b/checkpoint/checkpoint_arch.h
> @@ -3,6 +3,8 @@
>  extern int cr_write_head_arch(struct cr_ctx *ctx);
>  extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
>  extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
> +extern int cr_write_mm_context(struct cr_ctx *ctx,
> +			       struct mm_struct *mm, int parent);
>  
>  extern int cr_read_head_arch(struct cr_ctx *ctx);
>  extern int cr_read_thread(struct cr_ctx *ctx);
> diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
> new file mode 100644
> index 0000000..85546f4
> --- /dev/null
> +++ b/checkpoint/checkpoint_mem.h
> @@ -0,0 +1,41 @@
> +#ifndef _CHECKPOINT_CKPT_MEM_H_
> +#define _CHECKPOINT_CKPT_MEM_H_
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/mm_types.h>
> +
> +/*
> + * page-array chains: each cr_pgarr describes a set of <strcut page *,vaddr>

Typo: s/strcut/struct/

> + * tuples (where vaddr is the virtual address of a page in a particular mm).
> + * Specifically, we use separate arrays so that all vaddrs can be written
> + * and read at once.
> + */
> +
> +struct cr_pgarr {
> +	unsigned long *vaddrs;
> +	struct page **pages;
> +	unsigned int nr_used;
> +	struct list_head list;
> +};
> +
> +#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
> +#define CR_PGARR_CHUNK  (4 * CR_PGARR_TOTAL)
> +
> +extern void cr_pgarr_free(struct cr_ctx *ctx);
> +extern struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx);
> +extern void cr_pgarr_reset_all(struct cr_ctx *ctx);
> +
> +static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
> +{
> +	return (pgarr->nr_used == CR_PGARR_TOTAL);
> +}
> +
> +#endif /* _CHECKPOINT_CKPT_MEM_H_ */
> diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
> new file mode 100644
> index 0000000..a2fcdbf
> --- /dev/null
> +++ b/checkpoint/ckpt_mem.c
> @@ -0,0 +1,503 @@
> +/*
> + *  Checkpoint memory contents
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/pagemap.h>
> +#include <linux/mm_types.h>
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
> +#include "checkpoint_arch.h"
> +#include "checkpoint_mem.h"
> +
> +/*
> + * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
> + * (common to ckpt_mem.c and rstr_mem.c).
> + *
> + * The checkpoint context structure has two members for page-arrays:
> + *   ctx->pgarr_list: list head of the page-array chain

What's the second member?

> + *
> + * During checkpoint (and restart) the chain tracks the dirty pages (page
> + * pointer and virtual address) of each MM. For a particular MM, these are
> + * always added to the head of the page-array chain (ctx->pgarr_list).
> + * This "current" page-array advances as necessary, and new page-array
> + * descriptors are allocated on-demand. Before the next chunk of pages,
> + * the chain is reset but not freed (that is, dereference page pointers).
> + */
> +
> +/* return first page-array in the chain */
> +static inline struct cr_pgarr *cr_pgarr_first(struct cr_ctx *ctx)
> +{
> +	if (list_empty(&ctx->pgarr_list))
> +		return NULL;
> +	return list_first_entry(&ctx->pgarr_list, struct cr_pgarr, list);
> +}
> +
> +/* release pages referenced by a page-array */
> +static void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
> +{
> +	int i;
> +
> +	cr_debug("nr_used %d\n", pgarr->nr_used);
> +	/*
> +	 * although both checkpoint and restart use 'nr_used', we only
> +	 * collect pages during checkpoint; in restart we simply return
> +	 */
> +	if (!pgarr->pages)
> +		return;
> +	for (i = pgarr->nr_used; i--; /**/)
> +		page_cache_release(pgarr->pages[i]);

This is sorta hard to read (and non-intuitive).  Is it easier to do:

for (i = 0; i < pgarr->nr_used; i++)
	page_cache_release(pgarr->pages[i]);

It shouldn't matter what order you release the pages in.

> +}
> +
> +/* free a single page-array object */
> +static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
> +{
> +	cr_pgarr_release_pages(pgarr);
> +	kfree(pgarr->pages);
> +	kfree(pgarr->vaddrs);
> +	kfree(pgarr);
> +}
> +
> +/* free a chain of page-arrays */
> +void cr_pgarr_free(struct cr_ctx *ctx)
> +{
> +	struct cr_pgarr *pgarr, *tmp;
> +
> +	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
> +		list_del(&pgarr->list);
> +		cr_pgarr_free_one(pgarr);
> +	}
> +}
> +
> +/* allocate a single page-array object */
> +static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
> +{
> +	struct cr_pgarr *pgarr;
> +
> +	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
> +	if (!pgarr)
> +		return NULL;
> +
> +	pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
CR_PGARR_TOTAL is PAGE_SIZE / sizeof(void *), so each of these arrays
is exactly one page.  Why not __get_free_page()?

> +				GFP_KERNEL);
> +	if (!pgarr->vaddrs)
> +		goto nomem;
> +
> +	/* pgarr->pages is needed only for checkpoint */
> +	if (flags & CR_CTX_CKPT) {
> +		pgarr->pages = kmalloc(CR_PGARR_TOTAL * sizeof(struct page *),
> +				       GFP_KERNEL);
> +		if (!pgarr->pages)
> +			goto nomem;
> +	}
> +
> +	return pgarr;
> +
> + nomem:
> +	cr_pgarr_free_one(pgarr);
> +	return NULL;
> +}
> +
> +/* cr_pgarr_current - return the next available page-array in the chain
> + * @ctx: checkpoint context
> + *
> + * Returns the first page-array in the list that has space. Extends the
> + * list if none has space.
> + */
> +struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx)
> +{
> +	struct cr_pgarr *pgarr;
> +
> +	pgarr = cr_pgarr_first(ctx);
> +	if (pgarr && !cr_pgarr_is_full(pgarr))
> +		goto out;
> +	pgarr = cr_pgarr_alloc_one(ctx->flags);
> +	if (!pgarr)
> +		goto out;
> +	list_add(&pgarr->list, &ctx->pgarr_list);
> + out:
> +	return pgarr;
> +}
> +
> +/* reset the page-array chain (dropping page references if necessary) */
> +void cr_pgarr_reset_all(struct cr_ctx *ctx)
> +{
> +	struct cr_pgarr *pgarr;
> +
> +	list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
> +		cr_pgarr_release_pages(pgarr);
> +		pgarr->nr_used = 0;
> +	}

This doesn't look right.  cr_pgarr_current only ever looks at the head 
of the list, so resetting a list with > 1 pgarr on it will mean the 
non-head elements in the list will go to waste.

> +}
> +
> +/*
> + * Checkpoint is outside the context of the checkpointee, so one cannot
> + * simply read pages from user-space. Instead, we scan the address space
> + * of the target to cherry-pick pages of interest. Selected pages are
> + * enlisted in a page-array chain (attached to the checkpoint context).
> + * To save their contents, each page is mapped to kernel memory and then
> + * dumped to the file descriptor.
> + */
> +
> +
> +/**
> + * cr_private_follow_page - return page pointer for dirty pages
> + * @vma - target vma
> + * @addr - page address
> + *
> + * Looks up the page that correspond to the address in the vma, and
> + * returns the page if it was modified (and grabs a reference to it),
> + * or otherwise returns NULL (or error).
> + *
> + * This function should _only_ called for private vma's.
> + */
> +static struct page *
> +cr_private_follow_page(struct vm_area_struct *vma, unsigned long addr)

s/cr_private_follow_page/cr_follow_page_private/ ?

Maybe even cr_dump_private_page?  The fact that it's following the page
tables down to the page is an implementation artifact and isn't really
relevant to the semantics you want to express.

> +{
> +	struct page *page;
> +
> +	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
> +

This BUG_ON shouldn't be needed if it's already done in 
cr_private_vma_fill_pgarr.

> +	/*
> +	 * simplified version of get_user_pages(): already have vma,
> +	 * only need FOLL_ANON, and (for now) ignore fault stats.
> +	 *
> +	 * follow_page() will return NULL if the page is not present
> +	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
> +	 * the actual page pointer otherwise.
> +	 *
> +	 * FIXME: consolidate with get_user_pages()
> +	 */
> +
> +	cond_resched();
> +	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
> +		int ret;
> +
> +		/* the page is swapped out - bring it in (optimize ?) */
> +		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
> +		if (ret & VM_FAULT_ERROR) {
> +			if (ret & VM_FAULT_OOM)
> +				return ERR_PTR(-ENOMEM);
> +			else if (ret & VM_FAULT_SIGBUS)
> +				return ERR_PTR(-EFAULT);
> +			else
> +				BUG();
> +			break;
> +		}
> +		cond_resched();
> +	}
> +
> +	if (IS_ERR(page))
> +		return page;
> +
> +	/*
> +	 * We only care about dirty pages: either non-zero page, or
> +	 * file-backed (copy-on-write) that were touched. For the latter,
> +	 * the page_mapping() will be unset because it will no longer be
> +	 * mapped to the original file  after having been modified.
> +	 */
> +	if (page == ZERO_PAGE(0)) {
> +		/* this is the zero page: ignore */
> +		page_cache_release(page);
> +		page = NULL;
> +	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
> +		/* file backed clean cow: ignore */

Probably better to describe 'why' it can be ignored here.


> +		page_cache_release(page);
> +		page = NULL;
> +	}
> +
> +	return page;
> +}
> +
> +/**
> + * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
> + * @ctx - checkpoint context
> + * @pgarr - page-array to fill
> + * @vma - vma to scan
> + * @start - start address (updated)
> + *
> + * Returns the number of pages collected
> + */
> +static int
> +cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
> +			  struct vm_area_struct *vma, unsigned long *start)

This is sorta nasty because you shouldn't need to pass a pgarr container
into this routine.  It should be able to enqueue the (vaddr, page) tuple
directly on the ctx.  Doing so would also abstract out the pgarr
management at this level and make the code a lot simpler.

> +{
> +	unsigned long end = vma->vm_end;
> +	unsigned long addr = *start;
> +	int orig_used = pgarr->nr_used;
> +
> +	/* this function is only for private memory (anon or file-mapped) */
> +	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
> +
> +	while (addr < end) {
> +		struct page *page;
> +
> +		page = cr_private_follow_page(vma, addr);
> +		if (IS_ERR(page))
> +			return PTR_ERR(page);
> +
> +		if (page) {
> +			pgarr->pages[pgarr->nr_used] = page;
> +			pgarr->vaddrs[pgarr->nr_used] = addr;
> +			pgarr->nr_used++;

Should be something like:

ret = cr_ctx_append_page(ctx, addr, page);
if (ret < 0)
   goto out;

> +		}
> +
> +		addr += PAGE_SIZE;
> +
> +		if (cr_pgarr_is_full(pgarr))
> +			break;
> +	}
> +
> +	*start = addr;
> +	return pgarr->nr_used - orig_used;
> +}
> +
> +/* dump contents of a pages: use kmap_atomic() to avoid TLB flush */
> +static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
> +{
> +	void *ptr;
> +
> +	ptr = kmap_atomic(page, KM_USER1);
> +	memcpy(buf, ptr, PAGE_SIZE);
> +	kunmap_atomic(ptr, KM_USER1);
> +
> +	return cr_kwrite(ctx, buf, PAGE_SIZE);
> +}
> +
> +/**
> + * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
> + * @ctx - checkpoint context
> + * @total - total number of pages
> + *
> + * First dump all virtual addresses, followed by the contents of all pages
> + */
> +static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
> +{
> +	struct cr_pgarr *pgarr;
> +	char *buf;
> +	int i, ret = 0;
> +
> +	if (!total)
> +		return 0;
> +
> +	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
> +		ret = cr_kwrite(ctx, pgarr->vaddrs,
> +				pgarr->nr_used * sizeof(*pgarr->vaddrs));
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);

__get_free_page() instead of kmalloc(PAGE_SIZE, ...)?

> +	if (!buf)
> +		return -ENOMEM;
> +
> +	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
> +		for (i = 0; i < pgarr->nr_used; i++) {
> +			ret = cr_page_write(ctx, pgarr->pages[i], buf);
> +			if (ret < 0)
> +				goto out;
> +		}
> +	}
> +
> + out:
> +	kfree(buf);
> +	return ret;
> +}
> +
> +/**
> + * cr_write_private_vma_contents - dump contents of a VMA with private memory
> + * @ctx - checkpoint context
> + * @vma - vma to scan
> + *
> + * Collect lists of pages that needs to be dumped, and corresponding
> + * virtual addresses into ctx->pgarr_list page-array chain. Then dump
> + * the addresses, followed by the page contents.
> + */
> +static int
> +cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_pgarr *hh;
> +	unsigned long addr = vma->vm_start;
> +	struct cr_pgarr *pgarr;
> +	unsigned long cnt = 0;
> +	int ret;
> +
> +	/*
> +	 * Work iteratively, collecting and dumping at most CR_PGARR_CHUNK
> +	 * in each round. Each iterations is divided into two steps:
> +	 *
> +	 * (1) scan: scan through the PTEs of the vma to collect the pages
> +	 * to dump (later we'll also make them COW), while keeping a list
> +	 * of pages and their corresponding addresses on ctx->pgarr_list.
> +	 *
> +	 * (2) dump: write out a header specifying how many pages, followed
> +	 * by the addresses of all pages in ctx->pgarr_list, followed by
> +	 * the actual contents of all pages. (Then, release the references
> +	 * to the pages and reset the page-array chain).
> +	 *
> +	 * (This split makes the logic simpler by first counting the pages
> +	 * that need saving. More importantly, it allows for a future
> +	 * optimization that will reduce application downtime by deferring
> +	 * the actual write-out of the data to after the application is
> +	 * allowed to resume execution).
> +	 *
> +	 * After dumpting the entire contents, conclude with a header that
> +	 * specifies 0 pages to mark the end of the contents.
> +	 */
> +
> +	h.type = CR_HDR_PGARR;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	while (addr < vma->vm_end) {
> +		pgarr = cr_pgarr_current(ctx);
> +		if (!pgarr)
> +			return -ENOMEM;
> +		ret = cr_private_vma_fill_pgarr(ctx, pgarr, vma, &addr);
> +		if (ret < 0)
> +			return ret;
> +		cnt += ret;
> +
> +		/* did we complete a chunk, or is this the last chunk ? */
> +		if (cnt >= CR_PGARR_CHUNK || (cnt && addr == vma->vm_end)) {
> +			hh = cr_hbuf_get(ctx, sizeof(*hh));
> +			hh->nr_pages = cnt;
> +			ret = cr_write_obj(ctx, &h, hh);
> +			cr_hbuf_put(ctx, sizeof(*hh));
> +			if (ret < 0)
> +				return ret;
> +
> +			ret = cr_vma_dump_pages(ctx, cnt);
> +			if (ret < 0)
> +				return ret;
> +
> +			cr_pgarr_reset_all(ctx);
> +		}
> +	}
> +
> +	/* mark end of contents with header saying "0" pages */
> +	hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	hh->nr_pages = 0;
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +
> +	return ret;
> +}
> +
> +static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int vma_type, ret;
> +
> +	h.type = CR_HDR_VMA;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	hh->vm_start = vma->vm_start;
> +	hh->vm_end = vma->vm_end;
> +	hh->vm_page_prot = vma->vm_page_prot.pgprot;
> +	hh->vm_flags = vma->vm_flags;
> +	hh->vm_pgoff = vma->vm_pgoff;
> +
> +#define CR_BAD_VM_FLAGS  \
> +	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
> +
> +	if (vma->vm_flags & CR_BAD_VM_FLAGS) {
> +		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
> +		cr_hbuf_put(ctx, sizeof(*hh));
> +		return -ENOSYS;
> +	}
> +

Should the following code be broken out into its own function?  Handling
of other types of memory will follow and will clutter this one up.

> +	/* by default assume anon memory */
> +	vma_type = CR_VMA_ANON;
> +
> +	/*
> +	 * if there is a backing file, assume private-mapped

Shouldn't need to assume anything as you checked for VM_MAYSHARE and 
VM_SHARED above.

> +	 * (FIXME: check if the file is unlinked)
> +	 */
> +	if (vma->vm_file)
> +		vma_type = CR_VMA_FILE;
> +
> +	hh->vma_type = vma_type;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		return ret;
> +
> +	/* save the file name, if relevant */

s/, if relevant//

> +	if (vma->vm_file) {
> +		ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);

Why is this using a filename, rather than a reference to a file? 
Shouldn't this use the logic in patch 8/13?

> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	return cr_write_private_vma_contents(ctx, vma);
> +}
> +
> +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	int objref, ret;
> +
> +	h.type = CR_HDR_MM;
> +	h.len = sizeof(*hh);
> +	h.parent = task_pid_vnr(t);
> +
> +	mm = get_task_mm(t);
> +
> +	objref = 0;	/* will be meaningful with multiple processes */
> +	hh->objref = objref;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	hh->start_code = mm->start_code;
> +	hh->end_code = mm->end_code;
> +	hh->start_data = mm->start_data;
> +	hh->end_data = mm->end_data;
> +	hh->start_brk = mm->start_brk;
> +	hh->brk = mm->brk;
> +	hh->start_stack = mm->start_stack;
> +	hh->arg_start = mm->arg_start;
> +	hh->arg_end = mm->arg_end;
> +	hh->env_start = mm->env_start;
> +	hh->env_end = mm->env_end;
> +
> +	hh->map_count = mm->map_count;
> +
> +	/* FIX: need also mm->flags */
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		goto out;
> +
> +	/* write the vma's */
> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +		ret = cr_write_vma(ctx, vma);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +	ret = cr_write_mm_context(ctx, mm, objref);
> +
> + out:
> +	up_read(&mm->mmap_sem);
> +	mmput(mm);
> +	return ret;
> +}
> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> index bd14ef9..c547a1c 100644
> --- a/checkpoint/sys.c
> +++ b/checkpoint/sys.c
> @@ -16,6 +16,8 @@
>  #include <linux/capability.h>
>  #include <linux/checkpoint.h>
>  
> +#include "checkpoint_mem.h"
> +
>  /*
>   * Helpers to write(read) from(to) kernel space to(from) the checkpoint
>   * image file descriptor (similar to how a core-dump is performed).
> @@ -131,7 +133,13 @@ static void cr_ctx_free(struct cr_ctx *ctx)
>  {
>  	if (ctx->file)
>  		fput(ctx->file);
> +
>  	kfree(ctx->hbuf);
> +
> +	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
> +
> +	cr_pgarr_free(ctx);
> +
>  	kfree(ctx);
>  }
>  
> @@ -146,6 +154,8 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
>  
>  	ctx->flags = flags;
>  
> +	INIT_LIST_HEAD(&ctx->pgarr_list);
> +
>  	err = -EBADF;
>  	ctx->file = fget(fd);
>  	if (!ctx->file)
> diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> index 63f298f..4e97f9f 100644
> --- a/include/linux/checkpoint.h
> +++ b/include/linux/checkpoint.h
> @@ -10,6 +10,9 @@
>   *  distribution for more details.
>   */
>  
> +#include <linux/path.h>
> +#include <linux/fs.h>
> +
>  #define CR_VERSION  1
>  
>  struct cr_ctx {
> @@ -25,6 +28,10 @@ struct cr_ctx {
>  
>  	void *hbuf;		/* temporary buffer for headers */
>  	int hpos;		/* position in headers buffer */
> +
> +	struct list_head pgarr_list;	/* page array to dump VMA contents */
> +
> +	struct path fs_mnt;	/* container root (FIXME) */
>  };
>  
>  /* cr_ctx: flags */
> @@ -42,6 +49,8 @@ struct cr_hdr;
>  extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
>  extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
>  extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
> +extern int cr_write_fname(struct cr_ctx *ctx,
> +			  struct path *path, struct path *root);
>  
>  extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
>  extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
> @@ -50,7 +59,10 @@ extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
>  extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
>  
>  extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
> +extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
> +
>  extern int do_restart(struct cr_ctx *ctx, pid_t pid);
> +extern int cr_read_mm(struct cr_ctx *ctx);
>  
>  #define cr_debug(fmt, args...)  \
>  	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index b74b5f9..d78f0f1 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -34,6 +34,7 @@ enum {
>  	CR_HDR_HEAD_ARCH,
>  	CR_HDR_BUFFER,
>  	CR_HDR_STRING,
> +	CR_HDR_FNAME,
>  
>  	CR_HDR_TASK = 101,
>  	CR_HDR_THREAD,
> @@ -41,6 +42,7 @@ enum {
>  
>  	CR_HDR_MM = 201,
>  	CR_HDR_VMA,
> +	CR_HDR_PGARR,
>  	CR_HDR_MM_CONTEXT,
>  
>  	CR_HDR_TAIL = 5001
> @@ -75,4 +77,34 @@ struct cr_hdr_task {
>  	__s32 task_comm_len;
>  } __attribute__((aligned(8)));
>  
> +struct cr_hdr_mm {
> +	__u32 objref;		/* identifier for shared objects */
> +	__u32 map_count;
> +
> +	__u64 start_code, end_code, start_data, end_data;
> +	__u64 start_brk, brk, start_stack;
> +	__u64 arg_start, arg_end, env_start, env_end;
> +} __attribute__((aligned(8)));
> +
> +/* vma subtypes */
> +enum vm_type {
> +	CR_VMA_ANON = 1,
> +	CR_VMA_FILE

We need to figure out what MAP_SHARED | MAP_ANONYMOUS should be exposed 
as in this setup (much in the same way we need to start defining what 
shm mappings look like).  Internally, they are 'file-backed', but to 
userland, they aren't.

Thoughts?

> +};
> +
> +struct cr_hdr_vma {
> +	__u32 vma_type;
> +	__u32 _padding;

Why padding?
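
One plausible answer, offered as an assumption rather than as the
author's stated reason: an explicit pad keeps the 64-bit fields at the
same offsets on every ABI.  Some 32-bit ABIs (i386, for example) align
64-bit members to 4 bytes, so without the pad the offset of vm_start
could differ between a 32-bit and a 64-bit build.  The userspace sketch
below uses hypothetical demo_* names and <stdint.h> types in place of
__u32/__u64:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy of the header layout; names are hypothetical. */
struct demo_hdr_vma {
	uint32_t vma_type;
	uint32_t _padding;	/* keeps vm_start at offset 8 on every ABI */

	uint64_t vm_start;
	uint64_t vm_end;
	uint64_t vm_page_prot;
	uint64_t vm_flags;
	uint64_t vm_pgoff;
} __attribute__((aligned(8)));

/*
 * Same leading fields without the explicit pad: the offset of vm_start
 * now depends on how the compiler aligns 64-bit members.
 */
struct demo_hdr_vma_nopad {
	uint32_t vma_type;
	uint64_t vm_start;
} __attribute__((aligned(8)));

/* Offset of the first 64-bit field in the padded layout. */
static size_t demo_vm_start_offset(void)
{
	return offsetof(struct demo_hdr_vma, vm_start);
}

/* Total size of the padded layout. */
static size_t demo_hdr_vma_size(void)
{
	return sizeof(struct demo_hdr_vma);
}
```

In the padded layout every field offset is fixed by the declaration
itself, which matters if an image written by one kernel is to be read by
another.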

> +
> +	__u64 vm_start;
> +	__u64 vm_end;
> +	__u64 vm_page_prot;
> +	__u64 vm_flags;
> +	__u64 vm_pgoff;
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_pgarr {
> +	__u64 nr_pages;		/* number of pages to saved */
> +} __attribute__((aligned(8)));
> +
>  #endif /* _CHECKPOINT_CKPT_HDR_H_ */



* Re: [RFC v11][PATCH 05/13] Dump memory address space
@ 2008-12-18  2:26     ` Mike Waychison
From: Mike Waychison @ 2008-12-18  2:26 UTC
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linux Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Comments below.

Oren Laadan wrote:
> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
> it will be followed by the file name. Then comes the actual contents,
> in one or more chunk: each chunk begins with a header that specifies
> how many pages it holds, then the virtual addresses of all the dumped
> pages in that chunk, followed by the actual contents of all dumped
> pages. A header with zero number of pages marks the end of the contents.
> Then comes the next VMA and so on.
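
The chunk layout described above can be sketched as a small userspace
reader.  This is only an illustration under stated assumptions: it
ignores the generic struct cr_hdr that precedes each cr_hdr_pgarr, it
assumes 64-bit virtual addresses in the image, and demo_count_pages() is
a hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Walk a simplified memory image for one VMA and count the pages it
 * holds.  Each chunk is: a 64-bit page count, that many 64-bit virtual
 * addresses, then that many pages of data.  A count of zero ends the
 * VMA.  Returns the total number of pages, or -1 if the buffer is
 * truncated.
 */
static long demo_count_pages(const unsigned char *img, size_t len,
			     size_t page_size)
{
	const size_t per_page = sizeof(uint64_t) + page_size;
	size_t pos = 0;
	long total = 0;

	for (;;) {
		uint64_t nr_pages;

		if (len - pos < sizeof(nr_pages))
			return -1;
		memcpy(&nr_pages, img + pos, sizeof(nr_pages));
		pos += sizeof(nr_pages);

		if (nr_pages == 0)	/* zero-page header ends the VMA */
			return total;

		/* addresses first, then the contents of every page */
		if (nr_pages > (len - pos) / per_page)
			return -1;
		pos += (size_t)nr_pages * per_page;
		total += (long)nr_pages;
	}
}
```

Writing all addresses of a chunk before any page data means a reader can
learn where every page belongs before it touches the contents.
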
> 
> Changelog[v11]:
>   - Copy contents of 'init->fs->root' instead of pointing to them.
>   - Add missing test for VM_MAYSHARE when dumping memory
> 
> Changelog[v10]:
>   - Acquire dcache_lock around call to __d_path() in cr_fill_name()
> 
> Changelog[v9]:
>   - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup
>   - Test if __d_path() changes mnt/dentry (when crossing filesystem
>     namespace boundary). for now cr_fill_fname() fails the checkpoint.
> 
> Changelog[v7]:
>   - Fix argument given to kunmap_atomic() in memory dump/restore
> 
> Changelog[v6]:
>   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
>     (even though it's not really needed)
> 
> Changelog[v5]:
>   - Improve memory dump code (following Dave Hansen's comments)
>   - Change dump format (and code) to allow chunks of <vaddrs, pages>
>     instead of one long list of each
>   - Fix use of follow_page() to avoid faulting in non-present pages
> 
> Changelog[v4]:
>   - Use standard list_... for cr_pgarr
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> Acked-by: Serge Hallyn <serue@us.ibm.com>
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/checkpoint_hdr.h |    5 +
>  arch/x86/mm/checkpoint.c              |   31 ++
>  checkpoint/Makefile                   |    3 +-
>  checkpoint/checkpoint.c               |   88 ++++++
>  checkpoint/checkpoint_arch.h          |    2 +
>  checkpoint/checkpoint_mem.h           |   41 +++
>  checkpoint/ckpt_mem.c                 |  503 +++++++++++++++++++++++++++++++++
>  checkpoint/sys.c                      |   10 +
>  include/linux/checkpoint.h            |   12 +
>  include/linux/checkpoint_hdr.h        |   32 ++
>  10 files changed, 726 insertions(+), 1 deletions(-)
>  create mode 100644 checkpoint/checkpoint_mem.h
>  create mode 100644 checkpoint/ckpt_mem.c
> 
> diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
> index 6325062..33f4c70 100644
> --- a/arch/x86/include/asm/checkpoint_hdr.h
> +++ b/arch/x86/include/asm/checkpoint_hdr.h
> @@ -82,4 +82,9 @@ struct cr_hdr_cpu {
>  	/* thread_xstate contents follow (if used_math) */
>  } __attribute__((aligned(8)));
>  
> +struct cr_hdr_mm_context {
> +	__s16 ldt_entry_size;
> +	__s16 nldt;
> +} __attribute__((aligned(8)));
> +
>  #endif /* __ASM_X86_CKPT_HDR__H */
> diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
> index 8dd6d2d..757936e 100644
> --- a/arch/x86/mm/checkpoint.c
> +++ b/arch/x86/mm/checkpoint.c
> @@ -221,3 +221,34 @@ int cr_write_head_arch(struct cr_ctx *ctx)
>  
>  	return ret;
>  }
> +
> +/* dump the mm->context state */
> +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int ret;
> +
> +	h.type = CR_HDR_MM_CONTEXT;
> +	h.len = sizeof(*hh);
> +	h.parent = parent;
> +
> +	mutex_lock(&mm->context.lock);
> +
> +	hh->ldt_entry_size = LDT_ENTRY_SIZE;
> +	hh->nldt = mm->context.size;
> +
> +	cr_debug("nldt %d\n", hh->nldt);
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = cr_kwrite(ctx, mm->context.ldt,
> +			mm->context.size * LDT_ENTRY_SIZE);

Do we really want to emit anything under lock?  I realize that this 
patch goes and does a ton of writes with mmap_sem held for read -- is 
this ok?

> +
> + out:
> +	mutex_unlock(&mm->context.lock);
> +	return ret;
> +}
> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> index d2df68c..3a0df6d 100644
> --- a/checkpoint/Makefile
> +++ b/checkpoint/Makefile
> @@ -2,4 +2,5 @@
>  # Makefile for linux checkpoint/restart.
>  #
>  
> -obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
> +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
> +		ckpt_mem.o
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> index 17cc8d2..56d0ec2 100644
> --- a/checkpoint/checkpoint.c
> +++ b/checkpoint/checkpoint.c
> @@ -13,6 +13,7 @@
>  #include <linux/time.h>
>  #include <linux/fs.h>
>  #include <linux/file.h>
> +#include <linux/fdtable.h>
>  #include <linux/dcache.h>
>  #include <linux/mount.h>
>  #include <linux/utsname.h>
> @@ -75,6 +76,66 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
>  	return cr_write_obj(ctx, &h, str);
>  }
>  
> +/**
> + * cr_fill_fname - return pathname of a given file
> + * @path: path name
> + * @root: relative root
> + * @buf: buffer for pathname
> + * @n: buffer length (in) and pathname length (out)
> + */
> +static char *
> +cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
> +{
> +	struct path tmp = *root;
> +	char *fname;
> +
> +	BUG_ON(!buf);
> +	spin_lock(&dcache_lock);
> +	fname = __d_path(path, &tmp, buf, *n);
> +	spin_unlock(&dcache_lock);
> +	if (!IS_ERR(fname))
> +		*n = (buf + (*n) - fname);
> +	/*
> +	 * FIXME: if __d_path() changed these, it must have stepped out of
> +	 * init's namespace. Since currently we require a unified namespace
> +	 * within the container: simply fail.
> +	 */
> +	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
> +		fname = ERR_PTR(-EBADF);
> +
> +	return fname;
> +}
> +
> +/**
> + * cr_write_fname - write a file name
> + * @ctx: checkpoint context
> + * @path: path name
> + * @root: relative root
> + */
> +int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
> +{
> +	struct cr_hdr h;
> +	char *buf, *fname;
> +	int ret, flen;
> +
> +	flen = PATH_MAX;
> +	buf = kmalloc(flen, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	fname = cr_fill_fname(path, root, buf, &flen);
> +	if (!IS_ERR(fname)) {
> +		h.type = CR_HDR_FNAME;
> +		h.len = flen;
> +		h.parent = 0;
> +		ret = cr_write_obj(ctx, &h, fname);
> +	} else
> +		ret = PTR_ERR(fname);
> +
> +	kfree(buf);
> +	return ret;
> +}
> +
>  /* write the checkpoint header */
>  static int cr_write_head(struct cr_ctx *ctx)
>  {
> @@ -168,6 +229,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>  	cr_debug("task_struct: ret %d\n", ret);
>  	if (ret < 0)
>  		goto out;
> +	ret = cr_write_mm(ctx, t);
> +	cr_debug("memory: ret %d\n", ret);
> +	if (ret < 0)
> +		goto out;
>  	ret = cr_write_thread(ctx, t);
>  	cr_debug("thread: ret %d\n", ret);
>  	if (ret < 0)
> @@ -178,10 +243,33 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
>  	return ret;
>  }
>  
> +static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
> +{
> +	struct fs_struct *fs;
> +
> +	ctx->root_pid = pid;
> +
> +	/*
> +	 * assume checkpointer is in container's root vfs
> +	 * FIXME: this works for now, but will change with real containers
> +	 */
> +
> +	fs = current->fs;
> +	read_lock(&fs->lock);
> +	ctx->fs_mnt = fs->root;
> +	path_get(&ctx->fs_mnt);
> +	read_unlock(&fs->lock);
> +
> +	return 0;

Spurious return value?

> +}
> +
>  int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
>  {
>  	int ret;
>  
> +	ret = cr_ctx_checkpoint(ctx, pid);
> +	if (ret < 0)
> +		goto out;
>  	ret = cr_write_head(ctx);
>  	if (ret < 0)
>  		goto out;
> diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
> index ada1369..f06c7eb 100644
> --- a/checkpoint/checkpoint_arch.h
> +++ b/checkpoint/checkpoint_arch.h
> @@ -3,6 +3,8 @@
>  extern int cr_write_head_arch(struct cr_ctx *ctx);
>  extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
>  extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
> +extern int cr_write_mm_context(struct cr_ctx *ctx,
> +			       struct mm_struct *mm, int parent);
>  
>  extern int cr_read_head_arch(struct cr_ctx *ctx);
>  extern int cr_read_thread(struct cr_ctx *ctx);
> diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
> new file mode 100644
> index 0000000..85546f4
> --- /dev/null
> +++ b/checkpoint/checkpoint_mem.h
> @@ -0,0 +1,41 @@
> +#ifndef _CHECKPOINT_CKPT_MEM_H_
> +#define _CHECKPOINT_CKPT_MEM_H_
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/mm_types.h>
> +
> +/*
> + * page-array chains: each cr_pgarr describes a set of <strcut page *,vaddr>

s/strcut/struct/

> + * tuples (where vaddr is the virtual address of a page in a particular mm).
> + * Specifically, we use separate arrays so that all vaddrs can be written
> + * and read at once.
> + */
> +
> +struct cr_pgarr {
> +	unsigned long *vaddrs;
> +	struct page **pages;
> +	unsigned int nr_used;
> +	struct list_head list;
> +};
> +
> +#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
> +#define CR_PGARR_CHUNK  (4 * CR_PGARR_TOTAL)
> +
> +extern void cr_pgarr_free(struct cr_ctx *ctx);
> +extern struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx);
> +extern void cr_pgarr_reset_all(struct cr_ctx *ctx);
> +
> +static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
> +{
> +	return (pgarr->nr_used == CR_PGARR_TOTAL);
> +}
> +
> +#endif /* _CHECKPOINT_CKPT_MEM_H_ */
> diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
> new file mode 100644
> index 0000000..a2fcdbf
> --- /dev/null
> +++ b/checkpoint/ckpt_mem.c
> @@ -0,0 +1,503 @@
> +/*
> + *  Checkpoint memory contents
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/pagemap.h>
> +#include <linux/mm_types.h>
> +#include <linux/checkpoint.h>
> +#include <linux/checkpoint_hdr.h>
> +
> +#include "checkpoint_arch.h"
> +#include "checkpoint_mem.h"
> +
> +/*
> + * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
> + * (common to ckpt_mem.c and rstr_mem.c).
> + *
> + * The checkpoint context structure has two members for page-arrays:
> + *   ctx->pgarr_list: list head of the page-array chain

What's the second member?

> + *
> + * During checkpoint (and restart) the chain tracks the dirty pages (page
> + * pointer and virtual address) of each MM. For a particular MM, these are
> + * always added to the head of the page-array chain (ctx->pgarr_list).
> + * This "current" page-array advances as necessary, and new page-array
> + * descriptors are allocated on-demand. Before the next chunk of pages,
> + * the chain is reset but not freed (that is, dereference page pointers).
> + */
> +
> +/* return first page-array in the chain */
> +static inline struct cr_pgarr *cr_pgarr_first(struct cr_ctx *ctx)
> +{
> +	if (list_empty(&ctx->pgarr_list))
> +		return NULL;
> +	return list_first_entry(&ctx->pgarr_list, struct cr_pgarr, list);
> +}
> +
> +/* release pages referenced by a page-array */
> +static void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
> +{
> +	int i;
> +
> +	cr_debug("nr_used %d\n", pgarr->nr_used);
> +	/*
> +	 * although both checkpoint and restart use 'nr_used', we only
> +	 * collect pages during checkpoint; in restart we simply return
> +	 */
> +	if (!pgarr->pages)
> +		return;
> +	for (i = pgarr->nr_used; i--; /**/)
> +		page_cache_release(pgarr->pages[i]);

This is sorta hard to read (and non-intuitive).  Is it easier to do:

	for (i = 0; i < pgarr->nr_used; i++)
		page_cache_release(pgarr->pages[i]);

It shouldn't matter what order you release the pages in.

> +}
> +
> +/* free a single page-array object */
> +static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
> +{
> +	cr_pgarr_release_pages(pgarr);
> +	kfree(pgarr->pages);
> +	kfree(pgarr->vaddrs);
> +	kfree(pgarr);
> +}
> +
> +/* free a chain of page-arrays */
> +void cr_pgarr_free(struct cr_ctx *ctx)
> +{
> +	struct cr_pgarr *pgarr, *tmp;
> +
> +	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
> +		list_del(&pgarr->list);
> +		cr_pgarr_free_one(pgarr);
> +	}
> +}
> +
> +/* allocate a single page-array object */
> +static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
> +{
> +	struct cr_pgarr *pgarr;
> +
> +	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
> +	if (!pgarr)
> +		return NULL;
> +
> +	pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),

CR_PGARR_TOTAL is defined as PAGE_SIZE / sizeof(void *), so this
allocation is exactly one page.  Why not __get_free_page()?
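
The arithmetic behind this question can be checked in userspace.  The
sketch below assumes a 4096-byte page (the real value is
architecture-dependent) and uses hypothetical DEMO_* names that mirror
CR_PGARR_TOTAL:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE   4096UL	/* assumed page size; varies by arch */
#define DEMO_PGARR_TOTAL (DEMO_PAGE_SIZE / sizeof(void *))

/* Bytes requested for the pgarr->pages array (array of pointers). */
static size_t demo_pages_bytes(void)
{
	return DEMO_PGARR_TOTAL * sizeof(void *);
}

/* Bytes requested for the pgarr->vaddrs array (array of unsigned long). */
static size_t demo_vaddrs_bytes(void)
{
	return DEMO_PGARR_TOTAL * sizeof(unsigned long);
}
```

Wherever unsigned long and pointers have the same size, as they do in
the Linux kernel, both arrays come to exactly one page.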

> +				GFP_KERNEL);
> +	if (!pgarr->vaddrs)
> +		goto nomem;
> +
> +	/* pgarr->pages is needed only for checkpoint */
> +	if (flags & CR_CTX_CKPT) {
> +		pgarr->pages = kmalloc(CR_PGARR_TOTAL * sizeof(struct page *),
> +				       GFP_KERNEL);
> +		if (!pgarr->pages)
> +			goto nomem;
> +	}
> +
> +	return pgarr;
> +
> + nomem:
> +	cr_pgarr_free_one(pgarr);
> +	return NULL;
> +}
> +
> +/* cr_pgarr_current - return the next available page-array in the chain
> + * @ctx: checkpoint context
> + *
> + * Returns the first page-array in the list that has space. Extends the
> + * list if none has space.
> + */
> +struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx)
> +{
> +	struct cr_pgarr *pgarr;
> +
> +	pgarr = cr_pgarr_first(ctx);
> +	if (pgarr && !cr_pgarr_is_full(pgarr))
> +		goto out;
> +	pgarr = cr_pgarr_alloc_one(ctx->flags);
> +	if (!pgarr)
> +		goto out;
> +	list_add(&pgarr->list, &ctx->pgarr_list);
> + out:
> +	return pgarr;
> +}
> +
> +/* reset the page-array chain (dropping page references if necessary) */
> +void cr_pgarr_reset_all(struct cr_ctx *ctx)
> +{
> +	struct cr_pgarr *pgarr;
> +
> +	list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
> +		cr_pgarr_release_pages(pgarr);
> +		pgarr->nr_used = 0;
> +	}

This doesn't look right.  cr_pgarr_current() only ever looks at the head
of the list, so after resetting a list with more than one pgarr on it,
the non-head elements will never be reused and will go to waste.

> +}
> +
> +/*
> + * Checkpoint is outside the context of the checkpointee, so one cannot
> + * simply read pages from user-space. Instead, we scan the address space
> + * of the target to cherry-pick pages of interest. Selected pages are
> + * enlisted in a page-array chain (attached to the checkpoint context).
> + * To save their contents, each page is mapped to kernel memory and then
> + * dumped to the file descriptor.
> + */
> +
> +
> +/**
> + * cr_private_follow_page - return page pointer for dirty pages
> + * @vma - target vma
> + * @addr - page address
> + *
> + * Looks up the page that correspond to the address in the vma, and
> + * returns the page if it was modified (and grabs a reference to it),
> + * or otherwise returns NULL (or error).
> + *
> + * This function should _only_ called for private vma's.
> + */
> +static struct page *
> +cr_private_follow_page(struct vm_area_struct *vma, unsigned long addr)

s/cr_private_follow_page/cr_follow_page_private/ ?

Maybe even cr_dump_private_page?  The fact that it's following the page
tables down to the page is an implementation artifact and isn't really
relevant to the semantics you want to express.

> +{
> +	struct page *page;
> +
> +	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
> +

This BUG_ON shouldn't be needed if it's already done in 
cr_private_vma_fill_pgarr.

> +	/*
> +	 * simplified version of get_user_pages(): already have vma,
> +	 * only need FOLL_ANON, and (for now) ignore fault stats.
> +	 *
> +	 * follow_page() will return NULL if the page is not present
> +	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
> +	 * the actual page pointer otherwise.
> +	 *
> +	 * FIXME: consolidate with get_user_pages()
> +	 */
> +
> +	cond_resched();
> +	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
> +		int ret;
> +
> +		/* the page is swapped out - bring it in (optimize ?) */
> +		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
> +		if (ret & VM_FAULT_ERROR) {
> +			if (ret & VM_FAULT_OOM)
> +				return ERR_PTR(-ENOMEM);
> +			else if (ret & VM_FAULT_SIGBUS)
> +				return ERR_PTR(-EFAULT);
> +			else
> +				BUG();
> +			break;
> +		}
> +		cond_resched();
> +	}
> +
> +	if (IS_ERR(page))
> +		return page;
> +
> +	/*
> +	 * We only care about dirty pages: either non-zero page, or
> +	 * file-backed (copy-on-write) that were touched. For the latter,
> +	 * the page_mapping() will be unset because it will no longer be
> +	 * mapped to the original file  after having been modified.
> +	 */
> +	if (page == ZERO_PAGE(0)) {
> +		/* this is the zero page: ignore */
> +		page_cache_release(page);
> +		page = NULL;
> +	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
> +		/* file backed clean cow: ignore */

Probably better to describe 'why' it can be ignored here.


> +		page_cache_release(page);
> +		page = NULL;
> +	}
> +
> +	return page;
> +}
> +
> +/**
> + * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
> + * @ctx - checkpoint context
> + * @pgarr - page-array to fill
> + * @vma - vma to scan
> + * @start - start address (updated)
> + *
> + * Returns the number of pages collected
> + */
> +static int
> +cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
> +			  struct vm_area_struct *vma, unsigned long *start)

This is sorta nasty because you shouldn't need to call into this routine 
with a container.  It should be able to enqueue the (vaddr, page) tuple 
directly on the ctx.  Doing so would also abstract out the pgarr 
management at this level and make the code a lot simpler.

> +{
> +	unsigned long end = vma->vm_end;
> +	unsigned long addr = *start;
> +	int orig_used = pgarr->nr_used;
> +
> +	/* this function is only for private memory (anon or file-mapped) */
> +	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
> +
> +	while (addr < end) {
> +		struct page *page;
> +
> +		page = cr_private_follow_page(vma, addr);
> +		if (IS_ERR(page))
> +			return PTR_ERR(page);
> +
> +		if (page) {
> +			pgarr->pages[pgarr->nr_used] = page;
> +			pgarr->vaddrs[pgarr->nr_used] = addr;
> +			pgarr->nr_used++;

Should be something like:

ret = cr_ctx_append_page(ctx, addr, page);
if (ret < 0)
   goto out;

> +		}
> +
> +		addr += PAGE_SIZE;
> +
> +		if (cr_pgarr_is_full(pgarr))
> +			break;
> +	}
> +
> +	*start = addr;
> +	return pgarr->nr_used - orig_used;
> +}
> +
> +/* dump contents of a pages: use kmap_atomic() to avoid TLB flush */
> +static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
> +{
> +	void *ptr;
> +
> +	ptr = kmap_atomic(page, KM_USER1);
> +	memcpy(buf, ptr, PAGE_SIZE);
> +	kunmap_atomic(ptr, KM_USER1);
> +
> +	return cr_kwrite(ctx, buf, PAGE_SIZE);
> +}
> +
> +/**
> + * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
> + * @ctx - checkpoint context
> + * @total - total number of pages
> + *
> + * First dump all virtual addresses, followed by the contents of all pages
> + */
> +static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
> +{
> +	struct cr_pgarr *pgarr;
> +	char *buf;
> +	int i, ret = 0;
> +
> +	if (!total)
> +		return 0;
> +
> +	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
> +		ret = cr_kwrite(ctx, pgarr->vaddrs,
> +				pgarr->nr_used * sizeof(*pgarr->vaddrs));
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);

__get_free_page()

> +	if (!buf)
> +		return -ENOMEM;
> +
> +	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
> +		for (i = 0; i < pgarr->nr_used; i++) {
> +			ret = cr_page_write(ctx, pgarr->pages[i], buf);
> +			if (ret < 0)
> +				goto out;
> +		}
> +	}
> +
> + out:
> +	kfree(buf);
> +	return ret;
> +}
> +
> +/**
> + * cr_write_private_vma_contents - dump contents of a VMA with private memory
> + * @ctx - checkpoint context
> + * @vma - vma to scan
> + *
> + * Collect lists of pages that needs to be dumped, and corresponding
> + * virtual addresses into ctx->pgarr_list page-array chain. Then dump
> + * the addresses, followed by the page contents.
> + */
> +static int
> +cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_pgarr *hh;
> +	unsigned long addr = vma->vm_start;
> +	struct cr_pgarr *pgarr;
> +	unsigned long cnt = 0;
> +	int ret;
> +
> +	/*
> +	 * Work iteratively, collecting and dumping at most CR_PGARR_CHUNK
> +	 * in each round. Each iterations is divided into two steps:
> +	 *
> +	 * (1) scan: scan through the PTEs of the vma to collect the pages
> +	 * to dump (later we'll also make them COW), while keeping a list
> +	 * of pages and their corresponding addresses on ctx->pgarr_list.
> +	 *
> +	 * (2) dump: write out a header specifying how many pages, followed
> +	 * by the addresses of all pages in ctx->pgarr_list, followed by
> +	 * the actual contents of all pages. (Then, release the references
> +	 * to the pages and reset the page-array chain).
> +	 *
> +	 * (This split makes the logic simpler by first counting the pages
> +	 * that need saving. More importantly, it allows for a future
> +	 * optimization that will reduce application downtime by deferring
> +	 * the actual write-out of the data to after the application is
> +	 * allowed to resume execution).
> +	 *
> +	 * After dumpting the entire contents, conclude with a header that
> +	 * specifies 0 pages to mark the end of the contents.
> +	 */
> +
> +	h.type = CR_HDR_PGARR;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	while (addr < vma->vm_end) {
> +		pgarr = cr_pgarr_current(ctx);
> +		if (!pgarr)
> +			return -ENOMEM;
> +		ret = cr_private_vma_fill_pgarr(ctx, pgarr, vma, &addr);
> +		if (ret < 0)
> +			return ret;
> +		cnt += ret;
> +
> +		/* did we complete a chunk, or is this the last chunk ? */
> +		if (cnt >= CR_PGARR_CHUNK || (cnt && addr == vma->vm_end)) {
> +			hh = cr_hbuf_get(ctx, sizeof(*hh));
> +			hh->nr_pages = cnt;
> +			ret = cr_write_obj(ctx, &h, hh);
> +			cr_hbuf_put(ctx, sizeof(*hh));
> +			if (ret < 0)
> +				return ret;
> +
> +			ret = cr_vma_dump_pages(ctx, cnt);
> +			if (ret < 0)
> +				return ret;
> +
> +			cr_pgarr_reset_all(ctx);
> +		}
> +	}
> +
> +	/* mark end of contents with header saying "0" pages */
> +	hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	hh->nr_pages = 0;
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +
> +	return ret;
> +}
> +
> +static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	int vma_type, ret;
> +
> +	h.type = CR_HDR_VMA;
> +	h.len = sizeof(*hh);
> +	h.parent = 0;
> +
> +	hh->vm_start = vma->vm_start;
> +	hh->vm_end = vma->vm_end;
> +	hh->vm_page_prot = vma->vm_page_prot.pgprot;
> +	hh->vm_flags = vma->vm_flags;
> +	hh->vm_pgoff = vma->vm_pgoff;
> +
> +#define CR_BAD_VM_FLAGS  \
> +	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
> +
> +	if (vma->vm_flags & CR_BAD_VM_FLAGS) {
> +		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
> +		cr_hbuf_put(ctx, sizeof(*hh));
> +		return -ENOSYS;
> +	}
> +

The following code should be broken into its own function?  Handling of 
other types of memory will follow and will clutter this guy up.

> +	/* by default assume anon memory */
> +	vma_type = CR_VMA_ANON;
> +
> +	/*
> +	 * if there is a backing file, assume private-mapped

Shouldn't need to assume anything as you checked for VM_MAYSHARE and 
VM_SHARED above.

> +	 * (FIXME: check if the file is unlinked)
> +	 */
> +	if (vma->vm_file)
> +		vma_type = CR_VMA_FILE;
> +
> +	hh->vma_type = vma_type;
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		return ret;
> +
> +	/* save the file name, if relevant */

s/, if relevant//

> +	if (vma->vm_file) {
> +		ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);

Why is this using a filename, rather than a reference to a file? 
Shouldn't this use the logic in patch 8/13?

> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	return cr_write_private_vma_contents(ctx, vma);
> +}
> +
> +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	struct cr_hdr h;
> +	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	int objref, ret;
> +
> +	h.type = CR_HDR_MM;
> +	h.len = sizeof(*hh);
> +	h.parent = task_pid_vnr(t);
> +
> +	mm = get_task_mm(t);
> +
> +	objref = 0;	/* will be meaningful with multiple processes */
> +	hh->objref = objref;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	hh->start_code = mm->start_code;
> +	hh->end_code = mm->end_code;
> +	hh->start_data = mm->start_data;
> +	hh->end_data = mm->end_data;
> +	hh->start_brk = mm->start_brk;
> +	hh->brk = mm->brk;
> +	hh->start_stack = mm->start_stack;
> +	hh->arg_start = mm->arg_start;
> +	hh->arg_end = mm->arg_end;
> +	hh->env_start = mm->env_start;
> +	hh->env_end = mm->env_end;
> +
> +	hh->map_count = mm->map_count;
> +
> +	/* FIX: need also mm->flags */
> +
> +	ret = cr_write_obj(ctx, &h, hh);
> +	cr_hbuf_put(ctx, sizeof(*hh));
> +	if (ret < 0)
> +		goto out;
> +
> +	/* write the vma's */
> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +		ret = cr_write_vma(ctx, vma);
> +		if (ret < 0)
> +			goto out;
> +	}
> +
> +	ret = cr_write_mm_context(ctx, mm, objref);
> +
> + out:
> +	up_read(&mm->mmap_sem);
> +	mmput(mm);
> +	return ret;
> +}
> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> index bd14ef9..c547a1c 100644
> --- a/checkpoint/sys.c
> +++ b/checkpoint/sys.c
> @@ -16,6 +16,8 @@
>  #include <linux/capability.h>
>  #include <linux/checkpoint.h>
>  
> +#include "checkpoint_mem.h"
> +
>  /*
>   * Helpers to write(read) from(to) kernel space to(from) the checkpoint
>   * image file descriptor (similar to how a core-dump is performed).
> @@ -131,7 +133,13 @@ static void cr_ctx_free(struct cr_ctx *ctx)
>  {
>  	if (ctx->file)
>  		fput(ctx->file);
> +
>  	kfree(ctx->hbuf);
> +
> +	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
> +
> +	cr_pgarr_free(ctx);
> +
>  	kfree(ctx);
>  }
>  
> @@ -146,6 +154,8 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
>  
>  	ctx->flags = flags;
>  
> +	INIT_LIST_HEAD(&ctx->pgarr_list);
> +
>  	err = -EBADF;
>  	ctx->file = fget(fd);
>  	if (!ctx->file)
> diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> index 63f298f..4e97f9f 100644
> --- a/include/linux/checkpoint.h
> +++ b/include/linux/checkpoint.h
> @@ -10,6 +10,9 @@
>   *  distribution for more details.
>   */
>  
> +#include <linux/path.h>
> +#include <linux/fs.h>
> +
>  #define CR_VERSION  1
>  
>  struct cr_ctx {
> @@ -25,6 +28,10 @@ struct cr_ctx {
>  
>  	void *hbuf;		/* temporary buffer for headers */
>  	int hpos;		/* position in headers buffer */
> +
> +	struct list_head pgarr_list;	/* page array to dump VMA contents */
> +
> +	struct path fs_mnt;	/* container root (FIXME) */
>  };
>  
>  /* cr_ctx: flags */
> @@ -42,6 +49,8 @@ struct cr_hdr;
>  extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
>  extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
>  extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
> +extern int cr_write_fname(struct cr_ctx *ctx,
> +			  struct path *path, struct path *root);
>  
>  extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
>  extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
> @@ -50,7 +59,10 @@ extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
>  extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
>  
>  extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
> +extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
> +
>  extern int do_restart(struct cr_ctx *ctx, pid_t pid);
> +extern int cr_read_mm(struct cr_ctx *ctx);
>  
>  #define cr_debug(fmt, args...)  \
>  	pr_debug("[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__, ## args)
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index b74b5f9..d78f0f1 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -34,6 +34,7 @@ enum {
>  	CR_HDR_HEAD_ARCH,
>  	CR_HDR_BUFFER,
>  	CR_HDR_STRING,
> +	CR_HDR_FNAME,
>  
>  	CR_HDR_TASK = 101,
>  	CR_HDR_THREAD,
> @@ -41,6 +42,7 @@ enum {
>  
>  	CR_HDR_MM = 201,
>  	CR_HDR_VMA,
> +	CR_HDR_PGARR,
>  	CR_HDR_MM_CONTEXT,
>  
>  	CR_HDR_TAIL = 5001
> @@ -75,4 +77,34 @@ struct cr_hdr_task {
>  	__s32 task_comm_len;
>  } __attribute__((aligned(8)));
>  
> +struct cr_hdr_mm {
> +	__u32 objref;		/* identifier for shared objects */
> +	__u32 map_count;
> +
> +	__u64 start_code, end_code, start_data, end_data;
> +	__u64 start_brk, brk, start_stack;
> +	__u64 arg_start, arg_end, env_start, env_end;
> +} __attribute__((aligned(8)));
> +
> +/* vma subtypes */
> +enum vm_type {
> +	CR_VMA_ANON = 1,
> +	CR_VMA_FILE

We need to figure out what MAP_SHARED | MAP_ANONYMOUS should be exposed 
as in this setup (much in the same way we need to start defining what 
shm mappings look like).  Internally, they are 'file-backed', but to 
userland, they aren't.
 

Thoughts?

> +};
> +
> +struct cr_hdr_vma {
> +	__u32 vma_type;
> +	__u32 _padding;

Why padding?

> +
> +	__u64 vm_start;
> +	__u64 vm_end;
> +	__u64 vm_page_prot;
> +	__u64 vm_flags;
> +	__u64 vm_pgoff;
> +} __attribute__((aligned(8)));
> +
> +struct cr_hdr_pgarr {
> +	__u64 nr_pages;		/* number of pages to saved */
> +} __attribute__((aligned(8)));
> +
>  #endif /* _CHECKPOINT_CKPT_HDR_H_ */


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 05/13] Dump memory address space
       [not found]     ` <4949B4ED.9060805-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2008-12-18 11:10       ` Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-18 11:10 UTC (permalink / raw)
  To: Mike Waychison
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar



Mike Waychison wrote:
> Comments below.

Thanks for the detailed review.

> 
> Oren Laadan wrote:
>> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
>> it will be followed by the file name. Then comes the actual contents,
>> in one or more chunk: each chunk begins with a header that specifies
>> how many pages it holds, then the virtual addresses of all the dumped
>> pages in that chunk, followed by the actual contents of all dumped
>> pages. A header with zero number of pages marks the end of the contents.
>> Then comes the next VMA and so on.
>>
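For readers following the layout described above, here is a rough userspace sketch of walking one VMA's contents. It is a simplification under stated assumptions, not the patch's reader: the real image wraps each count in a 'struct cr_hdr' plus 'struct cr_hdr_pgarr', which is skipped here, and walk_vma_contents() and SKETCH_PAGE_SIZE are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16	/* tiny stand-in "page" so examples stay small */

/*
 * Walk the chunks of one VMA: each chunk is a 64-bit page count,
 * then that many 64-bit virtual addresses, then that many pages.
 * A count of zero ends the contents.
 * Returns the total number of pages seen, or -1 if the buffer is truncated.
 */
static long walk_vma_contents(const unsigned char *p, size_t len)
{
	const size_t per_page = sizeof(uint64_t) + SKETCH_PAGE_SIZE;
	long total = 0;

	for (;;) {
		uint64_t nr;

		if (len < sizeof(nr))
			return -1;
		memcpy(&nr, p, sizeof(nr));
		p += sizeof(nr);
		len -= sizeof(nr);

		if (nr == 0)		/* terminating header */
			return total;

		if (nr > len / per_page)	/* addresses + data must fit */
			return -1;
		p += nr * per_page;
		len -= nr * per_page;
		total += (long)nr;
	}
}
```

Calling it on a buffer that holds one two-page chunk followed by a zero count should report two pages, and a buffer cut short anywhere should be rejected.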

[...]

>> +    mutex_lock(&mm->context.lock);
>> +
>> +    hh->ldt_entry_size = LDT_ENTRY_SIZE;
>> +    hh->nldt = mm->context.size;
>> +
>> +    cr_debug("nldt %d\n", hh->nldt);
>> +
>> +    ret = cr_write_obj(ctx, &h, hh);
>> +    cr_hbuf_put(ctx, sizeof(*hh));
>> +    if (ret < 0)
>> +        goto out;
>> +
>> +    ret = cr_kwrite(ctx, mm->context.ldt,
>> +            mm->context.size * LDT_ENTRY_SIZE);
> 
> Do we really want to emit anything under lock?  I realize that this
> patch goes and does a ton of writes with mmap_sem held for read -- is
> this ok?

Because all tasks in the container must be frozen during the checkpoint,
there is no performance penalty for keeping the locks. Although the object
should not change in the interim anyway, the locks protect us from, e.g.,
the task unfreezing somehow, or being killed by the OOM killer, or any
other change incurred from the "outside world" (even future code).

Put in other words - in the long run it is safer to assume that the
underlying object may otherwise change.

(If we want to drop the lock here before cr_kwrite(), we need to copy the
data to a temporary buffer first. If we also want to drop mmap_sem, we
need to be more careful with following the vma's.)
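As an illustration of the copy-then-write alternative, here is a minimal userspace sketch using a pthread mutex in place of mm->context.lock. All names in it (table_sketch, mem_sink, emit_to_mem, snapshot_then_emit) are hypothetical and only stand in for the LDT and cr_kwrite(); it is not the code in the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the locked object (e.g. the LDT under mm->context.lock). */
struct table_sketch {
	pthread_mutex_t lock;
	unsigned char *data;
	size_t len;
};

/* Stand-in for the checkpoint image: a small in-memory sink. */
struct mem_sink {
	unsigned char buf[64];
	size_t used;
};

typedef int (*emit_fn)(void *sink, const void *buf, size_t len);

static int emit_to_mem(void *sink, const void *buf, size_t len)
{
	struct mem_sink *s = sink;

	if (len > sizeof(s->buf) - s->used)
		return -1;
	memcpy(s->buf + s->used, buf, len);
	s->used += len;
	return 0;
}

/* Copy under the lock, then do the (possibly slow) write without it. */
static int snapshot_then_emit(struct table_sketch *t, emit_fn emit, void *sink)
{
	unsigned char *tmp;
	size_t len;
	int ret;

	pthread_mutex_lock(&t->lock);
	len = t->len;
	tmp = malloc(len ? len : 1);
	if (!tmp) {
		pthread_mutex_unlock(&t->lock);
		return -1;
	}
	if (len)
		memcpy(tmp, t->data, len);
	pthread_mutex_unlock(&t->lock);

	ret = emit(sink, tmp, len);	/* the lock is no longer held here */
	free(tmp);
	return ret;
}
```

The cost is one extra allocation and copy per object, which is the trade-off against simply holding the lock across the write.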

Do you see a reason not to keep the locks?

>> +
>> + out:
>> +    mutex_unlock(&mm->context.lock);
>> +    return ret;
>> +}

[...]

>> +static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
>> +{
>> +    struct fs_struct *fs;
>> +
>> +    ctx->root_pid = pid;
>> +
>> +    /*
>> +     * assume checkpointer is in container's root vfs
>> +     * FIXME: this works for now, but will change with real containers
>> +     */
>> +
>> +    fs = current->fs;
>> +    read_lock(&fs->lock);
>> +    ctx->fs_mnt = fs->root;
>> +    path_get(&ctx->fs_mnt);
>> +    read_unlock(&fs->lock);
>> +
>> +    return 0;
> 
> Spurious return value?

In a later patch (10/13: External checkpoint of a task other than ourself)
it becomes more useful.

> 
>> +}
>> +

[...]

>> +/*
>> + * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
>> + * (common to ckpt_mem.c and rstr_mem.c).
>> + *
>> + * The checkpoint context structure has two members for page-arrays:
>> + *   ctx->pgarr_list: list head of the page-array chain
> 
> What's the second member?

Duh... will update text.

> 

[...]

>> +    for (i = pgarr->nr_used; i--; /**/)
>> +        page_cache_release(pgarr->pages[i]);
> 
> This is sorta hard to read (and non-intuitive).  Is it easier to do:
> 
> 
> for (i = 0; i < pgarr->nr_used; i++)
>     page_cache_release(pgarr->pages[i]);
> 
> 
> It shouldn't matter what order you release the pages in..

Was meant to avoid a dereference to 'pgarr->nr_used' in the comparison.
(though I doubt if the performance impact is at all visible)

[...]

>> +/* allocate a single page-array object */
>> +static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
>> +{
>> +    struct cr_pgarr *pgarr;
>> +
>> +    pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
>> +    if (!pgarr)
>> +        return NULL;
>> +
>> +    pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
> You used PAGE_SIZE / sizeof(void *) above.   Why not __get_free_page()?

Hahaha .. well, it's a guaranteed method to keep Dave Hansen from
barking about not using kmalloc ...

Personally I prefer __get_free_page() here, but not enough to keep
arguing with him. Let me know when the two of you settle it :)

> 
>> +                GFP_KERNEL);
>> +    if (!pgarr->vaddrs)
>> +        goto nomem;

[...]

>> +/* reset the page-array chain (dropping page references if necessary) */
>> +void cr_pgarr_reset_all(struct cr_ctx *ctx)
>> +{
>> +    struct cr_pgarr *pgarr;
>> +
>> +    list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
>> +        cr_pgarr_release_pages(pgarr);
>> +        pgarr->nr_used = 0;
>> +    }
> 
> This doesn't look right.  cr_pgarr_current only ever looks at the head
> of the list, so resetting a list with > 1 pgarr on it will mean the
> non-head elements in the list will go to waste.

You're correct.
(the code is from a cleanup suggested to v4 and incorporated into v5).

> 
>> +}
>> +
>> +/*
>> + * Checkpoint is outside the context of the checkpointee, so one cannot
>> + * simply read pages from user-space. Instead, we scan the address space
>> + * of the target to cherry-pick pages of interest. Selected pages are
>> + * enlisted in a page-array chain (attached to the checkpoint context).
>> + * To save their contents, each page is mapped to kernel memory and then
>> + * dumped to the file descriptor.
>> + */
>> +
>> +
>> +/**
>> + * cr_private_follow_page - return page pointer for dirty pages
>> + * @vma - target vma
>> + * @addr - page address
>> + *
>> + * Looks up the page that correspond to the address in the vma, and
>> + * returns the page if it was modified (and grabs a reference to it),
>> + * or otherwise returns NULL (or error).
>> + *
>> + * This function should _only_ called for private vma's.
>> + */
>> +static struct page *
>> +cr_private_follow_page(struct vm_area_struct *vma, unsigned long addr)
> 
> s/cr_private_follow_page/cr_follow_page_private/ ?
> 

ok.

> Maybe even cr_dump_private_page?  The fact that it's following the page
>  tables down to the page is an implementation artifact and isn't really
> relevant to the semantics you want to express.

Except that we don't dump the page there - we follow the page tables and
decide whether we add it to the list of scanned pages. But, ok, we can
also do cr_consider_page_private() (or examine, or scan ..)

> 
>> +{
>> +    struct page *page;
>> +
>> +    BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
>> +
> 
> This BUG_ON shouldn't be needed if it's already done in
> cr_private_vma_fill_pgarr.
> 

Leftover, will remove.

>> +    /*

[...]

>> +    } else if (vma->vm_file && (page_mapping(page) != NULL)) {
>> +        /* file backed clean cow: ignore */
> 
> Probably better to describe 'why' it can be ignored here.

ok.

>> +        page_cache_release(page);
>> +        page = NULL;
>> +    }
>> +
>> +    return page;
>> +}
>> +
>> +/**
>> + * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
>> + * @ctx - checkpoint context
>> + * @pgarr - page-array to fill
>> + * @vma - vma to scan
>> + * @start - start address (updated)
>> + *
>> + * Returns the number of pages collected
>> + */
>> +static int
>> +cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
>> +              struct vm_area_struct *vma, unsigned long *start)
> 
> This is sorta nasty because you shouldn't need to call into this routine
> with a container.  It should be able to enqueue the (vaddr, page) tuple
> directly on the ctx.  Doing so would also abstract out the pgarr
> management at this level and make the code a lot simpler.
> 

Yes, @pgarr can be abstracted inside here.

>> +{
>> +    unsigned long end = vma->vm_end;
>> +    unsigned long addr = *start;
>> +    int orig_used = pgarr->nr_used;
>> +
>> +    /* this function is only for private memory (anon or file-mapped) */
>> +    BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
>> +
>> +    while (addr < end) {
>> +        struct page *page;
>> +
>> +        page = cr_private_follow_page(vma, addr);
>> +        if (IS_ERR(page))
>> +            return PTR_ERR(page);
>> +
>> +        if (page) {
>> +            pgarr->pages[pgarr->nr_used] = page;
>> +            pgarr->vaddrs[pgarr->nr_used] = addr;
>> +            pgarr->nr_used++;
> 
> Should be something like:
> 
> ret = cr_ctx_append_page(ctx, addr, page);
> if (ret < 0)
>   goto out;

My concern here is performance: keeping track of @pgarr avoids the
reference through ctx. We may loop over MBs of memory, tens of
thousands of pages, in individual VMAs.
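For reference, the kind of helper being suggested could look roughly like the userspace sketch below. It is only an illustration: ctx_append_page_sketch(), the two structs, and PGARR_SKETCH_TOTAL are invented names, plain pointers replace the kernel list, and a void pointer stands in for struct page.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define PGARR_SKETCH_TOTAL 4	/* entries per array; tiny on purpose */

struct pgarr_sketch {
	struct pgarr_sketch *next;	/* previously filled arrays */
	unsigned long vaddrs[PGARR_SKETCH_TOTAL];
	void *pages[PGARR_SKETCH_TOTAL];
	int nr_used;
};

struct ctx_sketch {
	struct pgarr_sketch *head;	/* array currently being filled */
};

/* Append one (vaddr, page) tuple, growing the chain when the head is full. */
static int ctx_append_page_sketch(struct ctx_sketch *ctx,
				  unsigned long addr, void *page)
{
	struct pgarr_sketch *pgarr = ctx->head;

	if (!pgarr || pgarr->nr_used == PGARR_SKETCH_TOTAL) {
		pgarr = calloc(1, sizeof(*pgarr));
		if (!pgarr)
			return -ENOMEM;
		pgarr->next = ctx->head;
		ctx->head = pgarr;
	}

	pgarr->vaddrs[pgarr->nr_used] = addr;
	pgarr->pages[pgarr->nr_used] = page;
	pgarr->nr_used++;
	return 0;
}
```

The extra cost per page in this form is one load of ctx->head and one comparison, which is the overhead the performance concern above is about.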

>> +        }

[...]

>> +
>> +    buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> 
> __get_free_page()

lol... gonna run an experiment and change this one (Dave's main
argument regarding better "debuggability" of kmalloc() doesn't
hold here anyway!).

>> +    if (!buf)
>> +        return -ENOMEM;

[...]

>> +#define CR_BAD_VM_FLAGS  \
>> +    (VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
>> +
>> +    if (vma->vm_flags & CR_BAD_VM_FLAGS) {
>> +        pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
>> +        cr_hbuf_put(ctx, sizeof(*hh));
>> +        return -ENOSYS;
>> +    }
>> +
> 
> The following code should be broken into it's own function?  Handling of
> other types of memory will follow and will clutter this guy up.
> 

I deferred this until I add those "other types".

>> +    /* by default assume anon memory */
>> +    vma_type = CR_VMA_ANON;

[...]

>> +    if (vma->vm_file) {
>> +        ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);
> 
> Why is this using a filename, rather than a reference to a file?

Could be a reference to a file, but it isn't strictly necessary (*) and it
won't gain that much.

Not necessary: open files may be shared and then we _must_ use the same file
pointer. In contrast, a memory mapping only needs _an_ open file.
Won't gain much: because file pointers of mapped regions are usually only
shared in the case of fork() without a following exec().

(*) It is strictly necessary when it comes to handling shared memory.

So I left this optimization for later.

> Shouldn't this use the logic in patch 8/13?

Yes. But need to make sure (especially on the restart side) to consider the
exceptions - e.g. a file in SHMFS used for anonymous shared memory, etc.

So yes, I'll add a FIXME comment there.

> 
>> +        if (ret < 0)
>> +            return ret;
>> +    }

[...]

>> +enum vm_type {
>> +    CR_VMA_ANON = 1,
>> +    CR_VMA_FILE
> 
> We need to figure out what MAP_SHARED | MAP_ANONYMOUS should be exposed
> as in this setup (much in the same way we need to start defining what
> shm mappings look like).  Internally, they are 'file-backed', but to
> userland, they aren't.
> 
> 
> Thoughts?

Eventually we'll have CR_VMA_ANON_SHM, CR_VMA_FILE_SHM, CR_VMA_IPC_SHM,
to identify the vma type. There will also be a flag "skip" that says that
the actual contents of the memory have already been copied earlier. (And,
for completeness, a flag "xfile" which indicates that the referenced
file is unlinked, in the case of CR_VMA_FILE and CR_VMA_FILE_SHM).

It's not a lot of work, only that I'm actually holding back on adding
more features, and focusing on getting this into the -mm tree first. I
don't want to write lots of code and then modify it again and again...

> 
>> +};
>> +
>> +struct cr_hdr_vma {
>> +    __u32 vma_type;
>> +    __u32 _padding;
> 
> Why padding?

For 64 bit architectures. See this thread:
https://lists.linux-foundation.org/pipermail/containers/2008-August/012318.html

Quoting Arnd Bergmann:
  "This structure has an odd multiple of 32-bit members, which means
  that if you put it into a larger structure that also contains
  64-bit members, the larger structure may get different alignment
  on x86-32 and x86-64, which you might want to avoid.
  I can't tell if this is an actual problem here.
  ...
  ...
  In this case, I'm pretty sure that sizeof(cr_hdr_task) on x86-32 is
  different from x86-64, since it will be 32-bit aligned on x86-32."
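To make the point concrete, here is a small userspace sketch that mirrors the layout of struct cr_hdr_vma, with uint32_t/uint64_t from <stdint.h> standing in for __u32/__u64. The names vma_hdr_sketch and vma_hdr_layout_ok are made up for this example. The i386 ABI aligns 64-bit integers inside a struct to 4 bytes, while x86-64 aligns them to 8, so without the explicit padding word vm_start would sit at offset 4 on one and offset 8 on the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for struct cr_hdr_vma from the patch. */
struct vma_hdr_sketch {
	uint32_t vma_type;
	uint32_t _padding;	/* keeps vm_start at offset 8 on every ABI */

	uint64_t vm_start;
	uint64_t vm_end;
	uint64_t vm_page_prot;
	uint64_t vm_flags;
	uint64_t vm_pgoff;
} __attribute__((aligned(8)));

/* Returns 1 if the layout matches what both 32-bit and 64-bit readers expect. */
static int vma_hdr_layout_ok(void)
{
	return offsetof(struct vma_hdr_sketch, vm_start) == 8 &&
	       sizeof(struct vma_hdr_sketch) == 48;
}
```

With the padding in place the check should hold whether the file is built with -m32 or -m64, which is what lets an image written by one kernel be parsed by the other.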

>> +
>> +    __u64 vm_start;
>> +    __u64 vm_end;
>> +    __u64 vm_page_prot;
>> +    __u64 vm_flags;
>> +    __u64 vm_pgoff;
>> +} __attribute__((aligned(8)));
>> +
>> +struct cr_hdr_pgarr {
>> +    __u64 nr_pages;        /* number of pages to saved */
>> +} __attribute__((aligned(8)));
>> +
>>  #endif /* _CHECKPOINT_CKPT_HDR_H_ */

Oren.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 05/13] Dump memory address space
  2008-12-18  2:26     ` Mike Waychison
@ 2008-12-18 11:10       ` Oren Laadan
  -1 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-18 11:10 UTC (permalink / raw)
  To: Mike Waychison
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linux Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar



Mike Waychison wrote:
> Comments below.

Thanks for the detailed review.

> 
> Oren Laadan wrote:
>> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
>> it will be followed by the file name. Then comes the actual contents,
>> in one or more chunk: each chunk begins with a header that specifies
>> how many pages it holds, then the virtual addresses of all the dumped
>> pages in that chunk, followed by the actual contents of all dumped
>> pages. A header with zero number of pages marks the end of the contents.
>> Then comes the next VMA and so on.
>>

[...]

>> +    mutex_lock(&mm->context.lock);
>> +
>> +    hh->ldt_entry_size = LDT_ENTRY_SIZE;
>> +    hh->nldt = mm->context.size;
>> +
>> +    cr_debug("nldt %d\n", hh->nldt);
>> +
>> +    ret = cr_write_obj(ctx, &h, hh);
>> +    cr_hbuf_put(ctx, sizeof(*hh));
>> +    if (ret < 0)
>> +        goto out;
>> +
>> +    ret = cr_kwrite(ctx, mm->context.ldt,
>> +            mm->context.size * LDT_ENTRY_SIZE);
> 
> Do we really want to emit anything under lock?  I realize that this
> patch goes and does a ton of writes with mmap_sem held for read -- is
> this ok?

Because all tasks in the container must be frozen during the checkpoint,
there is no performance penalty for keeping the locks. Although the object
should not change in the interim anyway, the locks protect us from, e.g.,
the task unfreezing somehow, or being killed by the OOM killer, or any
other change incurred from the "outside world" (even future code).

Put in other words - in the long run it is safer to assume that the
underlying object may otherwise change.

(If we want to drop the lock here before cr_kwrite(), we need to copy the
data to a temporary buffer first. If we also want to drop mmap_sem, we
need to be more careful with following the vma's.)

Do you see a reason not to keep the locks?

>> +
>> + out:
>> +    mutex_unlock(&mm->context.lock);
>> +    return ret;
>> +}

[...]

>> +static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
>> +{
>> +    struct fs_struct *fs;
>> +
>> +    ctx->root_pid = pid;
>> +
>> +    /*
>> +     * assume checkpointer is in container's root vfs
>> +     * FIXME: this works for now, but will change with real containers
>> +     */
>> +
>> +    fs = current->fs;
>> +    read_lock(&fs->lock);
>> +    ctx->fs_mnt = fs->root;
>> +    path_get(&ctx->fs_mnt);
>> +    read_unlock(&fs->lock);
>> +
>> +    return 0;
> 
> Spurious return value?

In a later patch (10/13: External checkpoint of a task other than ourself)
it becomes more useful.

> 
>> +}
>> +

[...]

>> +/*
>> + * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
>> + * (common to ckpt_mem.c and rstr_mem.c).
>> + *
>> + * The checkpoint context structure has two members for page-arrays:
>> + *   ctx->pgarr_list: list head of the page-array chain
> 
> What's the second member?

Duh... will update text.

> 

[...]

>> +    for (i = pgarr->nr_used; i--; /**/)
>> +        page_cache_release(pgarr->pages[i]);
> 
> This is sorta hard to read (and non-intuitive).  Is it easier to do:
> 
> 
> for (i = 0; i < pgarr->nr_used; i++)
>     page_cache_release(pgarr->pages[i]);
> 
> 
> It shouldn't matter what order you release the pages in..

Was meant to avoid a dereference to 'pgarr->nr_used' in the comparison.
(though I doubt if the performance impact is at all visible)

[...]

>> +/* allocate a single page-array object */
>> +static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
>> +{
>> +    struct cr_pgarr *pgarr;
>> +
>> +    pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
>> +    if (!pgarr)
>> +        return NULL;
>> +
>> +    pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
> You used PAGE_SIZE / sizeof(void *) above.   Why not __get_free_page()?

Hahaha .. well, it's a guaranteed method to keep Dave Hansen from
barking about not using kmalloc ...

Personally I prefer __get_free_page() here, but not enough to keep
arguing with him. Let me know when the two of you settle it :)

> 
>> +                GFP_KERNEL);
>> +    if (!pgarr->vaddrs)
>> +        goto nomem;

[...]

>> +/* reset the page-array chain (dropping page references if necessary) */
>> +void cr_pgarr_reset_all(struct cr_ctx *ctx)
>> +{
>> +    struct cr_pgarr *pgarr;
>> +
>> +    list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
>> +        cr_pgarr_release_pages(pgarr);
>> +        pgarr->nr_used = 0;
>> +    }
> 
> This doesn't look right.  cr_pgarr_current only ever looks at the head
> of the list, so resetting a list with > 1 pgarr on it will mean the
> non-head elements in the list will go to waste.

You're correct.
(the code is from a cleanup suggested to v4 and incorporated into v5).

> 
>> +}
>> +
>> +/*
>> + * Checkpoint is outside the context of the checkpointee, so one cannot
>> + * simply read pages from user-space. Instead, we scan the address space
>> + * of the target to cherry-pick pages of interest. Selected pages are
>> + * enlisted in a page-array chain (attached to the checkpoint context).
>> + * To save their contents, each page is mapped to kernel memory and then
>> + * dumped to the file descriptor.
>> + */
>> +
>> +
>> +/**
>> + * cr_private_follow_page - return page pointer for dirty pages
>> + * @vma - target vma
>> + * @addr - page address
>> + *
>> + * Looks up the page that correspond to the address in the vma, and
>> + * returns the page if it was modified (and grabs a reference to it),
>> + * or otherwise returns NULL (or error).
>> + *
>> + * This function should _only_ called for private vma's.
>> + */
>> +static struct page *
>> +cr_private_follow_page(struct vm_area_struct *vma, unsigned long addr)
> 
> s/cr_private_follow_page/cr_follow_page_private/ ?
> 

ok.

> Maybe even cr_dump_private_page?  The fact that it's following the page
>  tables down to the page is an implementation artifact and isn't really
> relevant to the semantics you want to express.

Except that we don't dump the page there - we follow the page tables and
decide whether we add it to the list of scanned pages. But, ok, we can
also do cr_consider_page_private() (or examine, or scan ..)

> 
>> +{
>> +    struct page *page;
>> +
>> +    BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
>> +
> 
> This BUG_ON shouldn't be needed if it's already done in
> cr_private_vma_fill_pgarr.
> 

Leftover, will remove.

>> +    /*

[...]

>> +    } else if (vma->vm_file && (page_mapping(page) != NULL)) {
>> +        /* file backed clean cow: ignore */
> 
> Probably better to describe 'why' it can be ignored here.

ok.

>> +        page_cache_release(page);
>> +        page = NULL;
>> +    }
>> +
>> +    return page;
>> +}
>> +
>> +/**
>> + * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
>> + * @ctx - checkpoint context
>> + * @pgarr - page-array to fill
>> + * @vma - vma to scan
>> + * @start - start address (updated)
>> + *
>> + * Returns the number of pages collected
>> + */
>> +static int
>> +cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
>> +              struct vm_area_struct *vma, unsigned long *start)
> 
> This is sorta nasty because you shouldn't need to call into this routine
> with a container.  It should be able to enqueue the (vaddr, page) tuple
> directly on the ctx.  Doing so would also abstract out the pgarr
> management at this level and make the code a lot simpler.
> 

Yes, @pgarr can be abstracted inside here.

>> +{
>> +    unsigned long end = vma->vm_end;
>> +    unsigned long addr = *start;
>> +    int orig_used = pgarr->nr_used;
>> +
>> +    /* this function is only for private memory (anon or file-mapped) */
>> +    BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
>> +
>> +    while (addr < end) {
>> +        struct page *page;
>> +
>> +        page = cr_private_follow_page(vma, addr);
>> +        if (IS_ERR(page))
>> +            return PTR_ERR(page);
>> +
>> +        if (page) {
>> +            pgarr->pages[pgarr->nr_used] = page;
>> +            pgarr->vaddrs[pgarr->nr_used] = addr;
>> +            pgarr->nr_used++;
> 
> Should be something like:
> 
> ret = cr_ctx_append_page(ctx, addr, page);
> if (ret < 0)
>   goto out;

My concern here is performance: keeping track of @pgarr avoids the
reference through ctx. We may loop over MBs of memory, tens of
thousands of pages, in individual VMAs.
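
For comparison, a rough user-space sketch of what an append helper along the lines Mike suggests might look like. It assumes the context caches a pointer to the array being filled, so the common path is one extra load and the allocation path runs once per full array. The structure layout, the "cur" field and the capacity of 512 are assumptions, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define PGARR_TOTAL 512	/* assumed capacity of one page-array */

/* Simplified stand-in for struct cr_pgarr. */
struct pgarr_model {
	struct pgarr_model *next;
	void *pages[PGARR_TOTAL];
	unsigned long vaddrs[PGARR_TOTAL];
	unsigned int nr_used;
};

/* Simplified stand-in for the checkpoint context. */
struct ctx_model {
	struct pgarr_model *head;	/* page-array chain */
	struct pgarr_model *cur;	/* cached array currently being filled */
};

/* Append one (vaddr, page) tuple; returns 0 or -ENOMEM. */
static int ctx_append_page(struct ctx_model *ctx, unsigned long addr,
			   void *page)
{
	struct pgarr_model *p;

	assert(ctx != NULL);
	p = ctx->cur;
	if (!p || p->nr_used == PGARR_TOTAL) {
		/* slow path: taken once per PGARR_TOTAL appended pages */
		p = calloc(1, sizeof(*p));
		if (!p)
			return -ENOMEM;
		p->next = ctx->head;
		ctx->head = p;
		ctx->cur = p;
	}

	/* fast path: store the tuple in the cached array */
	p->pages[p->nr_used] = page;
	p->vaddrs[p->nr_used] = addr;
	p->nr_used++;
	return 0;
}
```

Whether the extra load through ctx is measurable next to the page-table walk done for every address is something only a benchmark could settle.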

>> +        }

[...]

>> +
>> +    buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> 
> __get_free_page()

lol... gonna run an experiment and change this one (Dave's main
argument regarding better "debugability" of kmalloc() doesn't
hold here anyways !).

>> +    if (!buf)
>> +        return -ENOMEM;

[...]

>> +#define CR_BAD_VM_FLAGS  \
>> +    (VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
>> +
>> +    if (vma->vm_flags & CR_BAD_VM_FLAGS) {
>> +        pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
>> +        cr_hbuf_put(ctx, sizeof(*hh));
>> +        return -ENOSYS;
>> +    }
>> +
> 
> The following code should be broken into its own function?  Handling of
> other types of memory will follow and will clutter this guy up.
> 

I deferred this until I add those "other types".

>> +    /* by default assume anon memory */
>> +    vma_type = CR_VMA_ANON;

[...]

>> +    if (vma->vm_file) {
>> +        ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);
> 
> Why is this using a filename, rather than a reference to a file?

Could be a reference to a file, but it isn't strictly necessary (*) and it
won't gain that much.

Not necessary: open files may be shared, and then we _must_ use the same file
pointer. In contrast, a memory mapping only needs _an_ open file.
Won't gain much: because file pointers of mapped regions are usually only
shared in the case of fork() without a following exec().

(*) It is strictly necessary when it comes to handling shared memory.

So I left this optimization for later.

> Shouldn't this use the logic in patch 8/13?

Yes. But need to make sure (especially on the restart side) to consider the
exceptions - e.g. a file in SHMFS used for anonymous shared memory, etc.

So yes, I'll add a FIXME comment there.

> 
>> +        if (ret < 0)
>> +            return ret;
>> +    }

[...]

>> +enum vm_type {
>> +    CR_VMA_ANON = 1,
>> +    CR_VMA_FILE
> 
> We need to figure out what MAP_SHARED | MAP_ANONYMOUS should be exposed
> as in this setup (much in the same way we need to start defining what
> shm mappings look like).  Internally, they are 'file-backed', but to
> userland, they aren't.
> 
> 
> Thoughts?

Eventually we'll have CR_VMA_ANON_SHM, CR_VMA_FILE_SHM, CR_VMA_IPC_SHM,
to identify the vma type. There will also be a flag "skip" that says that
the actual contents of the memory have already been copied earlier. (And,
for completeness, a flag "xfile" which indicates that the referenced
file is unlinked, in the case of CR_VMA_FILE and CR_VMA_FILE_SHM).

It's not a lot of work, only that I'm actually holding back on adding
more features, and focusing on getting this into the -mm tree first. I don't
want to write lots of code and then modify it again and again...

> 
>> +};
>> +
>> +struct cr_hdr_vma {
>> +    __u32 vma_type;
>> +    __u32 _padding;
> 
> Why padding?

For 64 bit architectures. See this thread:
https://lists.linux-foundation.org/pipermail/containers/2008-August/012318.html

Quoting Arnd Bergmann:
  "This structure has an odd multiple of 32-bit members, which means
  that if you put it into a larger structure that also contains
  64-bit members, the larger structure may get different alignment
  on x86-32 and x86-64, which you might want to avoid.
  I can't tell if this is an actual problem here.
  ...
  ...
  In this case, I'm pretty sure that sizeof(cr_hdr_task) on x86-32 is
  different from x86-64, since it will be 32-bit aligned on x86-32."
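
To make the alignment point concrete, here is a small user-space model of the header using <stdint.h> types in place of __u32/__u64. Without the explicit _padding member, the first 64-bit field would sit at offset 4 on x86-32 (where 64-bit integers are 4-byte aligned inside structs) and at offset 8 on x86-64. The struct name and the hdr_vma_init() helper are hypothetical; the compile-time checks are one way to pin the layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of struct cr_hdr_vma with fixed-width user-space types. */
struct hdr_vma_model {
	uint32_t vma_type;
	uint32_t _padding;	/* keeps vm_start at offset 8 on every ABI */
	uint64_t vm_start;
	uint64_t vm_end;
	uint64_t vm_page_prot;
	uint64_t vm_flags;
	uint64_t vm_pgoff;
} __attribute__((aligned(8)));

/* Build fails if the layout ever starts to depend on the ABI. */
_Static_assert(offsetof(struct hdr_vma_model, vm_start) == 8,
	       "vm_start must sit at offset 8");
_Static_assert(sizeof(struct hdr_vma_model) == 48,
	       "header must be 48 bytes on 32-bit and 64-bit builds");

/* Zero the whole header so the padding never carries stale bytes. */
static void hdr_vma_init(struct hdr_vma_model *h, uint32_t type)
{
	assert(h != NULL);
	memset(h, 0, sizeof(*h));
	h->vma_type = type;
}
```

With the layout pinned this way, an image written by a 32-bit kernel has the same header size and field offsets as one written by a 64-bit kernel.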

>> +
>> +    __u64 vm_start;
>> +    __u64 vm_end;
>> +    __u64 vm_page_prot;
>> +    __u64 vm_flags;
>> +    __u64 vm_pgoff;
>> +} __attribute__((aligned(8)));
>> +
>> +struct cr_hdr_pgarr {
>> +    __u64 nr_pages;        /* number of pages to saved */
>> +} __attribute__((aligned(8)));
>> +
>>  #endif /* _CHECKPOINT_CKPT_HDR_H_ */

Oren.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 05/13] Dump memory address space
       [not found]       ` <494A2F94.2090800-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-12-18 15:05         ` Dave Hansen
  2008-12-18 15:54         ` Dave Hansen
  2008-12-18 18:15         ` Mike Waychison
  2 siblings, 0 replies; 133+ messages in thread
From: Dave Hansen @ 2008-12-18 15:05 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

On Thu, 2008-12-18 at 06:10 -0500, Oren Laadan wrote:
> >> +    mutex_lock(&mm->context.lock);
> >> +
> >> +    hh->ldt_entry_size = LDT_ENTRY_SIZE;
> >> +    hh->nldt = mm->context.size;
> >> +
> >> +    cr_debug("nldt %d\n", hh->nldt);
> >> +
> >> +    ret = cr_write_obj(ctx, &h, hh);
> >> +    cr_hbuf_put(ctx, sizeof(*hh));
> >> +    if (ret < 0)
> >> +        goto out;
> >> +
> >> +    ret = cr_kwrite(ctx, mm->context.ldt,
> >> +            mm->context.size * LDT_ENTRY_SIZE);
> > 
> > Do we really want to emit anything under lock?  I realize that this
> > patch goes and does a ton of writes with mmap_sem held for read -- is
> > this ok?
> 
> Because all tasks in the container must be frozen during the checkpoint,
> there is no performance penalty for keeping the locks. Although the object
> should not change in the interim anyways, the locks protect us from, e.g.
> the task unfreezing somehow, or being killed by the OOM killer, or any
> other change incurred from the "outside world" (even future code).
> 
> Put in other words - in the long run it is safer to assume that the
> underlying object may otherwise change.
> 
> (If we want to drop the lock here before cr_kwrite(), we need to copy the
> data to a temporary buffer first. If we also want to drop mmap_sem(), we
> need to be more careful with following the vma's.)
> 
> Do you see a reason not to keep the locks?

Mike, although we're doing writes of the checkpoint file here, the *mm*
access is read-only.  We only really need mmap_sem for write if we're
creating new VMAs, which we only do on restore.  Was there an action
taken on the mm that would require a write that we missed?

Oren, I never considered the locking overhead, either.  The fact that
the processes are frozen is very, very important here.  The code is fine
as it stands because this *is* a very simple way to do it.  But, this
probably deserves a comment. 
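
For reference, the alternative Oren mentions (copying the data to a temporary buffer so the lock can be dropped before the write) could look roughly like the following user-space model. It uses a pthread mutex in place of mm->context.lock and a callback in place of cr_kwrite(); ldt_model, LDT_ENTRY_SIZE_MODEL, count_sink() and dump_ldt_snapshot() are all hypothetical names for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define LDT_ENTRY_SIZE_MODEL 8	/* assumed entry size, illustration only */

/* Stand-in for the LDT part of mm->context. */
struct ldt_model {
	pthread_mutex_t lock;
	void *ldt;
	size_t size;		/* number of entries */
};

/* Stand-in for cr_kwrite(): receives the bytes to emit. */
typedef int (*write_fn)(const void *buf, size_t len);

/* Sample sink that only counts bytes, so the sketch can be checked. */
static size_t bytes_written;

static int count_sink(const void *buf, size_t len)
{
	(void)buf;
	bytes_written += len;
	return 0;
}

/* Snapshot the table under the lock, then write with the lock dropped. */
static int dump_ldt_snapshot(struct ldt_model *m, write_fn write_out)
{
	void *copy = NULL;
	size_t len;
	int ret;

	assert(m != NULL && write_out != NULL);

	pthread_mutex_lock(&m->lock);
	len = m->size * LDT_ENTRY_SIZE_MODEL;
	if (len) {
		copy = malloc(len);
		if (copy)
			memcpy(copy, m->ldt, len);
	}
	pthread_mutex_unlock(&m->lock);

	if (len && !copy)
		return -ENOMEM;

	/* potentially slow I/O happens outside the critical section */
	ret = write_out(copy, len);
	free(copy);
	return ret;
}
```

The cost is an extra allocation and copy per object, which is why holding the lock across the write is the simpler choice while the tasks are frozen anyway.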

-- Dave

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 05/13] Dump memory address space
       [not found]       ` <494A2F94.2090800-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-12-18 15:05         ` Dave Hansen
@ 2008-12-18 15:54         ` Dave Hansen
  2008-12-18 18:15         ` Mike Waychison
  2 siblings, 0 replies; 133+ messages in thread
From: Dave Hansen @ 2008-12-18 15:54 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

On Thu, 2008-12-18 at 06:10 -0500, Oren Laadan wrote:
> >> +    for (i = pgarr->nr_used; i--; /**/)
> >> +        page_cache_release(pgarr->pages[i]);
> > 
> > This is sorta hard to read (and non-intuitive).  Is it easier to do:
> > 
> > for (i = 0; i < pgarr->nr_used; i++)
> >     page_cache_release(pgarr->pages[i]);
> > 
> > It shouldn't matter what order you release the pages in..
> 
> Was meant to avoid a dereference to 'pgarr->nr_used' in the comparison.
> (though I doubt if the performance impact is at all visible)

That's a bit too aggressive an optimization.  You two piqued my
curiosity, so I tried a little experiment with this .c file:

extern void bar(int i);

struct s {
        int *array;
        int size;
};

extern struct s *s;
void foo(void)
{
        int i;
#ifdef OREN
        for (i = s->size; i--; )
#else
        for (i = 0; i < s->size; i++)
#endif
                bar(s->array[i]);
}

for O in "" -O -O1 -O2 -O3 -Os; do
	gcc -DOREN $O -c f1.c -o oren.o;
	gcc $O -c f1.c -o mike.o;
	echo -n Oren:; objdump -d oren.o | grep ret;
	echo -n Mike:; objdump -d mike.o | grep ret;
done

The offset of the final ret is basically the size of the function, so
smaller numbers are better.  One Oren/Mike pair per optimization level,
in the order of the loop above:

Oren:  38:	c3                   	ret    
Mike:  3b:	c3                   	ret    
Oren:  44:	c3                   	ret    
Mike:  36:	c3                   	ret    
Oren:  44:	c3                   	ret    
Mike:  36:	c3                   	ret    
Oren:  43:	c3                   	ret    
Mike:  34:	c3                   	ret    
Oren:  43:	c3                   	ret    
Mike:  34:	c3                   	ret    
Oren:  3a:	c3                   	ret    
Mike:  2a:	c3                   	ret    

gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu3).  In all but the unoptimized
case, Mike's version wins.  Readability and icache footprint, all in one
package!
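
As a side note, the two loop forms are interchangeable here because
releasing pages does not depend on order.  A minimal userspace sketch of
that claim follows.  The names (struct pgarr_sketch, visit_down, visit_up)
are hypothetical and not from the patch, and a running sum stands in for
page_cache_release():

```c
#include <assert.h>

/* Hypothetical stand-in for struct cr_pgarr: only what the loops need. */
struct pgarr_sketch {
	int *pages;
	int nr_used;
};

/* Count-down form from the patch: reads nr_used once, walks backwards. */
static long visit_down(const struct pgarr_sketch *p)
{
	long sum = 0;
	int i;

	for (i = p->nr_used; i--; )
		sum += p->pages[i];
	return sum;
}

/* Count-up form from the review: re-reads nr_used in the comparison. */
static long visit_up(const struct pgarr_sketch *p)
{
	long sum = 0;
	int i;

	for (i = 0; i < p->nr_used; i++)
		sum += p->pages[i];
	return sum;
}

/* Both forms must touch exactly the same set of elements. */
static void check_same(const struct pgarr_sketch *p)
{
	assert(visit_down(p) == visit_up(p));
}
```

The count-down form also terminates correctly when nr_used is 0, since
i-- yields 0 before the decrement.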

-- Dave

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 05/13] Dump memory address space
       [not found]       ` <494A2F94.2090800-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-12-18 15:05         ` Dave Hansen
  2008-12-18 15:54         ` Dave Hansen
@ 2008-12-18 18:15         ` Mike Waychison
  2 siblings, 0 replies; 133+ messages in thread
From: Mike Waychison @ 2008-12-18 18:15 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy, arnd, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linux Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar

Oren Laadan wrote:
> 
> Mike Waychison wrote:
>> Comments below.
> 
> Thanks for the detailed review.
> 
>> Oren Laadan wrote:
>>> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
>>> it will be followed by the file name. Then comes the actual contents,
>>> in one or more chunks: each chunk begins with a header that specifies
>>> how many pages it holds, then the virtual addresses of all the dumped
>>> pages in that chunk, followed by the actual contents of all dumped
>>> pages. A header with zero number of pages marks the end of the contents.
>>> Then comes the next VMA and so on.
>>>
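
To make the layout described above concrete, here is a rough userspace
sketch of the chunk stream that follows one VMA header.  All names
(sk_chunk_hdr, sk_write_chunk, sk_write_end, sk_count_pages) and the
16-byte "page" are assumptions for illustration; the actual structures in
the patch may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE 16	/* tiny "page" so the sketch stays readable */

/* Hypothetical chunk header: page count for this chunk, 0 terminates. */
struct sk_chunk_hdr {
	uint32_t nr_pages;
	uint32_t _padding;
};

/* Append one chunk: header, then all vaddrs, then all page contents. */
static size_t sk_write_chunk(uint8_t *out, const uint64_t *vaddrs,
			     const uint8_t *pages, uint32_t nr)
{
	struct sk_chunk_hdr h = { nr, 0 };
	size_t off = 0;

	assert(nr > 0);	/* a zero-page header is the terminator */
	memcpy(out + off, &h, sizeof(h));
	off += sizeof(h);
	memcpy(out + off, vaddrs, nr * sizeof(*vaddrs));
	off += nr * sizeof(*vaddrs);
	memcpy(out + off, pages, (size_t)nr * SK_PAGE_SIZE);
	off += (size_t)nr * SK_PAGE_SIZE;
	return off;
}

/* Append the header with zero pages that marks the end of the contents. */
static size_t sk_write_end(uint8_t *out)
{
	struct sk_chunk_hdr h = { 0, 0 };

	memcpy(out, &h, sizeof(h));
	return sizeof(h);
}

/* Walk chunks until the terminator; return the total pages seen. */
static uint32_t sk_count_pages(const uint8_t *in)
{
	uint32_t total = 0;

	for (;;) {
		struct sk_chunk_hdr h;

		memcpy(&h, in, sizeof(h));
		if (h.nr_pages == 0)
			return total;
		total += h.nr_pages;
		in += sizeof(h) +
		      h.nr_pages * (sizeof(uint64_t) + SK_PAGE_SIZE);
	}
}
```

Keeping all addresses ahead of all page contents lets a reader learn
where every page of the chunk belongs before it touches the data.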
> 
> [...]
> 
>>> +    mutex_lock(&mm->context.lock);
>>> +
>>> +    hh->ldt_entry_size = LDT_ENTRY_SIZE;
>>> +    hh->nldt = mm->context.size;
>>> +
>>> +    cr_debug("nldt %d\n", hh->nldt);
>>> +
>>> +    ret = cr_write_obj(ctx, &h, hh);
>>> +    cr_hbuf_put(ctx, sizeof(*hh));
>>> +    if (ret < 0)
>>> +        goto out;
>>> +
>>> +    ret = cr_kwrite(ctx, mm->context.ldt,
>>> +            mm->context.size * LDT_ENTRY_SIZE);
>> Do we really want to emit anything under lock?  I realize that this
>> patch goes and does a ton of writes with mmap_sem held for read -- is
>> this ok?
> 
> Because all tasks in the container must be frozen during the checkpoint,
> there is no performance penalty for keeping the locks. Although the object
> should not change in the interim anyways, the locks protect us from, e.g.
> the task unfreezing somehow, or being killed by the OOM killer, or any
> other change incurred from the "outside world" (even future code).
> 
> Put in other words - in the long run it is safer to assume that the
> underlying object may otherwise change.
> 
> (If we want to drop the lock here before cr_kwrite(), we need to copy the
> data to a temporary buffer first. If we also want to drop mmap_sem(), we
> need to be more careful with following the vma's.)
> 
> Do you see a reason not to keep the locks?
> 

I just thought it was a bit ugly, but I can't think of a specific case
where it's going to cause us harm.  If tasks are frozen, are they still
subject to the OOM killer?  Even that should be reasonably OK, considering
that the exit path requires a down_read(mmap_sem) (at least, it used to;
I haven't gone over that path in a while).
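
For reference, the temporary-buffer alternative mentioned above could look
roughly like this.  It is a userspace sketch that uses a pthread mutex in
place of mm->context.lock; struct ldt_sketch and ldt_snapshot() are
hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the LDT state guarded by mm->context.lock. */
struct ldt_sketch {
	pthread_mutex_t lock;
	unsigned char *entries;
	size_t size;		/* bytes currently valid in entries */
};

/*
 * Copy the protected data while holding the lock, then drop the lock.
 * The caller can write the returned buffer out without holding anything,
 * at the cost of one extra allocation and copy.  Returns NULL on failure;
 * the caller frees the buffer.
 */
static unsigned char *ldt_snapshot(struct ldt_sketch *ldt, size_t *len)
{
	unsigned char *copy;

	assert(ldt != NULL && len != NULL);

	pthread_mutex_lock(&ldt->lock);
	*len = ldt->size;
	copy = malloc(*len ? *len : 1);
	if (copy && *len)
		memcpy(copy, ldt->entries, *len);
	pthread_mutex_unlock(&ldt->lock);

	return copy;
}
```

The trade-off is the one already stated: the write happens outside the
lock, but the data is copied twice.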



>>> +/* allocate a single page-array object */
>>> +static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
>>> +{
>>> +    struct cr_pgarr *pgarr;
>>> +
>>> +    pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
>>> +    if (!pgarr)
>>> +        return NULL;
>>> +
>>> +    pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
>> You used PAGE_SIZE / sizeof(void *) above.   Why not __get_free_page()?
> 
> Hahaha .. well, it's a guaranteed method to keep Dave Hansen from
> barking about not using kmalloc ...
> 
> Personally I prefer __get_free_page() here, but not enough to keep
> arguing with him. Let me know when the two of you settle it :)

Alright, I just wasn't sure if it had been considered.



> 
>>> +{
>>> +    unsigned long end = vma->vm_end;
>>> +    unsigned long addr = *start;
>>> +    int orig_used = pgarr->nr_used;
>>> +
>>> +    /* this function is only for private memory (anon or file-mapped) */
>>> +    BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
>>> +
>>> +    while (addr < end) {
>>> +        struct page *page;
>>> +
>>> +        page = cr_private_follow_page(vma, addr);
>>> +        if (IS_ERR(page))
>>> +            return PTR_ERR(page);
>>> +
>>> +        if (page) {
>>> +            pgarr->pages[pgarr->nr_used] = page;
>>> +            pgarr->vaddrs[pgarr->nr_used] = addr;
>>> +            pgarr->nr_used++;
>> Should be something like:
>>
>> ret = cr_ctx_append_page(ctx, addr, page);
>> if (ret < 0)
>>   goto out;
> 
> My concern here is performance: keeping track of @pgarr avoids the
> reference through ctx. We may loop over MBs of memory, tens of
> thousands of pages, in individual VMAs.
> 

Even scanning over a large amount of memory, you aren't going to see a
performance difference between accessing pgarr from an argument and
accessing it off of a field in ctx, which is going to be cache-hot.
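
A rough sketch of such a helper, with made-up types (sk_ctx, sk_pgarr) and
a tiny fixed capacity.  No cr_ctx_append_page() exists in the patch; the
name only comes from the suggestion above, and the real ctx presumably
tracks more than one page array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SK_PGARR_TOTAL 4	/* tiny capacity for the sketch */

/* Hypothetical page array: parallel arrays of pages and their addresses. */
struct sk_pgarr {
	void *pages[SK_PGARR_TOTAL];
	unsigned long vaddrs[SK_PGARR_TOTAL];
	int nr_used;
};

/* Hypothetical context that owns the page array. */
struct sk_ctx {
	struct sk_pgarr pgarr;
};

/* Record one (addr, page) pair; returns -ENOSPC when the array is full. */
static int sk_ctx_append_page(struct sk_ctx *ctx, unsigned long addr,
			      void *page)
{
	struct sk_pgarr *pgarr;

	assert(ctx != NULL);
	pgarr = &ctx->pgarr;

	if (pgarr->nr_used >= SK_PGARR_TOTAL)
		return -ENOSPC;

	pgarr->pages[pgarr->nr_used] = page;
	pgarr->vaddrs[pgarr->nr_used] = addr;
	pgarr->nr_used++;
	return 0;
}
```

The caller's loop then only follows pages and checks the return value,
and the bookkeeping for the two parallel arrays lives in one place.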

> 
>>> +    if (vma->vm_file) {
>>> +        ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);
>> Why is this using a filename, rather than a reference to a file?
> 
> Could be a reference to a file, but it isn't strictly necessary (*) and it
> won't gain that much.
> 
> Not necessary: open files may be shared and then we _must_ use the same file
> pointer. In contrast, a memory mapping only needs _an_ open file.
> Won't gain much: because file pointers of mapped regions are usually only
> shared in the case of fork() without a following exec().
> 
> (*) It is strictly necessary when it comes to handling shared memory.
> 
> So I left this optimization for later.

I'm not sure I'm comfortable with churning on the file-format too much.

> 
>> Shouldn't this use the logic in patch 8/13?
> 
> Yes. But need to make sure (especially on the restart side) to consider the
> exceptions - e.g. a file in SHMFS used for anonymous shared memory, etc.
> 
> So yes, I'll add a FIXME comment there.

Right.

> 
>>> +        if (ret < 0)
>>> +            return ret;
>>> +    }
> 
> [...]
> 
>>> +enum vm_type {
>>> +    CR_VMA_ANON = 1,
>>> +    CR_VMA_FILE
>> We need to figure out what MAP_SHARED | MAP_ANONYMOUS should be exposed
>> as in this setup (much in the same way we need to start defining what
>> shm mappings look like).  Internally, they are 'file-backed', but to
>> userland, they aren't.
>>
>>
>> Thoughts?
> 
> Eventually we'll have CR_VMA_ANON_SHM, CR_VMA_FILE_SHM, CR_VMA_IPC_SHM,
> to identify the vma type. There will also be a flag "skip" that says that
> the actual contents of the memory have already been copied earlier. (And,
> for completeness, a flag "xfile" which indicates that the referenced
> file is unlinked, in the case of CR_VMA_FILE and CR_VMA_FILE_SHM).
> 
> It's not a lot of work, only that I'm actually holding back on adding
> more features, and focus on getting this into -mm tree first. I don't
> want to write lots of code and then modify it again and again...
> 
>>> +};
>>> +
>>> +struct cr_hdr_vma {
>>> +    __u32 vma_type;
>>> +    __u32 _padding;
>> Why padding?
> 
> For 64 bit architectures. See this thread:
> https://lists.linux-foundation.org/pipermail/containers/2008-August/012318.html
> 
> Quoting Arnd Bergmann:
>   "This structure has an odd multiple of 32-bit members, which means
>   that if you put it into a larger structure that also contains
>   64-bit members, the larger structure may get different alignment
>   on x86-32 and x86-64, which you might want to avoid.
>   I can't tell if this is an actual problem here.
>   ...
>   ...
>   In this case, I'm pretty sure that sizeof(cr_hdr_task) on x86-32 is
>   different from x86-64, since it will be 32-bit aligned on x86-32."
> 

Ok.  Please add the above note to the structure to explain it to the 
next wanderer reading this code :)
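
A sketch of how such a note and a compile-time check might look.  The
64-bit 'start' member is an assumption added for illustration, since the
rest of struct cr_hdr_vma is not shown here, and sk_hdr_padded is a
hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical header with an odd number of 32-bit members ahead of a
 * 64-bit one.  Without _padding, 'start' could land at offset 4 on
 * x86-32 and offset 8 on x86-64, so the image layout would depend on
 * the architecture that wrote it.
 */
struct sk_hdr_padded {
	uint32_t vma_type;
	uint32_t _padding;	/* explicit: keeps 'start' at offset 8 */
	uint64_t start;
};

/* Fail the build if the compiler inserts any hole of its own. */
static_assert(offsetof(struct sk_hdr_padded, start) == 8,
	      "start must sit at offset 8");
static_assert(sizeof(struct sk_hdr_padded) == 16,
	      "no implicit padding expected");
```

With the padding spelled out, the size is the plain sum of the members,
so 32-bit and 64-bit builds agree on the layout.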

Mike Waychison

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 05/13] Dump memory address space
       [not found]         ` <494A9350.1060309-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2008-12-18 18:21           ` Dave Hansen
  2008-12-18 20:11           ` Oren Laadan
  1 sibling, 0 replies; 133+ messages in thread
From: Dave Hansen @ 2008-12-18 18:21 UTC (permalink / raw)
  To: Mike Waychison
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	H. Peter Anvin, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner, Ingo Molnar

On Thu, 2008-12-18 at 10:15 -0800, Mike Waychison wrote:
> 
> >>> +    pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
> >>> +    if (!pgarr)
> >>> +        return NULL;
> >>> +
> >>> +    pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
> >> You used PAGE_SIZE / sizeof(void *) above.  Why not __get_free_page()?
> > 
> > Hahaha .. well, it's a guaranteed method to keep Dave Hansen from
> > barking about not using kmalloc ...
> > 
> > Personally I prefer __get_free_page() here, but not enough to keep
> > arguing with him. Let me know when the two of you settle it :)
> 
> Alright, I just wasn't sure if it had been considered.

__get_free_page() sucks.  It doesn't do cool stuff like redzoning when
you have slab debugging turned on.  :)

I would personally suggest never using __get_free_page() unless you
truly need a *PAGE*, that is, an aligned, PAGE_SIZE-sized chunk.  If you
don't need alignment, or don't literally need a 'struct page', don't use
it.

-- Dave

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 05/13] Dump memory address space
  2008-12-18 15:54         ` Dave Hansen
                           ` (2 preceding siblings ...)
  (?)
@ 2008-12-18 20:00         ` Oren Laadan
  -1 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-18 20:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar



Dave Hansen wrote:
> On Thu, 2008-12-18 at 06:10 -0500, Oren Laadan wrote:
>>>> +    for (i = pgarr->nr_used; i--; /**/)
>>>> +        page_cache_release(pgarr->pages[i]);
>>> This is sorta hard to read (and non-intuitive).  Is it easier to do:
>>>
>>> for (i = 0; i < pgarr->nr_used; i++)
>>>     page_cache_release(pgarr->pages[i]);
>>>
>>> It shouldn't matter what order you release the pages in..
>> Was meant to avoid a dereference to 'pgarr->nr_used' in the comparison.
>> (though I doubt if the performance impact is at all visible)
> 
> That's a bit too aggressive an optimization.  You two piqued my
> curiosity, so I tried a little experiment with this .c file:
> 
> extern void bar(int i);
> 
> struct s {
>         int *array;
>         int size;
> };
> 
> extern struct s *s;
> void foo(void)
> {
>         int i;
> #ifdef OREN
>         for (i = s->size; i--; )
> #else
>         for (i = 0; i < s->size; i++)
> #endif
>                 bar(s->array[i]);
> }
> 
> for O in "" -O -O1 -O2 -O3 -Os; do
> 	gcc -DOREN $O -c f1.c -o oren.o;
> 	gcc $O -c f1.c -o mike.o;
> 	echo -n Oren:; objdump -d oren.o | grep ret;
> 	echo -n Mike:; objdump -d mike.o | grep ret;
> done

For what it's worth, the idea was to improve time... (not code length).
I changed the code anyway (in response to another comment).

Oren.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [RFC v11][PATCH 05/13] Dump memory address space
       [not found]         ` <494A9350.1060309-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  2008-12-18 18:21           ` Dave Hansen
@ 2008-12-18 20:11           ` Oren Laadan
  1 sibling, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-18 20:11 UTC (permalink / raw)
  To: Mike Waychison
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar



Mike Waychison wrote:
> Oren Laadan wrote:
>>
>> Mike Waychison wrote:
>>> Comments below.
>>
>> Thanks for the detailed review.
>>
>>> Oren Laadan wrote:
>>>> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
>>>> it will be followed by the file name. Then comes the actual contents,
>>>> in one or more chunks: each chunk begins with a header that specifies
>>>> how many pages it holds, then the virtual addresses of all the dumped
>>>> pages in that chunk, followed by the actual contents of all dumped
>>>> pages. A header with zero number of pages marks the end of the
>>>> contents.
>>>> Then comes the next VMA and so on.
>>>>
>>
>> [...]
>>
>>>> +    mutex_lock(&mm->context.lock);
>>>> +
>>>> +    hh->ldt_entry_size = LDT_ENTRY_SIZE;
>>>> +    hh->nldt = mm->context.size;
>>>> +
>>>> +    cr_debug("nldt %d\n", hh->nldt);
>>>> +
>>>> +    ret = cr_write_obj(ctx, &h, hh);
>>>> +    cr_hbuf_put(ctx, sizeof(*hh));
>>>> +    if (ret < 0)
>>>> +        goto out;
>>>> +
>>>> +    ret = cr_kwrite(ctx, mm->context.ldt,
>>>> +            mm->context.size * LDT_ENTRY_SIZE);
>>> Do we really want to emit anything under lock?  I realize that this
>>> patch goes and does a ton of writes with mmap_sem held for read -- is
>>> this ok?
>>
>> Because all tasks in the container must be frozen during the checkpoint,
>> there is no performance penalty for keeping the locks. Although the object
>> should not change in the interim anyway, the locks protect us from, e.g.,
>> the task unfreezing somehow, or being killed by the OOM killer, or any
>> other change incurred from the "outside world" (even future code).
>>
>> Put in other words - in the long run it is safer to assume that the
>> underlying object may otherwise change.
>>
>> (If we want to drop the lock here before cr_kwrite(), we need to copy the
>> data to a temporary buffer first. If we also want to drop mmap_sem, we
>> need to be more careful with following the vma's.)
>>
>> Do you see a reason not to keep the locks?
>>
> 
> I just thought it was a bit ugly, but I can't think of a case
> specifically where it's going to cause us harm.  If tasks are frozen,
> are they still subject to the oom killer?   Even that should be
> reasonably ok considering that the exit-path requires a
> down_read(mmap_sem) (at least, it used to..  I haven't gone over that
> path in a while..).

Exactly: this is safe because we keep the lock. It all boils down to
two points: holding the locks doesn't impair performance or functionality,
and it protects us against existing (if any) and future undesired
interactions with other code.

[...]

Oren.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* [RFC v11][PATCH 00/13] Kernel based checkpoint/restart
@ 2008-12-05 17:31 Oren Laadan
  0 siblings, 0 replies; 133+ messages in thread
From: Oren Laadan @ 2008-12-05 17:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: jeremy-TSDbQ3PG+2Y, arnd-r2nGTMty4D4,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar

Checkpoint-restart (c/r): fixed races in file handling (comments
from Al Viro). Updated and tested against v2.6.28-rc7 (feaf384...)

We'd like these to make it into -mm. This version addresses the
last of the known bugs. Please pull at least the first 11 patches,
as they are similar to before.

Patches 1-11 are stable, providing self- and external- c/r of a
single process.
Patches 12 and 13 are newer, adding support for c/r of multiple
processes.

The git tree tracking v11, branch 'ckpt-v11' (and older versions):
	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

Restarting multiple processes requires 'mktree' userspace tool:
	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

Oren.


--
Why do we want it?  It allows containers to be moved between physical
machines' kernels in the same way that VMWare can move VMs between
physical machines' hypervisors.  There are currently at least two
out-of-tree implementations of this in the commercial world (IBM's
Metacluster and Parallels' OpenVZ/Virtuozzo) and several in the academic
world like Zap.

Why do we need it in mainline now?  Because we already have plenty of
out-of-tree ones, and want to know what an in-tree one will be like.  :)
What *I* want right now is the extra review and scrutiny that comes with
a mainline submission to make sure we're not going in a direction
contrary to the community.

This only supports pretty simple apps.  But, I trust Ingo when he says:
>> > > Generally, if something works for simple apps already (in a robust,
>> > > compatible and supportable way) and users find it "very cool", then
>> > > support for more complex apps is not far in the future.  But if you
>> > > want to support more complex apps straight away, it takes forever and
>> > > gets ugly.

We're *certainly* going to be changing the ABI (which is the format of
the checkpoint).  I'd like to follow the model that we used for
ext4-dev, which is to make it very clear that this is a development-only
feature for now.  Perhaps we do that by making the interface only
available through debugfs or something similar for now.  Or by reserving
the syscall numbers but requiring some runtime switch to be thrown before
they can be used.  I'm open to suggestions here.
--

--
Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Handle multiple namespaces in a container (e.g. save the filesystem
  namespaces state with the file descriptors)
- Security (without CAP_SYS_ADMIN, file restore may fail)

Changelog:

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen,
    life span of the object hash, references to objects in the hash
 
[2008-Nov-26] v10:
  - Grab vfs root of container init, rather than current process
  - Acquire dcache_lock around call to __d_path() in cr_fill_name()
  - Force end-of-string in cr_read_string() (fix possible DoS)
  - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()

[2008-Nov-10] v9:
  - Support multiple processes c/r
  - Extend checkpoint header with architecture-dependent header
  - Misc bug fixes (see individual changelogs)
  - Rebase to v2.6.28-rc3.

[2008-Oct-29] v8:
  - Support "external" checkpoint
  - Include Dave Hansen's 'deny-checkpoint' patch
  - Split docs in Documentation/checkpoint/..., and improve contents

[2008-Oct-17] v7:
  - Fix save/restore state of FPU
  - Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
  - Add assumptions and what's-missing to documentation
  - Misc fixes and cleanups

[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename header files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use standard list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improved handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.

--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.
--

^ permalink raw reply	[flat|nested] 133+ messages in thread

end of thread, other threads:[~2008-12-18 20:11 UTC | newest]

Thread overview: 133+ messages
2008-12-05 17:31 [RFC v11][PATCH 00/13] Kernel based checkpoint/restart Oren Laadan
2008-12-05 17:31 ` Oren Laadan
2008-12-05 17:31 ` Oren Laadan
2008-12-05 17:31 ` [RFC v11][PATCH 01/13] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31 ` [RFC v11][PATCH 02/13] Checkpoint/restart: initial documentation Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31 ` [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-06  7:26   ` Joe Perches
2008-12-06  7:26     ` Joe Perches
2008-12-06  7:26     ` Joe Perches
     [not found]   ` <1228498282-11804-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-12-06  7:26     ` Joe Perches
2008-12-16 19:04     ` Mike Waychison
2008-12-16 21:54     ` Mike Waychison
2008-12-16 19:04   ` Mike Waychison
2008-12-16 19:04     ` Mike Waychison
2008-12-16 19:28     ` Linus Torvalds
2008-12-16 19:28       ` Linus Torvalds
     [not found]     ` <4947FBC8.2000601-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2008-12-16 19:28       ` Linus Torvalds
2008-12-16 21:54   ` Mike Waychison
2008-12-16 21:54     ` Mike Waychison
2008-12-16 22:14     ` Dave Hansen
2008-12-16 22:14       ` Dave Hansen
2008-12-16 22:43       ` Mike Waychison
2008-12-16 22:43       ` Mike Waychison
2008-12-16 22:43         ` Mike Waychison
2008-12-17  0:13         ` Dave Hansen
2008-12-17  0:13           ` Dave Hansen
     [not found]         ` <49482F14.1040407-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2008-12-17  0:13           ` Dave Hansen
2008-12-16 23:42       ` Oren Laadan
2008-12-16 23:42         ` Oren Laadan
     [not found]         ` <49483D01.1050603-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-12-17  0:42           ` Mike Waychison
2008-12-17  0:42         ` Mike Waychison
2008-12-17  0:42           ` Mike Waychison
     [not found]           ` <49484AE2.3000007-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2008-12-17  2:08             ` Oren Laadan
2008-12-17  2:08           ` Oren Laadan
2008-12-17  2:08             ` Oren Laadan
2008-12-16 23:42       ` Oren Laadan
     [not found]     ` <49482394.10006-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2008-12-16 22:14       ` Dave Hansen
2008-12-05 17:31 ` [RFC v11][PATCH 04/13] x86 support for checkpoint/restart Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-17  2:19   ` Mike Waychison
2008-12-17  2:19     ` Mike Waychison
2008-12-17 15:23     ` Oren Laadan
2008-12-17 15:23       ` Oren Laadan
     [not found]     ` <494861CA.8000403-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2008-12-17 15:23       ` Oren Laadan
     [not found]   ` <1228498282-11804-5-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-12-17  2:19     ` Mike Waychison
2008-12-05 17:31 ` [RFC v11][PATCH 05/13] Dump memory address space Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
     [not found]   ` <1228498282-11804-6-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-12-18  2:26     ` Mike Waychison
2008-12-18  2:26   ` Mike Waychison
2008-12-18  2:26     ` Mike Waychison
2008-12-18 11:10     ` Oren Laadan
2008-12-18 11:10       ` Oren Laadan
2008-12-18 15:05       ` Dave Hansen
2008-12-18 15:05         ` Dave Hansen
2008-12-18 15:05         ` Dave Hansen
2008-12-18 15:54       ` Dave Hansen
2008-12-18 15:54         ` Dave Hansen
2008-12-18 15:54         ` Dave Hansen
2008-12-18 20:00         ` Oren Laadan
2008-12-18 20:00           ` Oren Laadan
2008-12-18 20:00         ` Oren Laadan
     [not found]       ` <494A2F94.2090800-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-12-18 15:05         ` Dave Hansen
2008-12-18 15:54         ` Dave Hansen
2008-12-18 18:15         ` Mike Waychison
2008-12-18 18:15       ` Mike Waychison
2008-12-18 18:15         ` Mike Waychison
2008-12-18 18:21         ` Dave Hansen
2008-12-18 18:21           ` Dave Hansen
2008-12-18 18:21           ` Dave Hansen
     [not found]         ` <494A9350.1060309-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2008-12-18 18:21           ` Dave Hansen
2008-12-18 20:11           ` Oren Laadan
2008-12-18 20:11         ` Oren Laadan
2008-12-18 20:11           ` Oren Laadan
     [not found]     ` <4949B4ED.9060805-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2008-12-18 11:10       ` Oren Laadan
2008-12-05 17:31 ` [RFC v11][PATCH 06/13] Restore " Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31 ` [RFC v11][PATCH 07/13] Infrastructure for shared objects Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31 ` [RFC v11][PATCH 08/13] Dump open file descriptors Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31 ` [RFC v11][PATCH 09/13] Restore open file descriprtors Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31 ` [RFC v11][PATCH 10/13] External checkpoint of a task other than ourself Oren Laadan
2008-12-05 17:31 ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
     [not found] ` <1228498282-11804-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-12-05 17:31   ` [RFC v11][PATCH 01/13] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 02/13] Checkpoint/restart: initial documentation Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 03/13] General infrastructure for checkpoint restart Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 04/13] x86 support for checkpoint/restart Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 05/13] Dump memory address space Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 06/13] Restore " Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 07/13] Infrastructure for shared objects Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 08/13] Dump open file descriptors Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 09/13] Restore open file descriprtors Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 10/13] External checkpoint of a task other than ourself Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 11/13] Track in-kernel when we expect checkpoint/restart to work Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 12/13] Checkpoint multiple processes Oren Laadan
2008-12-05 17:31   ` [RFC v11][PATCH 13/13] Restart " Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-06  0:19   ` [RFC v11][PATCH 00/13] Kernel based checkpoint/restart Serge E. Hallyn
2008-12-09 19:42   ` Serge E. Hallyn
2008-12-16 18:43   ` Dave Hansen
2008-12-05 17:31 ` [RFC v11][PATCH 11/13] Track in-kernel when we expect checkpoint/restart to work Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31 ` [RFC v11][PATCH 12/13] Checkpoint multiple processes Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31 ` [RFC v11][PATCH 13/13] Restart " Oren Laadan
2008-12-05 17:31   ` Oren Laadan
2008-12-05 17:31 ` Oren Laadan
2008-12-06  0:19 ` [RFC v11][PATCH 00/13] Kernel based checkpoint/restart Serge E. Hallyn
2008-12-06  0:19   ` Serge E. Hallyn
2008-12-06  0:19   ` Serge E. Hallyn
2008-12-09 19:42 ` Serge E. Hallyn
2008-12-09 19:42   ` Serge E. Hallyn
2008-12-09 19:42   ` Serge E. Hallyn
2008-12-16 18:43 ` Dave Hansen
2008-12-16 18:43   ` Dave Hansen
2008-12-05 17:31 Oren Laadan
