* [PATCH 00/80] Kernel based checkpoint/restart [v18]
@ 2009-09-23 23:50 ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan

Hi Andrew,

This is our recent round of checkpoint/restart patches. It can
checkpoint and restart interactive sessions of 'screen' across 
kernel reboot. Please consider applying to -mm.

Patches 1-17 are clean-ups and preparations for c/r:
 * 1,2,3,4 and 9,10: cleanups, also useful for c/r.
 * 5,6: fix freezer control group
 * 7,8: extend freezer control group for c/r.
 * 11-17: clone_with_pid

Patch 18 reserves the system call slots - please apply so we
don't need to keep changing them.

Patches 19-80 contain the actual c/r code; we've exhausted the
reviewers for most of them.

Patch 32 implements a deferqueue, a mechanism for a process to
defer work to some later time (unlike a workqueue, it is designed
for the work to execute in the context of the same/original process).
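
To illustrate the deferqueue idea described above, here is a minimal
user-space sketch in Python. It is not the kernel code from patch 32:
the class name, the method names and the return convention are all
assumptions made for illustration. It only shows the property the
paragraph stresses, namely that queued work runs later in the context
of the caller that drains the queue, not in a separate worker.

```python
import threading
from collections import deque


class DeferQueue:
    """Toy analogue of a deferqueue (hypothetical API, not the kernel's)."""

    def __init__(self):
        self._items = deque()

    def add(self, func, data):
        # Only record the work item; nothing is executed yet.
        self._items.append((func, data))

    def run(self):
        # Drain the queue in FIFO order, in the caller's own context.
        # Returning the number of executed items is an assumption of
        # this sketch.
        count = 0
        while self._items:
            func, data = self._items.popleft()
            func(data)
            count += 1
        return count


def record_context(log):
    # Work item that notes which thread actually executed it.
    log.append(threading.get_ident())


if __name__ == "__main__":
    log = []
    queue = DeferQueue()
    queue.add(record_context, log)
    queue.run()
    # The deferred work ran in the thread that called run().
    print(log == [threading.get_ident()])
```

A workqueue-style design would hand the item to another thread; here
the item sees the same context as the code that queued it.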

Thanks,

Oren.

----

Application checkpoint/restart (c/r) is the ability to save the state
of a running application so that it can later resume its execution
from the time at which it was checkpointed, on the same or a different
machine.

This version brings many new features, including support for
unix domain sockets, fifos, pseudo-terminals, and signals (see the
detailed changelog below).

With these in place, it can now checkpoint and restart not only batch
jobs, but also interactive programs using 'screen'. For example, users
can checkpoint a 'screen' session with multiple shells, upgrade their
kernel, reboot, and restart their interactive 'screen' session from
before !

This patchset was compiled and tested against v2.6.31. For more
information, check out Documentation/checkpoint/*.txt

Q: How useful is this code as it stands in real-world usage?
A: The application may consist of one or more processes and threads.
   It handles open files (regular files/directories on most file
   systems, pipes, fifos, af_unix sockets,
   /dev/{null,zero,random,urandom} and pseudo-terminals). It supports
   shared memory and sysv IPC (except undo of semaphores). It's
   suitable for many types of batch jobs as well as some interactive
   jobs. (Note: it is assumed that the fs view is available at
   restart).

Q: What can it checkpoint and restart ?
A: A (single threaded) process can checkpoint itself, aka "self"
   checkpoint, if it calls the new system calls. Otherwise, for an
   "external" checkpoint, the caller must first freeze the target
   processes. One can either checkpoint an entire container (and
   we make best effort to ensure that the result is self-contained),
   or merely a subtree of a process hierarchy.

Q: What about namespaces ?
A: Currently, UTS and IPC namespaces are restored. They demonstrate
   how namespaces are handled. More to come.

Q: What additional work needs to be done to it?
A: Fill in the gory details following the examples so far. Current WIP
   includes inet sockets, event-poll, and early work on inotify, mount
   namespace and mount-points, pseudo file systems, and x86_64 support.
   
Q: How can I try it ?
A: Use it for simple batch jobs (pipes, too), or an interactive
   'screen' session, in a whole container or just a subtree of
   tasks:

   create the freezer cgroup:
     $ mount -t cgroup -ofreezer freezer /cgroup
     $ mkdir /cgroup/0
   
   run the test, freeze it:  
     $ test/multitask &
     [1] 2754
     $ for i in `pidof multitask`; do echo $i > /cgroup/0/tasks; done
     $ echo FROZEN > /cgroup/0/freezer.state
   
   checkpoint:
     $ ./ckpt 2754 > ckpt.out
   
   restart:
     $ ./mktree < ckpt.out
   
   voila :)
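
   The freeze step above can also be scripted. The following is a
   minimal sketch in Python, assuming the freezer cgroup directory
   already exists (e.g. /cgroup/0 from the commands above). The
   function names are made up for this example and are not part of
   the user-cr tools.

```python
import tempfile
from pathlib import Path


def freeze_tasks(cgroup_dir, pids):
    """Move pids into a freezer cgroup and request FROZEN (sketch)."""
    cg = Path(cgroup_dir)
    for pid in pids:
        # One pid per write, mirroring the `echo $i > tasks` loop.
        (cg / "tasks").write_text(f"{pid}\n")
    (cg / "freezer.state").write_text("FROZEN\n")


def is_frozen(cgroup_dir):
    # On a real cgroup the state may read FREEZING for a while,
    # so callers should poll until this returns True.
    state = (Path(cgroup_dir) / "freezer.state").read_text()
    return state.strip() == "FROZEN"


if __name__ == "__main__":
    # Dry run against a scratch directory instead of a real cgroup.
    with tempfile.TemporaryDirectory() as scratch:
        freeze_tasks(scratch, [2754])
        print(is_frozen(scratch))
```

   Checking the state before running ./ckpt matters because an
   external checkpoint requires the target tasks to be frozen first.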
   
To do all this, you'll need:

The git tree tracking v18, branch 'ckpt-v18' (and past versions):
	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

The userspace tools are available through the matching branch [v18]:
	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git


Changelog:

[2009-Sep-22] v18

  (new features)
  - [Nathan Lynch] Re-introduce powerpc support
  - Save/restore pseudo-terminals
  - Save/restore (pty) controlling terminals
  - Save/restore PGIDs
  - [Dan Smith] Save/restore unix domain sockets
  - Save/restore FIFOs
  - Save/restore pending signals
  - Save/restore rlimits
  - Save/restore itimers
  - [Matt Helsley] Handle many non-pseudo file-systems

  (other changes)
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - [Nathan Lynch] discard const from struct cred * where appropriate
  - [Serge Hallyn][s390] Set return value for self-checkpoint 
  - Handle kmalloc failure in restore_sem_array()
  - [IPC] Collect files used by shm objects
  - [IPC] Use file (not inode) as shared object on checkpoint of shm
  - More ckpt_write_err()s to give information on checkpoint failure
  - Adjust format of pipe buffer to include the mandatory pre-header
  - [LEAKS] Mark the backing file as visited at checkpoint
  - Tighten checks on supported vma to checkpoint or restart
  - [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
  - Introduce ckpt_collect_file() that also uses file->collect method
  - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
  - Fix leak-detection issue in collect_mm() (test for first-time obj)
  - Invoke set_close_on_exec() unconditionally on restart
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Interface to pass simple pointers as data with deferqueue
  - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
  - Replace EAGAIN with EBUSY where necessary
  - Introduce CKPT_OBJ_VISITED in leak detection
  - ckpt_obj_collect() returns objref for new objects, 0 otherwise
  - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
  - Introduce ckpt_obj_visit() to mark objects as visited
  - Set the CHECKPOINTED flag on objects before calling checkpoint
  - Introduce ckpt_obj_reserve()
  - Change ref_drop() to accept a @lastref argument (for cleanup)
  - Disallow multiple objects with same objref in restart
  - Allow _ckpt_read_obj_type() to read header only (w/o payload)
  - Fix leak of ckpt_ctx when restoring zombie tasks
  - Fix race of prepare_descendant() with an ongoing fork()
  - Track and report the first error if restart fails
  - Tighten logic to protect against bogus pids in input
  - [Matt Helsley] Improve debug output from ckpt_notify_error()
  - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
  - Detect error-headers in input data on restart, and abort.
  - Standard format for checkpoint error strings (and documentation)
  - [Dan Smith] Add an errno validation function
  - Add ckpt_read_payload(): read a variable-length object (no header)
  - Add ckpt_read_string(): same for strings (ensures null-terminated)
  - Add ckpt_read_consume(): consumes next object without processing
  - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile

[2009-Jul-21] v17
  - Introduce syscall clone_with_pids() to restore original pids
  - Support threads and zombies
  - Save/restore task->files
  - Save/restore task->sighand
  - Save/restore futex
  - Save/restore credentials
  - Introduce PF_RESTARTING to skip notifications on task exit
  - restart(2) allow caller to ask to freeze tasks after restart
  - restart(2) isn't idempotent: return -EINTR if interrupted
  - Improve debugging output handling 
  - Make multi-process restart logic more robust and complete
  - Correctly select return value for restarting tasks on success
  - Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
  - Use CHECKPOINTING state for frozen checkpointed tasks
  - Fix compilation without CONFIG_CHECKPOINT
  - Fix compilation with CONFIG_COMPAT
  - Fix headers includes and exports
  - Leak detection performed in two steps
  - Detect "inverse" leaks of objects (dis)appearing unexpectedly
  - Memory: save/restore mm->{flags,def_flags,saved_auxv}
  - Memory: only collect sub-objects of mm once (leak detection)
  - Files: validate f_mode after restore
  - Namespaces: leak detection for nsproxy sub-components
  - Namespaces: proper restart from namespace(s) without namespace(s)
  - Save global constants in header instead of per-object
  - IPC: replace sys_unshare() with create_ipc_ns()
  - IPC: restore objects in suitable namespace
  - IPC: correct behavior under !CONFIG_IPC_NS
  - UTS: save/restore all fields
  - UTS: replace sys_unshare() with create_uts_ns()
  - X86_32: sanitize cpu, debug, and segment registers on restart
  - cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
  - cgroup_freezer: add interface to freeze a cgroup (given a task)

[2009-May-27] v16
  - Privilege checks for IPC checkpoint
  - Fix error string generation during checkpoint
  - Use kzalloc for header allocation
  - Restart blocks are arch-independent
  - Redo pipe c/r using splice
  - Fixes to s390 arch
  - Remove powerpc arch (temporary)
  - Explicitly restore ->nsproxy
  - All objects in image are preceded by 'struct ckpt_hdr'
  - Fix leaks detection (and leaks)
  - Reorder of patchset
  - Misc bugs and compilation fixes

[2009-Apr-12] v15
  - Minor fixes

[2009-Apr-28] v14
  - Tested against kernel v2.6.30-rc3 on x86_32.
  - Refactor files checkpoint to use f_ops (file operations)
  - Refactor mm/vma to use vma_ops
  - Explicitly handle VDSO vma (and require compat mode)
  - Added code to c/r restart-blocks (restart timeout related syscalls)
  - Added code to c/r namespaces: uts, ipc (with Dan Smith)
  - Added code to c/r sysvipc (shm, msg, sem)
  - Support for VM_CLONE shared memory
  - Added resource leak detection for whole-container checkpoint
  - Added sysctl gauge to allow unprivileged restart/checkpoint
  - Improve and simplify the code and logic of shared objects
  - Rework image format: shared objects appear prior to their use
  - Merge checkpoint and restart functionality into same files
  - Massive renaming of functions: prefix "ckpt_" for generics,
    "checkpoint_" for checkpoint, and "restore_" for restart.
  - Report checkpoint errors as a valid (string record) in the output
  - Merged PPC architecture (by Nathan Lynch)
  - Requires updates to userspace tools too.
  - Misc nits and bug fixes

[2009-Mar-31] v14-rc2
  - Change along Dave's suggestion to use f_ops->checkpoint() for files
  - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
  - Merge support for PPC arch (Nathan Lynch)
  - Misc cleanups and fixes in response to comments

[2009-Mar-20] v14-rc1:
  - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
  - Check whether calls to cr_hbuf_get() succeed or fail.
  - Fixes to pipe c/r code
  - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
  - Refuse non-self checkpoint if a task isn't frozen
  - Use unsigned fields in checkpoint headers unless otherwise required
  - Rename functions in files c/r to better reflect their role
  - Add support for anonymous shared memory
  - Merge support for s390 arch (Dan Smith, Serge Hallyn)
    
[2008-Dec-03] v13:
  - Cleanups of 'struct cr_ctx' - remove unused fields
  - Misc fixes for comments
  
[2008-Dec-17] v12:
  - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
    (empty pgarr are saved in a separate pool chain)
  - Add a couple of missed calls to cr_hbuf_put()
  - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
  - Split cr_write/cr_read() to two parts: _cr_write/read() helper
  - Befriend with sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen,
    life span of the object hash, references to objects in the hash
 
[2008-Nov-26] v10:
  - Grab vfs root of container init, rather than current process
  - Acquire dcache_lock around call to __d_path() in cr_fill_name()
  - Force end-of-string in cr_read_string() (fix possible DoS)
  - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()

[2008-Nov-10] v9:
  - Support multiple processes c/r
  - Extend checkpoint header with architecture dependent header
  - Misc bug fixes (see individual changelogs)
  - Rebase to v2.6.28-rc3.

[2008-Oct-29] v8:
  - Support "external" checkpoint
  - Include Dave Hansen's 'deny-checkpoint' patch
  - Split docs in Documentation/checkpoint/..., and improve contents

[2008-Oct-17] v7:
  - Fix save/restore state of FPU
  - Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
  - Add assumptions and what's-missing to documentation
  - Misc fixes and cleanups

[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename headers files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use standard list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.

--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.
--
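
The text above describes serializing state and writing it out to a
file descriptor, and the v16 changelog notes that every object in the
image is preceded by a 'struct ckpt_hdr'. A minimal sketch of such a
header-prefixed record stream in Python follows. The header layout
used here (two little-endian 32-bit fields: record type and total
length) and the record type numbers are assumptions for illustration,
not the actual image format.

```python
import io
import struct

# Hypothetical header: record type, total length (header + payload).
HDR = struct.Struct("<II")


def write_record(out, rec_type, payload):
    """Write one record: fixed-size header first, then the payload."""
    out.write(HDR.pack(rec_type, HDR.size + len(payload)))
    out.write(payload)


def read_records(inp):
    """Read records until end of stream; reject truncated input."""
    records = []
    while True:
        raw = inp.read(HDR.size)
        if not raw:
            return records
        if len(raw) != HDR.size:
            raise ValueError("truncated header")
        rec_type, total = HDR.unpack(raw)
        if total < HDR.size:
            raise ValueError("bogus record length")
        payload = inp.read(total - HDR.size)
        if len(payload) != total - HDR.size:
            raise ValueError("truncated payload")
        records.append((rec_type, payload))


if __name__ == "__main__":
    image = io.BytesIO()
    write_record(image, 1, b"task state")
    write_record(image, 2, b"memory contents")
    image.seek(0)
    for rec_type, payload in read_records(image):
        print(rec_type, len(payload))
```

Because each record carries its own length, a reader can skip record
types it does not understand, which is one way a userspace conversion
program could cope with format changes between kernel revisions.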

^ permalink raw reply	[flat|nested] 327+ messages in thread

* [PATCH 00/80] Kernel based checkpoint/restart [v18]
@ 2009-09-23 23:50 ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan

Hi Andrew,

This is our recent round of checkpoint/restart patches. It can
checkpoint and restart interactive sessions of 'screen' across 
kernel reboot. Please consider applying to -mm.

Patches 1-17 are clean-ups and preparations for c/r:
 * 1,2,3,4 and 9,10: cleanups, also useful for c/r.
 * 5,6: fix freezer control group
 * 7,8: extend freezer control group for c/r.
 * 11-17: clone_with_pid

Patch 18 reserves the system calls slots - please apply so we
don't need to keep changing them.

Patches 19-80 contain the actual c/r code; we've exhausted the
reviewers for most of them.

Patch 32 implements a deferqueue - mechanism for a process to
defer work for some later time (unlike workqueue, designed for
the work to execute in the context of same/original process).

Thanks,

Oren.

----

Application checkpoint/restart (c/r) is the ability to save the state
of a running application so that it can later resume its execution
from the time at which it was checkpointed, on the same or a different
machine.

This version brings support many new features, including support for
unix domain sockets, fifos, pseudo-terminals, and signals (see the
detailed changelog below).

With these in place, it can now checkpoint and restart not only batch
jobs, but also interactive programs using 'screen'. For example, users
can checkpoint a 'screen' session with multiple shells, upgrade their
kernel, reboot, and restart their interactive 'screen' session from
before !

This patchset was compiled and tested against v2.6.31. For more
information, check out Documentation/checkpoint/*.txt

Q: How useful is this code as it stands in real-world usage?
A: The application can be single- or multi-processes and threads. It
   handles open files (regular files/directories on most file systems,
   pipes, fifos, af_unix sockets, /dev/{null,zero,random,urandom} and
   pseudo-terminals. It supports shared memory. sysv IPC (except undo
   of sempahores). It's suitable for many types of batch jobs as well
   as some interactive jobs. (Note: it is assumed that the fs view is
   available at restart).

Q: What can it checkpoint and restart ?
A: A (single threaded) process can checkpoint itself, aka "self"
   checkpoint, if it calls the new system calls. Otherise, for an
   "external" checkpoint, the caller must first freeze the target
   processes. One can either checkpoint an entire container (and
   we make best effort to ensure that the result is self-contained),
   or merely a subtree of a process hierarchy.

Q: What about namespaces ?
A: Currrently, UTS and IPC namespaces are restored. They demonstrate
   how namespaces are handled. More to come.

Q: What additional work needs to be done to it?
A: Fill in the gory details following the examples so far. Current WIP
   includes inet sockets, event-poll, and early work on inotify, mount
   namespace and mount-points, pseudo file systems, and x86_64 support.
   
Q: How can I try it ?
A: Use it for simple batch jobs (pipes, too), or an interactive
   'screen' session, in a whole container or just a subtree of
   tasks:

   create the freezer cgroup:
     $ mount -t cgroup -ofreezer freezer /cgroup
     $ mkdir /cgroup/0
   
   run the test, freeze it:  
     $ test/multitask &
     [1] 2754
     $ for i in `pidof multitask`; do echo $i > /cgroup/0/tasks; done
     $ echo FROZEN > /cgruop/0/freezer.state
   
   checkpoint:
     $ ./ckpt 2754 > ckpt.out
   
   restart:
     $ ./mktree < ckpt.out
   
   voila :)
   
To do all this, you'll need:

The git tree tracking v18, branch 'ckpt-v18' (and past versions):
	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

The userspace tools are available through the matching branch [v18]:
	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git


Changelog:

[2009-Sep-22] v18

  (new features)
  - [Nathan Lynch] Re-introduce powerpc support
  - Save/restore pseudo-terminals
  - Save/restore (pty) controlling terminals
  - Save/restore restore PGIDs
  - [Dan Smith] Save/restore unix domain sockets
  - Save/restore FIFOs
  - Save/restore pending signals
  - Save/restore rlimits
  - Save/restore itimers
  - [Matt Helsley] Handle many non-pseudo file-systems

  (other changes)
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - [Nathan Lynch] discard const from struct cred * where appropriate
  - [Serge Hallyn][s390] Set return value for self-checkpoint 
  - Handle kmalloc failure in restore_sem_array()
  - [IPC] Collect files used by shm objects
  - [IPC] Use file (not inode) as shared object on checkpoint of shm
  - More ckpt_write_err()s to give information on checkpoint failure
  - Adjust format of pipe buffer to include the mandatory pre-header
  - [LEAKS] Mark the backing file as visited at chekcpoint
  - Tighten checks on supported vma to checkpoint or restart
  - [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
  - Introduce ckpt_collect_file() that also uses file->collect method
  - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
  - Fix leak-detection issue in collect_mm() (test for first-time obj)
  - Invoke set_close_on_exec() unconditionally on restart
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Interface to pass simple pointers as data with deferqueue
  - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
  - Replace EAGAIN with EBUSY where necessary
  - Introduce CKPT_OBJ_VISITED in leak detection
  - ckpt_obj_collect() returns objref for new objects, 0 otherwise
  - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
  - Introduce ckpt_obj_visit() to mark objects as visited
  - Set the CHECKPOINTED flag on objects before calling checkpoint
  - Introduce ckpt_obj_reserve()
  - Change ref_drop() to accept a @lastref argument (for cleanup)
  - Disallow multiple objects with same objref in restart
  - Allow _ckpt_read_obj_type() to read header only (w/o payload)
  - Fix leak of ckpt_ctx when restoring zombie tasks
  - Fix race of prepare_descendant() with an ongoing fork()
  - Track and report the first error if restart fails
  - Tighten logic to protect against bogus pids in input
  - [Matt Helsley] Improve debug output from ckpt_notify_error()
  - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
  - Detect error-headers in input data on restart, and abort.
  - Standard format for checkpoint error strings (and documentation)
  - [Dan Smith] Add an errno validation function
  - Add ckpt_read_payload(): read a variable-length object (no header)
  - Add ckpt_read_string(): same for strings (ensures null-terminated)
  - Add ckpt_read_consume(): consumes next object without processing
  - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile

[2009-Jul-21] v17
  - Introduce syscall clone_with_pids() to restore original pids
  - Support threads and zombies
  - Save/restore task->files
  - Save/restore task->sighand
  - Save/restore futex
  - Save/restore credentials
  - Introduce PF_RESTARTING to skip notifications on task exit
  - restart(2) allow caller to ask to freeze tasks after restart
  - restart(2) isn't idempotent: return -EINTR if interrupted
  - Improve debugging output handling 
  - Make multi-process restart logic more robust and complete
  - Correctly select return value for restarting tasks on success
  - Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
  - Use CHECKPOINTING state for frozen checkpointed tasks
  - Fix compilation without CONFIG_CHECKPOINT
  - Fix compilation with CONFIG_COMPAT
  - Fix headers includes and exports
  - Leak detection performed in two steps
  - Detect "inverse" leaks of objects (dis)appearing unexpectedly
  - Memory: save/restore mm->{flags,def_flags,saved_auxv}
  - Memory: only collect sub-objects of mm once (leak detection)
  - Files: validate f_mode after restore
  - Namespaces: leak detection for nsproxy sub-components
  - Namespaces: proper restart from namespace(s) without namespace(s)
  - Save global constants in header instead of per-object
  - IPC: replace sys_unshare() with create_ipc_ns()
  - IPC: restore objects in suitable namespace
  - IPC: correct behavior under !CONFIG_IPC_NS
  - UTS: save/restore all fields
  - UTS: replace sys_unshare() with create_uts_ns()
  - X86_32: sanitize cpu, debug, and segment registers on restart
  - cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
  - cgroup_freezer: add interface to freeze a cgroup (given a task)

[2009-May-27] v16
  - Privilege checks for IPC checkpoint
  - Fix error string generation during checkpoint
  - Use kzalloc for header allocation
  - Restart blocks are arch-independent
  - Redo pipe c/r using splice
  - Fixes to s390 arch
  - Remove powerpc arch (temporary)
  - Explicitly restore ->nsproxy
  - All objects in image are precedeed by 'struct ckpt_hdr'
  - Fix leaks detection (and leaks)
  - Reorder of patchset
  - Misc bugs and compilation fixes

[2009-Apr-12] v15
  - Minor fixes

[2009-Apr-28] v14
  - Tested against kernel v2.6.30-rc3 on x86_32.
  - Refactor files chekpoint to use f_ops (file operations)
  - Refactor mm/vma to use vma_ops
  - Explicitly handle VDSO vma (and require compat mode)
  - Added code to c/r restat-blocks (restart timeout related syscalls)
  - Added code to c/r namespaces: uts, ipc (with Dan Smith)
  - Added code to c/r sysvipc (shm, msg, sem)
  - Support for VM_CLONE shared memory
  - Added resource leak detection for whole-container checkpoint
  - Added sysctl gauge to allow unprivileged restart/checkpoint
  - Improve and simplify the code and logic of shared objects
  - Rework image format: shared objects appear prior to their use
  - Merge checkpoint and restart functionality into same files
  - Massive renaming of functions: prefix "ckpt_" for generics,
    "checkpoint_" for checkpoint, and "restore_" for restart.
  - Report checkpoint errors as a valid (string record) in the output
  - Merged PPC architecture (by Nathan Lunch),
  - Requires updates to userspace tools too.
  - Misc nits and bug fixes

[2009-Mar-31] v14-rc2
  - Change along Dave's suggestion to use f_ops->checkpoint() for files
  - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
  - Merge support for PPC arch (Nathan Lynch)
  - Misc cleanups and fixes in response to comments

[2009-Mar-20] v14-rc1:
  - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
  - Check whether calls to cr_hbuf_get() succeed or fail.
  - Fixed of pipe c/r code
  - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
  - Refuse non-self checkpoint if a task isn't frozen
  - Use unsigned fields in checkpoint headers unless otherwise required
  - Rename functions in files c/r to better reflect their role
  - Add support for anonymous shared memory
  - Merge support for s390 arch (Dan Smith, Serge Hallyn)
    
[2008-Dec-03] v13:
  - Cleanups of 'struct cr_ctx' - remove unused fields
  - Misc fixes for comments
  
[2008-Dec-17] v12:
  - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
    (empty pgarr are saves in a separate pool chain)
  - Add a couple of missed calls to cr_hbuf_put()
  - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
  - Split cr_write/cr_read() to two parts: _cr_write/read() helper
  - Befriend with sparse: explicit conversion to 'void __user *'
  - Redrefine 'pr_fmt' ind replace cr_debug() with pr_debug()

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't fronen,
    life span of the object hash, references to objects in the hash
 
[2008-Nov-26] v10:
  - Grab vfs root of container init, rather than current process
  - Acquire dcache_lock around call to __d_path() in cr_fill_name()
  - Force end-of-string in cr_read_string() (fix possible DoS)
  - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()

[2008-Nov-10] v9:
  - Support multiple processes c/r
  - Extend checkpoint header with archtiecture dependent header 
  - Misc bug fixes (see individual changelogs)
  - Rebase to v2.6.28-rc3.

[2008-Oct-29] v8:
  - Support "external" checkpoint
  - Include Dave Hansen's 'deny-checkpoint' patch
  - Split docs in Documentation/checkpoint/..., and improve contents

[2008-Oct-17] v7:
  - Fix save/restore state of FPU
  - Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
  - Add assumptions and what's-missing to documentation
  - Misc fixes and cleanups

[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename headers files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use stand list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAPS_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.

--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.
--


^ permalink raw reply	[flat|nested] 327+ messages in thread

* [PATCH v18 01/80] c/r: extend arch_setup_additional_pages()
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 02/80] x86: ptrace debugreg checks rewrite Oren Laadan
                     ` (80 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Alexey Dobriyan, Pavel Emelyanov

From: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Add a "start" argument to request that the vDSO be mapped at a specific
address, and fail the operation if it cannot be mapped there.

This is useful for restart(2) to ensure that the memory layout is restored
exactly as needed.
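
A rough userspace analogy of the "place it here or fail" semantics, as a
sketch only: map_at_or_fail() is a hypothetical helper, not part of this
patch. It passes the requested address to mmap(2) as a hint (no MAP_FIXED),
then double checks the result, much like the patched functions compare the
address they got against "start":

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (illustration only, not from the patch).
 * start == 0: let the kernel choose the address (the execve() case).
 * start != 0: the mapping must land exactly at start, otherwise fail
 *             with -EBUSY (the restart(2) case).
 * On success, stores the mapped address in *out and returns 0.
 */
static int map_at_or_fail(uintptr_t start, size_t len, void **out)
{
	void *addr = mmap((void *)start, len, PROT_READ,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (addr == MAP_FAILED)
		return -errno;

	/* double check that we got what we asked for */
	if (start && (uintptr_t)addr != start) {
		munmap(addr, len);
		return -EBUSY;
	}

	*out = addr;
	return 0;
}
```

Passing 0 keeps the old behaviour, which is why the only existing caller
(load_elf_binary()) is changed to pass 0.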

Changelog[v2]:
  - [ntl] powerpc: vdso build fix (ckpt-v17)

Signed-off-by: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/powerpc/include/asm/elf.h     |    1 +
 arch/powerpc/kernel/vdso.c         |   13 ++++++++++++-
 arch/s390/include/asm/elf.h        |    2 +-
 arch/s390/kernel/vdso.c            |   13 ++++++++++++-
 arch/sh/include/asm/elf.h          |    1 +
 arch/sh/kernel/vsyscall/vsyscall.c |    2 +-
 arch/x86/include/asm/elf.h         |    3 ++-
 arch/x86/vdso/vdso32-setup.c       |    9 +++++++--
 arch/x86/vdso/vma.c                |    9 +++++++--
 fs/binfmt_elf.c                    |    2 +-
 10 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index 014a624..3cef9cf 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -271,6 +271,7 @@ extern int ucache_bsize;
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       unsigned long start,
 				       int uses_interp);
 #define VDSO_AUX_ENT(a,b) NEW_AUX_ENT(a,b);
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index ad06d5c..679b394 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -184,7 +184,8 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	struct page **vdso_pagelist;
@@ -211,6 +212,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vdso_base = VDSO32_MBASE;
 #endif
 
+	/* in case restart(2) mandates a specific location */
+	if (start)
+		vdso_base = start;
+
 	current->mm->context.vdso_base = 0;
 
 	/* vDSO has a problem and was disabled, just don't "enable" it for the
@@ -234,6 +239,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		goto fail_mmapsem;
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && vdso_base != start) {
+		rc = -EBUSY;
+		goto fail_mmapsem;
+	}
+
 	/*
 	 * our vma flags don't have VM_WRITE so by default, the process isn't
 	 * allowed to write those pages.
diff --git a/arch/s390/include/asm/elf.h b/arch/s390/include/asm/elf.h
index 74d0bbb..54235bc 100644
--- a/arch/s390/include/asm/elf.h
+++ b/arch/s390/include/asm/elf.h
@@ -205,6 +205,6 @@ do {									    \
 struct linux_binprm;
 
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
-int arch_setup_additional_pages(struct linux_binprm *, int);
+int arch_setup_additional_pages(struct linux_binprm *, unsigned long, int);
 
 #endif
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 45e1708..c2ee689 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -193,7 +193,8 @@ static void vdso_init_cr5(void)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	struct page **vdso_pagelist;
@@ -224,6 +225,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vdso_pages = vdso32_pages;
 #endif
 
+	/* in case restart(2) mandates a specific location */
+	if (start)
+		vdso_base = start;
+
 	/*
 	 * vDSO has a problem and was disabled, just don't "enable" it for
 	 * the process
@@ -246,6 +251,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		goto out_up;
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && vdso_base != start) {
+		rc = -EINVAL;
+		goto out_up;
+	}
+
 	/*
 	 * our vma flags don't have VM_WRITE so by default, the process
 	 * isn't allowed to write those pages.
diff --git a/arch/sh/include/asm/elf.h b/arch/sh/include/asm/elf.h
index ccb1d93..6c27b1f 100644
--- a/arch/sh/include/asm/elf.h
+++ b/arch/sh/include/asm/elf.h
@@ -202,6 +202,7 @@ do {									\
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       unsigned long start,
 				       int uses_interp);
 
 extern unsigned int vdso_enabled;
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 3f7e415..64c70e5 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -59,7 +59,7 @@ int __init vsyscall_init(void)
 }
 
 /* Setup a VMA at program startup for the vsyscall page */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm, unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 83c1bc8..a4398c8 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -336,9 +336,10 @@ struct linux_binprm;
 
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       unsigned long start,
 				       int uses_interp);
 
-extern int syscall32_setup_pages(struct linux_binprm *, int exstack);
+extern int syscall32_setup_pages(struct linux_binprm *, unsigned long start, int exstack);
 #define compat_arch_setup_additional_pages	syscall32_setup_pages
 
 extern unsigned long arch_randomize_brk(struct mm_struct *mm);
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 58bc00f..5c914b0 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -310,7 +310,8 @@ int __init sysenter_setup(void)
 }
 
 /* Setup a VMA at program startup for the vsyscall page */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
@@ -331,13 +332,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	if (compat)
 		addr = VDSO_HIGH_BASE;
 	else {
-		addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
+		addr = get_unmapped_area(NULL, start, PAGE_SIZE, 0, 0);
 		if (IS_ERR_VALUE(addr)) {
 			ret = addr;
 			goto up_fail;
 		}
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && addr != start)
+		goto up_fail;
+
 	current->mm->context.vdso = (void *)addr;
 
 	if (compat_uses_vma || !compat) {
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 21e1aeb..393b22a 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -99,7 +99,8 @@ static unsigned long vdso_addr(unsigned long start, unsigned len)
 
 /* Setup a VMA at program startup for the vsyscall page.
    Not called for compat tasks */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
@@ -109,13 +110,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		return 0;
 
 	down_write(&mm->mmap_sem);
-	addr = vdso_addr(mm->start_stack, vdso_size);
+	addr = start ? : vdso_addr(mm->start_stack, vdso_size);
 	addr = get_unmapped_area(NULL, addr, vdso_size, 0, 0);
 	if (IS_ERR_VALUE(addr)) {
 		ret = addr;
 		goto up_fail;
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && addr != start)
+		goto up_fail;
+
 	current->mm->context.vdso = (void *)addr;
 
 	ret = install_special_mapping(mm, addr, vdso_size,
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index b7c1603..14a1b3c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -945,7 +945,7 @@ static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs)
 	set_binfmt(&elf_format);
 
 #ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
-	retval = arch_setup_additional_pages(bprm, !!elf_interpreter);
+	retval = arch_setup_additional_pages(bprm, 0, !!elf_interpreter);
 	if (retval < 0) {
 		send_sig(SIGKILL, current, 0);
 		goto out;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 02/80] x86: ptrace debugreg checks rewrite
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
  2009-09-23 23:50   ` [PATCH v18 01/80] c/r: extend arch_setup_additional_pages() Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 03/80] c/r: break out new_user_ns() Oren Laadan
                     ` (79 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Alexey Dobriyan, Pavel Emelyanov

From: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

This is a mess.

Pre-unification x86 code checked that the breakpoint address was
"< TASK_SIZE - 3 (or 7)". This was fine from a security POV, but it
banned valid breakpoints at addresses close to TASK_SIZE. E.g. a 1-byte
breakpoint at TASK_SIZE - 1 should be allowed, but it wasn't.

Then came commit 84929801e14d968caeb84795bfbb88f04283fbd9
("[PATCH] x86_64: TASK_SIZE fixes for compatibility mode processes")
which for some reason touched ptrace as well and made the effective
TASK_SIZE of a 32-bit process depend on IA32_PAGE_OFFSET,
which is not a constant!:

	#define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? 0xc0000000 : 0xFFFFe000)
				   ^^^^^^^
The maximum breakpoint address became dependent on the personality of
the ptracer.

The commit also relaxed the danger zone for 32-bit processes from 8 bytes
to 4, not taking into account that 8-byte-wide breakpoints are possible
even for 32-bit processes. This was fine, however, because 64-bit kernel
addresses are too far from 32-bit ones.

Then came utrace with commit 2047b08be67b70875d8765fc81d34ce28041bec3
("x86: x86 ptrace getreg/putreg merge") which copy-pasted and ifdeffed the
32-bit part of TASK_SIZE_OF(), leaving the 8-byte issue as-is.

So, what does this patch fix?
1) Overly strict logic near the TASK_SIZE boundary -- as long as we don't
   cross TASK_SIZE_MAX, we're fine.
2) Overly clever logic for breakpoints spanning a non-existent kernel
   boundary -- we should only protect against setting breakpoints past
   TASK_SIZE_MAX; the rest is none of the kernel's business. This fixes
   the IA32_PAGE_OFFSET beartrap as well.
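
To make point 1 concrete, here is a small standalone sketch. It is not the
patch's code and the patch's own checker may differ in detail:
EXAMPLE_TASK_SIZE, old_check() and range_check() are made-up names, and the
limit is an arbitrary 3GB value chosen for illustration. It contrasts the
old fixed-margin test with a test that only asks whether the breakpoint's
last byte stays below the limit:

```c
#include <stdbool.h>
#include <stdint.h>

/* Arbitrary illustrative limit, standing in for TASK_SIZE. */
#define EXAMPLE_TASK_SIZE UINT64_C(0xc0000000)

/* Old style: fixed 3-byte margin, regardless of breakpoint length. */
static bool old_check(uint64_t addr)
{
	return addr < EXAMPLE_TASK_SIZE - 3;
}

/* Range style: [addr, addr + len) must not cross the limit. */
static bool range_check(uint64_t addr, unsigned int len)
{
	if (len == 0 || len > EXAMPLE_TASK_SIZE)
		return false;
	/* written this way so that addr + len cannot overflow */
	return addr <= EXAMPLE_TASK_SIZE - len;
}
```

With these definitions, a 1-byte breakpoint at EXAMPLE_TASK_SIZE - 1 is
rejected by old_check() but accepted by range_check(), while a 4-byte
breakpoint at the same address is rejected by both.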

As a bonus, remove the uberhack and the big comment determining DR7 validity,
and rewrite the check as a clear algorithm where it's obvious what's going on.
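
For readers unfamiliar with the DR7 layout, here is a minimal userspace
sketch of the per-breakpoint decoding that the new checker performs. The
helper names (dr7_rw, dr7_len_bytes, dr7_fields_valid) are hypothetical and
the -1 return is a placeholder for the patch's -EINVAL; only the bit
positions and the three rejection rules are taken from the patch below.

```c
#include <stdint.h>

/* R/W field of breakpoint n (0..3): bits 16 + 4n and 17 + 4n of DR7. */
static unsigned int dr7_rw(uint32_t dr7, int n)
{
	return (dr7 >> (16 + n * 4)) & 0x3;
}

/* LEN field of breakpoint n, translated from its encoding to bytes. */
static unsigned int dr7_len_bytes(uint32_t dr7, int n)
{
	/* Encoding 00: 1 byte, 01: 2 bytes, 10: 8 bytes, 11: 4 bytes. */
	static const unsigned int bytes[4] = { 1, 2, 8, 4 };

	return bytes[(dr7 >> (16 + n * 4 + 2)) & 0x3];
}

/* Returns 0 if all four R/W and LEN fields are acceptable, -1 otherwise. */
static int dr7_fields_valid(uint32_t dr7, int is_32bit)
{
	int n;

	for (n = 0; n < 4; n++) {
		unsigned int rw = dr7_rw(dr7, n);
		unsigned int len = (dr7 >> (16 + n * 4 + 2)) & 0x3;

		if (rw == 0x2)			/* rejected R/W encoding */
			return -1;
		if (rw == 0x0 && len != 0x0)	/* execute needs LEN == 00 */
			return -1;
		if (is_32bit && len == 0x2)	/* 8-byte length, 32-bit task */
			return -1;
	}
	return 0;
}
```

For example, a DR7 value of 0x000d0001 describes breakpoint 0 as a 4-byte
write watchpoint, which passes for both 32-bit and 64-bit tasks.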

Make the DR validity checker suitable for C/R. On restart, DR registers
must be checked the same way they are checked on PTRACE_POKEUSR.

Question 1: TIF_DEBUG can be set even if none of the breakpoints is turned
on; should this be optimized?

Question 2: Breakpoints are allowed to be globally enabled; is this a
security risk?

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 arch/x86/kernel/ptrace.c |  175 +++++++++++++++++++++++++++-------------------
 1 files changed, 103 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 09ecbde..9b4cacf 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -136,11 +136,6 @@ static int set_segment_reg(struct task_struct *task,
 	return 0;
 }
 
-static unsigned long debugreg_addr_limit(struct task_struct *task)
-{
-	return TASK_SIZE - 3;
-}
-
 #else  /* CONFIG_X86_64 */
 
 #define FLAG_MASK		(FLAG_MASK_32 | X86_EFLAGS_NT)
@@ -264,16 +259,6 @@ static int set_segment_reg(struct task_struct *task,
 
 	return 0;
 }
-
-static unsigned long debugreg_addr_limit(struct task_struct *task)
-{
-#ifdef CONFIG_IA32_EMULATION
-	if (test_tsk_thread_flag(task, TIF_IA32))
-		return IA32_PAGE_OFFSET - 3;
-#endif
-	return TASK_SIZE_MAX - 7;
-}
-
 #endif	/* CONFIG_X86_32 */
 
 static unsigned long get_flags(struct task_struct *task)
@@ -481,77 +466,123 @@ static unsigned long ptrace_get_debugreg(struct task_struct *child, int n)
 	return 0;
 }
 
+static int ptrace_check_debugreg(int _32bit,
+				 unsigned long dr0, unsigned long dr1,
+				 unsigned long dr2, unsigned long dr3,
+				 unsigned long dr6, unsigned long dr7)
+{
+	/* Breakpoint type: 00: --x, 01: -w-, 10: undefined, 11: rw- */
+	unsigned int rw[4];
+	/* Breakpoint length: 00: 1 byte, 01: 2 bytes, 10: 8 bytes, 11: 4 bytes */
+	unsigned int len[4];
+	int n;
+
+	if (dr0 >= TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr1 >= TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr2 >= TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr3 >= TASK_SIZE_MAX)
+		return -EINVAL;
+
+	for (n = 0; n < 4; n++) {
+		rw[n] = (dr7 >> (16 + n * 4)) & 0x3;
+		len[n] = (dr7 >> (16 + n * 4 + 2)) & 0x3;
+
+		if (rw[n] == 0x2)
+			return -EINVAL;
+		if (rw[n] == 0x0 && len[n] != 0x0)
+			return -EINVAL;
+		if (_32bit && len[n] == 0x2)
+			return -EINVAL;
+
+		if (len[n] == 0x0)
+			len[n] = 1;
+		else if (len[n] == 0x1)
+			len[n] = 2;
+		else if (len[n] == 0x2)
+			len[n] = 8;
+		else if (len[n] == 0x3)
+			len[n] = 4;
+		/* From now breakpoint length is in bytes. */
+	}
+
+	if (dr6 & ~0xFFFFFFFFUL)
+		return -EINVAL;
+	if (dr7 & ~0xFFFFFFFFUL)
+		return -EINVAL;
+
+	if (dr7 == 0)
+		return 0;
+
+	if (dr0 + len[0] > TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr1 + len[1] > TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr2 + len[2] > TASK_SIZE_MAX)
+		return -EINVAL;
+	if (dr3 + len[3] > TASK_SIZE_MAX)
+		return -EINVAL;
+
+	return 0;
+}
+
 static int ptrace_set_debugreg(struct task_struct *child,
 			       int n, unsigned long data)
 {
-	int i;
+	unsigned long dr0, dr1, dr2, dr3, dr6, dr7;
+	int _32bit;
 
 	if (unlikely(n == 4 || n == 5))
 		return -EIO;
 
-	if (n < 4 && unlikely(data >= debugreg_addr_limit(child)))
-		return -EIO;
-
+	dr0 = child->thread.debugreg0;
+	dr1 = child->thread.debugreg1;
+	dr2 = child->thread.debugreg2;
+	dr3 = child->thread.debugreg3;
+	dr6 = child->thread.debugreg6;
+	dr7 = child->thread.debugreg7;
 	switch (n) {
-	case 0:		child->thread.debugreg0 = data; break;
-	case 1:		child->thread.debugreg1 = data; break;
-	case 2:		child->thread.debugreg2 = data; break;
-	case 3:		child->thread.debugreg3 = data; break;
-
+	case 0:
+		dr0 = data;
+		break;
+	case 1:
+		dr1 = data;
+		break;
+	case 2:
+		dr2 = data;
+		break;
+	case 3:
+		dr3 = data;
+		break;
 	case 6:
-		if ((data & ~0xffffffffUL) != 0)
-			return -EIO;
-		child->thread.debugreg6 = data;
+		dr6 = data;
 		break;
-
 	case 7:
-		/*
-		 * Sanity-check data. Take one half-byte at once with
-		 * check = (val >> (16 + 4*i)) & 0xf. It contains the
-		 * R/Wi and LENi bits; bits 0 and 1 are R/Wi, and bits
-		 * 2 and 3 are LENi. Given a list of invalid values,
-		 * we do mask |= 1 << invalid_value, so that
-		 * (mask >> check) & 1 is a correct test for invalid
-		 * values.
-		 *
-		 * R/Wi contains the type of the breakpoint /
-		 * watchpoint, LENi contains the length of the watched
-		 * data in the watchpoint case.
-		 *
-		 * The invalid values are:
-		 * - LENi == 0x10 (undefined), so mask |= 0x0f00.	[32-bit]
-		 * - R/Wi == 0x10 (break on I/O reads or writes), so
-		 *   mask |= 0x4444.
-		 * - R/Wi == 0x00 && LENi != 0x00, so we have mask |=
-		 *   0x1110.
-		 *
-		 * Finally, mask = 0x0f00 | 0x4444 | 0x1110 == 0x5f54.
-		 *
-		 * See the Intel Manual "System Programming Guide",
-		 * 15.2.4
-		 *
-		 * Note that LENi == 0x10 is defined on x86_64 in long
-		 * mode (i.e. even for 32-bit userspace software, but
-		 * 64-bit kernel), so the x86_64 mask value is 0x5454.
-		 * See the AMD manual no. 24593 (AMD64 System Programming)
-		 */
-#ifdef CONFIG_X86_32
-#define	DR7_MASK	0x5f54
-#else
-#define	DR7_MASK	0x5554
-#endif
-		data &= ~DR_CONTROL_RESERVED;
-		for (i = 0; i < 4; i++)
-			if ((DR7_MASK >> ((data >> (16 + 4*i)) & 0xf)) & 1)
-				return -EIO;
-		child->thread.debugreg7 = data;
-		if (data)
-			set_tsk_thread_flag(child, TIF_DEBUG);
-		else
-			clear_tsk_thread_flag(child, TIF_DEBUG);
+		dr7 = data & ~DR_CONTROL_RESERVED;
 		break;
 	}
 
+	_32bit = (sizeof(unsigned long) == 4);
+#ifdef CONFIG_COMPAT
+	if (test_tsk_thread_flag(child, TIF_IA32))
+		_32bit = 1;
+#endif
+	if (ptrace_check_debugreg(_32bit, dr0, dr1, dr2, dr3, dr6, dr7))
+		return -EIO;
+
+	child->thread.debugreg0 = dr0;
+	child->thread.debugreg1 = dr1;
+	child->thread.debugreg2 = dr2;
+	child->thread.debugreg3 = dr3;
+	child->thread.debugreg6 = dr6;
+	child->thread.debugreg7 = dr7;
+	if (dr7)
+		set_tsk_thread_flag(child, TIF_DEBUG);
+	else
+		clear_tsk_thread_flag(child, TIF_DEBUG);
+
 	return 0;
 }
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 03/80] c/r: break out new_user_ns()
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
  2009-09-23 23:50   ` [PATCH v18 01/80] c/r: extend arch_setup_additional_pages() Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 02/80] x86: ptrace debugreg checks rewrite Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 04/80] c/r: split core function out of some set*{u, g}id functions Oren Laadan
                     ` (78 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

Break out the core function which checks privilege and (if
allowed) creates a new user namespace, with the passed-in
creating user_struct.  Note that a user_namespace, unlike
other namespace pointers, is not stored in the nsproxy.
Rather it is purely a property of user_structs.

This will let us keep the task restore code simpler.
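
The split relies on the kernel's error-pointer convention: the inner
function returns either a valid pointer or an errno encoded as a pointer,
and the int-returning wrapper converts it back. A minimal userspace sketch
of that pattern follows. ERR_PTR, PTR_ERR and IS_ERR are reimplemented here
as simplified stand-ins for the kernel helpers, and example_ns,
example_new_ns() and example_create_ns() are hypothetical names, with an
int flag standing in for the capability check.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error codes occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical object standing in for struct user_namespace. */
struct example_ns {
	int refcount;
};

/* Core function: checks privilege, then allocates; errors travel as pointers. */
static struct example_ns *example_new_ns(int privileged)
{
	struct example_ns *ns;

	if (!privileged)
		return ERR_PTR(-EPERM);

	ns = malloc(sizeof(*ns));
	if (!ns)
		return ERR_PTR(-ENOMEM);

	ns->refcount = 1;
	return ns;
}

/* int-returning wrapper: converts an error pointer back into an errno. */
static int example_create_ns(int privileged)
{
	struct example_ns *ns = example_new_ns(privileged);

	if (IS_ERR(ns))
		return (int)PTR_ERR(ns);

	free(ns);	/* a real caller would keep and use the namespace */
	return 0;
}
```

A caller that needs the object itself, as the restore code presumably does,
can call the pointer-returning function directly instead of going through
the wrapper.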

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 include/linux/user_namespace.h |    8 ++++++
 kernel/user_namespace.c        |   53 ++++++++++++++++++++++++++++------------
 2 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index cc4f453..f6ea75d 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -20,6 +20,8 @@ extern struct user_namespace init_user_ns;
 
 #ifdef CONFIG_USER_NS
 
+struct user_namespace *new_user_ns(struct user_struct *creator,
+				   struct user_struct **newroot);
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
 {
 	if (ns)
@@ -38,6 +40,12 @@ static inline void put_user_ns(struct user_namespace *ns)
 
 #else
 
+static inline struct user_namespace *new_user_ns(struct user_struct *creator,
+				   struct user_struct **newroot)
+{
+	return ERR_PTR(-EINVAL);
+}
+
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
 {
 	return &init_user_ns;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 076c7c8..e624b0f 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -11,15 +11,8 @@
 #include <linux/user_namespace.h>
 #include <linux/cred.h>
 
-/*
- * Create a new user namespace, deriving the creator from the user in the
- * passed credentials, and replacing that user with the new root user for the
- * new namespace.
- *
- * This is called by copy_creds(), which will finish setting the target task's
- * credentials.
- */
-int create_user_ns(struct cred *new)
+static struct user_namespace *_new_user_ns(struct user_struct *creator,
+				   struct user_struct **newroot)
 {
 	struct user_namespace *ns;
 	struct user_struct *root_user;
@@ -27,7 +20,7 @@ int create_user_ns(struct cred *new)
 
 	ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL);
 	if (!ns)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	kref_init(&ns->kref);
 
@@ -38,12 +31,43 @@ int create_user_ns(struct cred *new)
 	root_user = alloc_uid(ns, 0);
 	if (!root_user) {
 		kfree(ns);
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 	}
 
 	/* set the new root user in the credentials under preparation */
-	ns->creator = new->user;
-	new->user = root_user;
+	ns->creator = creator;
+
+	/* alloc_uid() incremented the userns refcount.  Just set it to 1 */
+	kref_set(&ns->kref, 1);
+
+	*newroot = root_user;
+	return ns;
+}
+
+struct user_namespace *new_user_ns(struct user_struct *creator,
+				   struct user_struct **newroot)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+	return _new_user_ns(creator, newroot);
+}
+
+/*
+ * Create a new user namespace, deriving the creator from the user in the
+ * passed credentials, and replacing that user with the new root user for the
+ * new namespace.
+ *
+ * This is called by copy_creds(), which will finish setting the target task's
+ * credentials.
+ */
+int create_user_ns(struct cred *new)
+{
+	struct user_namespace *ns;
+
+	ns = new_user_ns(new->user, &new->user);
+	if (IS_ERR(ns))
+		return PTR_ERR(ns);
+
 	new->uid = new->euid = new->suid = new->fsuid = 0;
 	new->gid = new->egid = new->sgid = new->fsgid = 0;
 	put_group_info(new->group_info);
@@ -54,9 +78,6 @@ int create_user_ns(struct cred *new)
 #endif
 	/* tgcred will be cleared in our caller bc CLONE_THREAD won't be set */
 
-	/* alloc_uid() incremented the userns refcount.  Just set it to 1 */
-	kref_set(&ns->kref, 1);
-
 	return 0;
 }
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 03/80] c/r: break out new_user_ns()
@ 2009-09-23 23:50   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

Break out the core function which checks privilege and (if
allowed) creates a new user namespace, with the passed-in
creating user_struct.  Note that a user_namespace, unlike
other namespace pointers, is not stored in the nsproxy.
Rather it is purely a property of user_structs.

This will let us keep the task restore code simpler.
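
As an illustration only (not part of the patch), the calling convention
can be sketched in userspace. The sketch assumes stand-ins for the
kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers, reduced versions of the
structures, and a flag in place of capable(CAP_SYS_ADMIN); every mock_*
name is hypothetical, and refcounting and freeing are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, heavily reduced versions of the kernel structures. */
struct user_namespace;
struct user_struct {
	struct user_namespace *user_ns;
	unsigned int uid;
};
struct user_namespace {
	struct user_struct *creator;
};

/* Assumption: a plain flag stands in for capable(CAP_SYS_ADMIN). */
static int mock_has_sys_admin = 1;

/* Returns the new namespace, or an errno encoded in the pointer. */
struct user_namespace *mock_new_user_ns(struct user_struct *creator,
					struct user_struct **newroot)
{
	struct user_namespace *ns;
	struct user_struct *root_user;

	assert(newroot != NULL);
	if (!mock_has_sys_admin)
		return ERR_PTR(-EPERM);

	ns = malloc(sizeof(*ns));
	if (!ns)
		return ERR_PTR(-ENOMEM);

	root_user = malloc(sizeof(*root_user));
	if (!root_user) {
		free(ns);
		return ERR_PTR(-ENOMEM);
	}
	root_user->user_ns = ns;
	root_user->uid = 0;

	ns->creator = creator;
	*newroot = root_user;	/* only written on success */
	return ns;
}

/* Caller in the style of create_user_ns(): pointer result to int result. */
int mock_create_user_ns(struct user_struct **user)
{
	struct user_namespace *ns = mock_new_user_ns(*user, user);

	if (IS_ERR(ns))
		return (int)PTR_ERR(ns);
	return 0;
}
```

Because the root user is handed back through an out parameter rather
than written into a struct cred, a caller that has no cred under
construction (such as restore code) could presumably use the same
helper.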

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 include/linux/user_namespace.h |    8 ++++++
 kernel/user_namespace.c        |   53 ++++++++++++++++++++++++++++------------
 2 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index cc4f453..f6ea75d 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -20,6 +20,8 @@ extern struct user_namespace init_user_ns;
 
 #ifdef CONFIG_USER_NS
 
+struct user_namespace *new_user_ns(struct user_struct *creator,
+				   struct user_struct **newroot);
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
 {
 	if (ns)
@@ -38,6 +40,12 @@ static inline void put_user_ns(struct user_namespace *ns)
 
 #else
 
+static inline struct user_namespace *new_user_ns(struct user_struct *creator,
+				   struct user_struct **newroot)
+{
+	return ERR_PTR(-EINVAL);
+}
+
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
 {
 	return &init_user_ns;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 076c7c8..e624b0f 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -11,15 +11,8 @@
 #include <linux/user_namespace.h>
 #include <linux/cred.h>
 
-/*
- * Create a new user namespace, deriving the creator from the user in the
- * passed credentials, and replacing that user with the new root user for the
- * new namespace.
- *
- * This is called by copy_creds(), which will finish setting the target task's
- * credentials.
- */
-int create_user_ns(struct cred *new)
+static struct user_namespace *_new_user_ns(struct user_struct *creator,
+				   struct user_struct **newroot)
 {
 	struct user_namespace *ns;
 	struct user_struct *root_user;
@@ -27,7 +20,7 @@ int create_user_ns(struct cred *new)
 
 	ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL);
 	if (!ns)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	kref_init(&ns->kref);
 
@@ -38,12 +31,43 @@ int create_user_ns(struct cred *new)
 	root_user = alloc_uid(ns, 0);
 	if (!root_user) {
 		kfree(ns);
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 	}
 
 	/* set the new root user in the credentials under preparation */
-	ns->creator = new->user;
-	new->user = root_user;
+	ns->creator = creator;
+
+	/* alloc_uid() incremented the userns refcount.  Just set it to 1 */
+	kref_set(&ns->kref, 1);
+
+	*newroot = root_user;
+	return ns;
+}
+
+struct user_namespace *new_user_ns(struct user_struct *creator,
+				   struct user_struct **newroot)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+	return _new_user_ns(creator, newroot);
+}
+
+/*
+ * Create a new user namespace, deriving the creator from the user in the
+ * passed credentials, and replacing that user with the new root user for the
+ * new namespace.
+ *
+ * This is called by copy_creds(), which will finish setting the target task's
+ * credentials.
+ */
+int create_user_ns(struct cred *new)
+{
+	struct user_namespace *ns;
+
+	ns = new_user_ns(new->user, &new->user);
+	if (IS_ERR(ns))
+		return PTR_ERR(ns);
+
 	new->uid = new->euid = new->suid = new->fsuid = 0;
 	new->gid = new->egid = new->sgid = new->fsgid = 0;
 	put_group_info(new->group_info);
@@ -54,9 +78,6 @@ int create_user_ns(struct cred *new)
 #endif
 	/* tgcred will be cleared in our caller bc CLONE_THREAD won't be set */
 
-	/* alloc_uid() incremented the userns refcount.  Just set it to 1 */
-	kref_set(&ns->kref, 1);
-
 	return 0;
 }
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 04/80] c/r: split core function out of some set*{u,g}id functions
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 03/80] c/r: break out new_user_ns() Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 05/80] cgroup freezer: Fix buggy resume test for tasks frozen with cgroup freezer Oren Laadan
                     ` (77 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

When restarting tasks, we want to be able to change xuid and
xgid in a struct cred, and do so with security checks.  Break
the core functionality of set{fs,res}{u,g}id into cred_setX
which performs the access checks based on current_cred(),
but performs the requested change on a passed-in cred.

This will allow us to securely construct struct creds based
on a checkpoint image, constrained by the caller's permissions,
and apply them to the caller at the end of sys_restart().
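
As an illustration only (not part of the patch), the split can be
sketched in userspace: the permission check reads the caller's current
credentials, the change is written to a separate copy, and the caller
commits or drops that copy. The sketch assumes a reduced credential
struct and a boolean in place of capable(CAP_SETUID), and it leaves out
the LSM hooks and set_user(); every mock_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical reduced credential set: only the four uid fields. */
struct mock_cred {
	uid_t uid, euid, suid, fsuid;
};

#define NO_CHANGE ((uid_t)-1)

/* An unprivileged caller may only pick one of its current r/e/s uids. */
static bool uid_allowed(const struct mock_cred *old, uid_t id)
{
	return id == NO_CHANGE || id == old->uid ||
	       id == old->euid || id == old->suid;
}

/* Checks against 'old' (the caller), writes into 'new_cred' (the copy). */
int mock_cred_setresuid(struct mock_cred *new_cred,
			const struct mock_cred *old, bool privileged,
			uid_t ruid, uid_t euid, uid_t suid)
{
	if (!privileged &&
	    (!uid_allowed(old, ruid) || !uid_allowed(old, euid) ||
	     !uid_allowed(old, suid)))
		return -EPERM;

	if (ruid != NO_CHANGE)
		new_cred->uid = ruid;
	if (euid != NO_CHANGE)
		new_cred->euid = euid;
	if (suid != NO_CHANGE)
		new_cred->suid = suid;
	new_cred->fsuid = new_cred->euid;
	return 0;
}

/* Syscall-style wrapper: copy, modify the copy, then commit or drop it. */
int mock_setresuid(struct mock_cred *cur, bool privileged,
		   uid_t ruid, uid_t euid, uid_t suid)
{
	struct mock_cred new_cred = *cur;	/* stands in for prepare_creds() */
	int ret;

	ret = mock_cred_setresuid(&new_cred, cur, privileged,
				  ruid, euid, suid);
	if (ret == 0)
		*cur = new_cred;		/* stands in for commit_creds() */
	/* on failure the copy is simply dropped, as abort_creds() would */
	return ret;
}
```

In this arrangement a restart path could call the inner helper several
times on a cred it is building and commit only once at the end, while
each call is still bounded by what the caller itself is permitted to do.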

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 include/linux/cred.h |    8 +++
 kernel/cred.c        |  114 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c         |  134 ++++++++------------------------------------------
 3 files changed, 143 insertions(+), 113 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 4fa9996..2ffffbe 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -21,6 +21,9 @@ struct user_struct;
 struct cred;
 struct inode;
 
+/* defined in sys.c, used in cred_setresuid */
+extern int set_user(struct cred *new);
+
 /*
  * COW Supplementary groups list
  */
@@ -344,4 +347,9 @@ do {						\
 	*(_fsgid) = __cred->fsgid;		\
 } while(0)
 
+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid);
+int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid, gid_t sgid);
+int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid);
+int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid);
+
 #endif /* _LINUX_CRED_H */
diff --git a/kernel/cred.c b/kernel/cred.c
index 1bb4d7e..5c8db56 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -589,3 +589,117 @@ int set_create_files_as(struct cred *new, struct inode *inode)
 	return security_kernel_create_files_as(new, inode);
 }
 EXPORT_SYMBOL(set_create_files_as);
+
+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid)
+{
+	int retval;
+	const struct cred *old;
+
+	retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES);
+	if (retval)
+		return retval;
+	old = current_cred();
+
+	if (!capable(CAP_SETUID)) {
+		if (ruid != (uid_t) -1 && ruid != old->uid &&
+		    ruid != old->euid  && ruid != old->suid)
+			return -EPERM;
+		if (euid != (uid_t) -1 && euid != old->uid &&
+		    euid != old->euid  && euid != old->suid)
+			return -EPERM;
+		if (suid != (uid_t) -1 && suid != old->uid &&
+		    suid != old->euid  && suid != old->suid)
+			return -EPERM;
+	}
+
+	if (ruid != (uid_t) -1) {
+		new->uid = ruid;
+		if (ruid != old->uid) {
+			retval = set_user(new);
+			if (retval < 0)
+				return retval;
+		}
+	}
+	if (euid != (uid_t) -1)
+		new->euid = euid;
+	if (suid != (uid_t) -1)
+		new->suid = suid;
+	new->fsuid = new->euid;
+
+	return security_task_fix_setuid(new, old, LSM_SETID_RES);
+}
+
+int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid,
+			gid_t sgid)
+{
+	const struct cred *old = current_cred();
+	int retval;
+
+	retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES);
+	if (retval)
+		return retval;
+
+	if (!capable(CAP_SETGID)) {
+		if (rgid != (gid_t) -1 && rgid != old->gid &&
+		    rgid != old->egid  && rgid != old->sgid)
+			return -EPERM;
+		if (egid != (gid_t) -1 && egid != old->gid &&
+		    egid != old->egid  && egid != old->sgid)
+			return -EPERM;
+		if (sgid != (gid_t) -1 && sgid != old->gid &&
+		    sgid != old->egid  && sgid != old->sgid)
+			return -EPERM;
+	}
+
+	if (rgid != (gid_t) -1)
+		new->gid = rgid;
+	if (egid != (gid_t) -1)
+		new->egid = egid;
+	if (sgid != (gid_t) -1)
+		new->sgid = sgid;
+	new->fsgid = new->egid;
+	return 0;
+}
+
+int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid)
+{
+	const struct cred *old;
+
+	old = current_cred();
+	*old_fsuid = old->fsuid;
+
+	if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS) < 0)
+		return -EPERM;
+
+	if (uid == old->uid  || uid == old->euid  ||
+	    uid == old->suid || uid == old->fsuid ||
+	    capable(CAP_SETUID)) {
+		if (uid != *old_fsuid) {
+			new->fsuid = uid;
+			if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0)
+				return 0;
+		}
+	}
+	return -EPERM;
+}
+
+int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid)
+{
+	const struct cred *old;
+
+	old = current_cred();
+	*old_fsgid = old->fsgid;
+
+	if (security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_FS))
+		return -EPERM;
+
+	if (gid == old->gid  || gid == old->egid  ||
+	    gid == old->sgid || gid == old->fsgid ||
+	    capable(CAP_SETGID)) {
+		if (gid != *old_fsgid) {
+			new->fsgid = gid;
+			return 0;
+		}
+	}
+	return -EPERM;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index b3f1097..da4f9e0 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -559,11 +559,12 @@ error:
 /*
  * change the user struct in a credentials set to match the new UID
  */
-static int set_user(struct cred *new)
+int set_user(struct cred *new)
 {
 	struct user_struct *new_user;
 
-	new_user = alloc_uid(current_user_ns(), new->uid);
+	/* is this ok? */
+	new_user = alloc_uid(new->user->user_ns, new->uid);
 	if (!new_user)
 		return -EAGAIN;
 
@@ -704,14 +705,12 @@ error:
 	return retval;
 }
 
-
 /*
  * This function implements a generic ability to update ruid, euid,
  * and suid.  This allows you to implement the 4.4 compatible seteuid().
  */
 SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid)
 {
-	const struct cred *old;
 	struct cred *new;
 	int retval;
 
@@ -719,45 +718,10 @@ SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid)
 	if (!new)
 		return -ENOMEM;
 
-	retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES);
-	if (retval)
-		goto error;
-	old = current_cred();
-
-	retval = -EPERM;
-	if (!capable(CAP_SETUID)) {
-		if (ruid != (uid_t) -1 && ruid != old->uid &&
-		    ruid != old->euid  && ruid != old->suid)
-			goto error;
-		if (euid != (uid_t) -1 && euid != old->uid &&
-		    euid != old->euid  && euid != old->suid)
-			goto error;
-		if (suid != (uid_t) -1 && suid != old->uid &&
-		    suid != old->euid  && suid != old->suid)
-			goto error;
-	}
-
-	if (ruid != (uid_t) -1) {
-		new->uid = ruid;
-		if (ruid != old->uid) {
-			retval = set_user(new);
-			if (retval < 0)
-				goto error;
-		}
-	}
-	if (euid != (uid_t) -1)
-		new->euid = euid;
-	if (suid != (uid_t) -1)
-		new->suid = suid;
-	new->fsuid = new->euid;
-
-	retval = security_task_fix_setuid(new, old, LSM_SETID_RES);
-	if (retval < 0)
-		goto error;
-
-	return commit_creds(new);
+	retval = cred_setresuid(new, ruid, euid, suid);
+	if (retval == 0)
+		return commit_creds(new);
 
-error:
 	abort_creds(new);
 	return retval;
 }
@@ -779,43 +743,17 @@ SYSCALL_DEFINE3(getresuid, uid_t __user *, ruid, uid_t __user *, euid, uid_t __u
  */
 SYSCALL_DEFINE3(setresgid, gid_t, rgid, gid_t, egid, gid_t, sgid)
 {
-	const struct cred *old;
 	struct cred *new;
 	int retval;
 
 	new = prepare_creds();
 	if (!new)
 		return -ENOMEM;
-	old = current_cred();
 
-	retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES);
-	if (retval)
-		goto error;
+	retval = cred_setresgid(new, rgid, egid, sgid);
+	if (retval == 0)
+		return commit_creds(new);
 
-	retval = -EPERM;
-	if (!capable(CAP_SETGID)) {
-		if (rgid != (gid_t) -1 && rgid != old->gid &&
-		    rgid != old->egid  && rgid != old->sgid)
-			goto error;
-		if (egid != (gid_t) -1 && egid != old->gid &&
-		    egid != old->egid  && egid != old->sgid)
-			goto error;
-		if (sgid != (gid_t) -1 && sgid != old->gid &&
-		    sgid != old->egid  && sgid != old->sgid)
-			goto error;
-	}
-
-	if (rgid != (gid_t) -1)
-		new->gid = rgid;
-	if (egid != (gid_t) -1)
-		new->egid = egid;
-	if (sgid != (gid_t) -1)
-		new->sgid = sgid;
-	new->fsgid = new->egid;
-
-	return commit_creds(new);
-
-error:
 	abort_creds(new);
 	return retval;
 }
@@ -832,7 +770,6 @@ SYSCALL_DEFINE3(getresgid, gid_t __user *, rgid, gid_t __user *, egid, gid_t __u
 	return retval;
 }
 
-
 /*
  * "setfsuid()" sets the fsuid - the uid used for filesystem checks. This
  * is used for "access()" and for the NFS daemon (letting nfsd stay at
@@ -841,35 +778,20 @@ SYSCALL_DEFINE3(getresgid, gid_t __user *, rgid, gid_t __user *, egid, gid_t __u
  */
 SYSCALL_DEFINE1(setfsuid, uid_t, uid)
 {
-	const struct cred *old;
 	struct cred *new;
 	uid_t old_fsuid;
+	int retval;
 
 	new = prepare_creds();
 	if (!new)
 		return current_fsuid();
-	old = current_cred();
-	old_fsuid = old->fsuid;
-
-	if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS) < 0)
-		goto error;
-
-	if (uid == old->uid  || uid == old->euid  ||
-	    uid == old->suid || uid == old->fsuid ||
-	    capable(CAP_SETUID)) {
-		if (uid != old_fsuid) {
-			new->fsuid = uid;
-			if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0)
-				goto change_okay;
-		}
-	}
 
-error:
-	abort_creds(new);
-	return old_fsuid;
+	retval = cred_setfsuid(new, uid, &old_fsuid);
+	if (retval == 0)
+		commit_creds(new);
+	else
+		abort_creds(new);
 
-change_okay:
-	commit_creds(new);
 	return old_fsuid;
 }
 
@@ -878,34 +800,20 @@ change_okay:
  */
 SYSCALL_DEFINE1(setfsgid, gid_t, gid)
 {
-	const struct cred *old;
 	struct cred *new;
 	gid_t old_fsgid;
+	int retval;
 
 	new = prepare_creds();
 	if (!new)
 		return current_fsgid();
-	old = current_cred();
-	old_fsgid = old->fsgid;
-
-	if (security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_FS))
-		goto error;
-
-	if (gid == old->gid  || gid == old->egid  ||
-	    gid == old->sgid || gid == old->fsgid ||
-	    capable(CAP_SETGID)) {
-		if (gid != old_fsgid) {
-			new->fsgid = gid;
-			goto change_okay;
-		}
-	}
 
-error:
-	abort_creds(new);
-	return old_fsgid;
+	retval = cred_setfsgid(new, gid, &old_fsgid);
+	if (retval == 0)
+		commit_creds(new);
+	else
+		abort_creds(new);
 
-change_okay:
-	commit_creds(new);
 	return old_fsgid;
 }
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 04/80] c/r: split core function out of some set*{u,g}id functions
@ 2009-09-23 23:50   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

When restarting tasks, we want to be able to change xuid and
xgid in a struct cred, and do so with security checks.  Break
the core functionality of set{fs,res}{u,g}id into cred_setX
which performs the access checks based on current_cred(),
but performs the requested change on a passed-in cred.

This will allow us to securely construct struct creds based
on a checkpoint image, constrained by the caller's permissions,
and apply them to the caller at the end of sys_restart().

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 include/linux/cred.h |    8 +++
 kernel/cred.c        |  114 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c         |  134 ++++++++------------------------------------------
 3 files changed, 143 insertions(+), 113 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 4fa9996..2ffffbe 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -21,6 +21,9 @@ struct user_struct;
 struct cred;
 struct inode;
 
+/* defined in sys.c, used in cred_setresuid */
+extern int set_user(struct cred *new);
+
 /*
  * COW Supplementary groups list
  */
@@ -344,4 +347,9 @@ do {						\
 	*(_fsgid) = __cred->fsgid;		\
 } while(0)
 
+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid);
+int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid, gid_t sgid);
+int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid);
+int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid);
+
 #endif /* _LINUX_CRED_H */
diff --git a/kernel/cred.c b/kernel/cred.c
index 1bb4d7e..5c8db56 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -589,3 +589,117 @@ int set_create_files_as(struct cred *new, struct inode *inode)
 	return security_kernel_create_files_as(new, inode);
 }
 EXPORT_SYMBOL(set_create_files_as);
+
+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid)
+{
+	int retval;
+	const struct cred *old;
+
+	retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES);
+	if (retval)
+		return retval;
+	old = current_cred();
+
+	if (!capable(CAP_SETUID)) {
+		if (ruid != (uid_t) -1 && ruid != old->uid &&
+		    ruid != old->euid  && ruid != old->suid)
+			return -EPERM;
+		if (euid != (uid_t) -1 && euid != old->uid &&
+		    euid != old->euid  && euid != old->suid)
+			return -EPERM;
+		if (suid != (uid_t) -1 && suid != old->uid &&
+		    suid != old->euid  && suid != old->suid)
+			return -EPERM;
+	}
+
+	if (ruid != (uid_t) -1) {
+		new->uid = ruid;
+		if (ruid != old->uid) {
+			retval = set_user(new);
+			if (retval < 0)
+				return retval;
+		}
+	}
+	if (euid != (uid_t) -1)
+		new->euid = euid;
+	if (suid != (uid_t) -1)
+		new->suid = suid;
+	new->fsuid = new->euid;
+
+	return security_task_fix_setuid(new, old, LSM_SETID_RES);
+}
+
+int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid,
+			gid_t sgid)
+{
+	const struct cred *old = current_cred();
+	int retval;
+
+	retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES);
+	if (retval)
+		return retval;
+
+	if (!capable(CAP_SETGID)) {
+		if (rgid != (gid_t) -1 && rgid != old->gid &&
+		    rgid != old->egid  && rgid != old->sgid)
+			return -EPERM;
+		if (egid != (gid_t) -1 && egid != old->gid &&
+		    egid != old->egid  && egid != old->sgid)
+			return -EPERM;
+		if (sgid != (gid_t) -1 && sgid != old->gid &&
+		    sgid != old->egid  && sgid != old->sgid)
+			return -EPERM;
+	}
+
+	if (rgid != (gid_t) -1)
+		new->gid = rgid;
+	if (egid != (gid_t) -1)
+		new->egid = egid;
+	if (sgid != (gid_t) -1)
+		new->sgid = sgid;
+	new->fsgid = new->egid;
+	return 0;
+}
+
+int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid)
+{
+	const struct cred *old;
+
+	old = current_cred();
+	*old_fsuid = old->fsuid;
+
+	if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS) < 0)
+		return -EPERM;
+
+	if (uid == old->uid  || uid == old->euid  ||
+	    uid == old->suid || uid == old->fsuid ||
+	    capable(CAP_SETUID)) {
+		if (uid != *old_fsuid) {
+			new->fsuid = uid;
+			if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0)
+				return 0;
+		}
+	}
+	return -EPERM;
+}
+
+int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid)
+{
+	const struct cred *old;
+
+	old = current_cred();
+	*old_fsgid = old->fsgid;
+
+	if (security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_FS))
+		return -EPERM;
+
+	if (gid == old->gid  || gid == old->egid  ||
+	    gid == old->sgid || gid == old->fsgid ||
+	    capable(CAP_SETGID)) {
+		if (gid != *old_fsgid) {
+			new->fsgid = gid;
+			return 0;
+		}
+	}
+	return -EPERM;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index b3f1097..da4f9e0 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -559,11 +559,12 @@ error:
 /*
  * change the user struct in a credentials set to match the new UID
  */
-static int set_user(struct cred *new)
+int set_user(struct cred *new)
 {
 	struct user_struct *new_user;
 
-	new_user = alloc_uid(current_user_ns(), new->uid);
+	/* is this ok? */
+	new_user = alloc_uid(new->user->user_ns, new->uid);
 	if (!new_user)
 		return -EAGAIN;
 
@@ -704,14 +705,12 @@ error:
 	return retval;
 }
 
-
 /*
  * This function implements a generic ability to update ruid, euid,
  * and suid.  This allows you to implement the 4.4 compatible seteuid().
  */
 SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid)
 {
-	const struct cred *old;
 	struct cred *new;
 	int retval;
 
@@ -719,45 +718,10 @@ SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid)
 	if (!new)
 		return -ENOMEM;
 
-	retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES);
-	if (retval)
-		goto error;
-	old = current_cred();
-
-	retval = -EPERM;
-	if (!capable(CAP_SETUID)) {
-		if (ruid != (uid_t) -1 && ruid != old->uid &&
-		    ruid != old->euid  && ruid != old->suid)
-			goto error;
-		if (euid != (uid_t) -1 && euid != old->uid &&
-		    euid != old->euid  && euid != old->suid)
-			goto error;
-		if (suid != (uid_t) -1 && suid != old->uid &&
-		    suid != old->euid  && suid != old->suid)
-			goto error;
-	}
-
-	if (ruid != (uid_t) -1) {
-		new->uid = ruid;
-		if (ruid != old->uid) {
-			retval = set_user(new);
-			if (retval < 0)
-				goto error;
-		}
-	}
-	if (euid != (uid_t) -1)
-		new->euid = euid;
-	if (suid != (uid_t) -1)
-		new->suid = suid;
-	new->fsuid = new->euid;
-
-	retval = security_task_fix_setuid(new, old, LSM_SETID_RES);
-	if (retval < 0)
-		goto error;
-
-	return commit_creds(new);
+	retval = cred_setresuid(new, ruid, euid, suid);
+	if (retval == 0)
+		return commit_creds(new);
 
-error:
 	abort_creds(new);
 	return retval;
 }
@@ -779,43 +743,17 @@ SYSCALL_DEFINE3(getresuid, uid_t __user *, ruid, uid_t __user *, euid, uid_t __u
  */
 SYSCALL_DEFINE3(setresgid, gid_t, rgid, gid_t, egid, gid_t, sgid)
 {
-	const struct cred *old;
 	struct cred *new;
 	int retval;
 
 	new = prepare_creds();
 	if (!new)
 		return -ENOMEM;
-	old = current_cred();
 
-	retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES);
-	if (retval)
-		goto error;
+	retval = cred_setresgid(new, rgid, egid, sgid);
+	if (retval == 0)
+		return commit_creds(new);
 
-	retval = -EPERM;
-	if (!capable(CAP_SETGID)) {
-		if (rgid != (gid_t) -1 && rgid != old->gid &&
-		    rgid != old->egid  && rgid != old->sgid)
-			goto error;
-		if (egid != (gid_t) -1 && egid != old->gid &&
-		    egid != old->egid  && egid != old->sgid)
-			goto error;
-		if (sgid != (gid_t) -1 && sgid != old->gid &&
-		    sgid != old->egid  && sgid != old->sgid)
-			goto error;
-	}
-
-	if (rgid != (gid_t) -1)
-		new->gid = rgid;
-	if (egid != (gid_t) -1)
-		new->egid = egid;
-	if (sgid != (gid_t) -1)
-		new->sgid = sgid;
-	new->fsgid = new->egid;
-
-	return commit_creds(new);
-
-error:
 	abort_creds(new);
 	return retval;
 }
@@ -832,7 +770,6 @@ SYSCALL_DEFINE3(getresgid, gid_t __user *, rgid, gid_t __user *, egid, gid_t __u
 	return retval;
 }
 
-
 /*
  * "setfsuid()" sets the fsuid - the uid used for filesystem checks. This
  * is used for "access()" and for the NFS daemon (letting nfsd stay at
@@ -841,35 +778,20 @@ SYSCALL_DEFINE3(getresgid, gid_t __user *, rgid, gid_t __user *, egid, gid_t __u
  */
 SYSCALL_DEFINE1(setfsuid, uid_t, uid)
 {
-	const struct cred *old;
 	struct cred *new;
 	uid_t old_fsuid;
+	int retval;
 
 	new = prepare_creds();
 	if (!new)
 		return current_fsuid();
-	old = current_cred();
-	old_fsuid = old->fsuid;
-
-	if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS) < 0)
-		goto error;
-
-	if (uid == old->uid  || uid == old->euid  ||
-	    uid == old->suid || uid == old->fsuid ||
-	    capable(CAP_SETUID)) {
-		if (uid != old_fsuid) {
-			new->fsuid = uid;
-			if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0)
-				goto change_okay;
-		}
-	}
 
-error:
-	abort_creds(new);
-	return old_fsuid;
+	retval = cred_setfsuid(new, uid, &old_fsuid);
+	if (retval == 0)
+		commit_creds(new);
+	else
+		abort_creds(new);
 
-change_okay:
-	commit_creds(new);
 	return old_fsuid;
 }
 
@@ -878,34 +800,20 @@ change_okay:
  */
 SYSCALL_DEFINE1(setfsgid, gid_t, gid)
 {
-	const struct cred *old;
 	struct cred *new;
 	gid_t old_fsgid;
+	int retval;
 
 	new = prepare_creds();
 	if (!new)
 		return current_fsgid();
-	old = current_cred();
-	old_fsgid = old->fsgid;
-
-	if (security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_FS))
-		goto error;
-
-	if (gid == old->gid  || gid == old->egid  ||
-	    gid == old->sgid || gid == old->fsgid ||
-	    capable(CAP_SETGID)) {
-		if (gid != old_fsgid) {
-			new->fsgid = gid;
-			goto change_okay;
-		}
-	}
 
-error:
-	abort_creds(new);
-	return old_fsgid;
+	retval = cred_setfsgid(new, gid, &old_fsgid);
+	if (retval == 0)
+		commit_creds(new);
+	else
+		abort_creds(new);
 
-change_okay:
-	commit_creds(new);
 	return old_fsgid;
 }
 
-- 
1.6.0.4

* [PATCH v18 05/80] cgroup freezer: Fix buggy resume test for tasks frozen with cgroup freezer
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (3 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 04/80] c/r: split core function out of some set*{u, g}id functions Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 06/80] cgroup freezer: Update stale locking comments Oren Laadan
                     ` (76 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pavel Machek, linux-api, containers, linux-kernel,
	Rafael J. Wysocki, linux-mm, Linus Torvalds, linux-pm,
	Ingo Molnar, Paul Menage, Pavel Emelyanov

From: Matt Helsley <matthltc@us.ibm.com>

When the cgroup freezer is used to freeze tasks we do not want to thaw
those tasks during resume. Currently we test the cgroup freezer
state of the resuming tasks to see if the cgroup is FROZEN.  If so
then we don't thaw the task. However, the FREEZING state also indicates
that the task should remain frozen.

This also avoids a problem pointed out by Oren Laadan: the freezer state
transition from FREEZING to FROZEN is updated lazily when userspace reads
or writes the freezer.state file in the cgroup filesystem. This means that
resume will thaw tasks in cgroups which should be in the FROZEN state if
there is no read/write of the freezer.state file to trigger this
transition before suspend.

NOTE: Another "simple" solution would be to always update the cgroup
freezer state during resume. However, it's a bad choice for several reasons:
Updating the cgroup freezer state is somewhat expensive because it requires
walking all the tasks in the cgroup and checking if they are each frozen.
Worse, this could easily make resume run in N^2 time where N is the number
of tasks in the cgroup. Finally, updating the freezer state from this code
path requires trickier locking because of the way locks must be ordered.

Instead of updating the freezer state we rely on the fact that lazy
updates only manage the transition from FREEZING to FROZEN. We know that
a cgroup with the FREEZING state may actually be FROZEN so test for that
state too. This makes sense in the resume path even for partially-frozen
cgroups -- those that really are FREEZING but not FROZEN.
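
To make the effect of the lazy update concrete, here is a small userspace
model of the old and new tests. It is only a sketch under stated
assumptions: struct model_cgroup, model_update_if_frozen(), model_old_test()
and model_new_test() are hypothetical names invented for illustration, and
none of the kernel locking is modelled. Only the three freezer states and
the root-cgroup special case are taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; this is not the kernel implementation. */
enum freezer_state { CGROUP_THAWED, CGROUP_FREEZING, CGROUP_FROZEN };

/* Hypothetical stand-in for a freezer cgroup and its tasks. */
struct model_cgroup {
	enum freezer_state state;	/* recorded state, may lag behind reality */
	bool is_root;			/* root cgroup can't be frozen */
	const bool *task_frozen;	/* per-task "is frozen" flags */
	size_t ntasks;
};

/*
 * Lazy update: only promotes FREEZING to FROZEN, and only when called.
 * In the kernel this happens on read/write of freezer.state.
 */
static void model_update_if_frozen(struct model_cgroup *cg)
{
	size_t i;

	if (cg->state != CGROUP_FREEZING)
		return;
	for (i = 0; i < cg->ntasks; i++)
		if (!cg->task_frozen[i])
			return;
	cg->state = CGROUP_FROZEN;
}

/* Old resume test: misses cgroups whose recorded state is still FREEZING. */
static bool model_old_test(const struct model_cgroup *cg)
{
	return cg->state == CGROUP_FROZEN;
}

/* New resume test: FREEZING and FROZEN both mean "do not thaw". */
static bool model_new_test(const struct model_cgroup *cg)
{
	enum freezer_state state = cg->is_root ? CGROUP_THAWED : cg->state;

	return state == CGROUP_FREEZING || state == CGROUP_FROZEN;
}
```

With every task frozen but no lazy update yet, the recorded state is still
FREEZING: model_old_test() says "thaw" while model_new_test() says "leave
frozen". After model_update_if_frozen() runs, both tests agree.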

Reported-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Cedric Le Goater <legoater@free.fr>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@suse.cz>
Cc: linux-pm@lists.linux-foundation.org

Seems like a candidate for -stable.
---
 include/linux/freezer.h |    7 +++++--
 kernel/cgroup_freezer.c |    9 ++++++---
 kernel/power/process.c  |    2 +-
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 5a361f8..da7e52b 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -64,9 +64,12 @@ extern bool freeze_task(struct task_struct *p, bool sig_only);
 extern void cancel_freezing(struct task_struct *p);
 
 #ifdef CONFIG_CGROUP_FREEZER
-extern int cgroup_frozen(struct task_struct *task);
+extern int cgroup_freezing_or_frozen(struct task_struct *task);
 #else /* !CONFIG_CGROUP_FREEZER */
-static inline int cgroup_frozen(struct task_struct *task) { return 0; }
+static inline int cgroup_freezing_or_frozen(struct task_struct *task)
+{
+	return 0;
+}
 #endif /* !CONFIG_CGROUP_FREEZER */
 
 /*
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index fb249e2..765e2c1 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -47,17 +47,20 @@ static inline struct freezer *task_freezer(struct task_struct *task)
 			    struct freezer, css);
 }
 
-int cgroup_frozen(struct task_struct *task)
+int cgroup_freezing_or_frozen(struct task_struct *task)
 {
 	struct freezer *freezer;
 	enum freezer_state state;
 
 	task_lock(task);
 	freezer = task_freezer(task);
-	state = freezer->state;
+	if (!freezer->css.cgroup->parent)
+		state = CGROUP_THAWED; /* root cgroup can't be frozen */
+	else
+		state = freezer->state;
 	task_unlock(task);
 
-	return state == CGROUP_FROZEN;
+	return (state == CGROUP_FREEZING) || (state == CGROUP_FROZEN);
 }
 
 /*
diff --git a/kernel/power/process.c b/kernel/power/process.c
index da2072d..3728d4c 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -138,7 +138,7 @@ static void thaw_tasks(bool nosig_only)
 		if (nosig_only && should_send_signal(p))
 			continue;
 
-		if (cgroup_frozen(p))
+		if (cgroup_freezing_or_frozen(p))
 			continue;
 
 		thaw_process(p);
-- 
1.6.0.4

* [PATCH v18 06/80] cgroup freezer: Update stale locking comments
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (4 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 05/80] cgroup freezer: Fix buggy resume test for tasks frozen with cgroup freezer Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 07/80] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint Oren Laadan
                     ` (75 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Paul Menage, Pavel Emelyanov

From: Matt Helsley <matthltc@us.ibm.com>

Update stale comments regarding locking order and add a little more detail
so it's easier to follow the locking between the cgroup freezer and the
power management freezer code.
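
As a rough illustration of what the updated ordering comment implies, the
sketch below assigns each lock a rank and checks that a nested acquisition
chain only ever moves to a higher rank. This is an assumption-laden
userspace model, not kernel code: enum model_lock_rank and model_order_ok()
are hypothetical names, and the ranks simply follow the order in which the
locks are listed in the updated comment.

```c
#include <stdbool.h>
#include <stddef.h>

/* Ranks follow the updated ordering comment: lower rank is taken first. */
enum model_lock_rank {
	RANK_CGROUP_MUTEX,	/* cgroup_mutex (AKA cgroup_lock) */
	RANK_FREEZER_LOCK,	/* freezer->lock */
	RANK_CSS_SET_LOCK,	/* css_set_lock */
	RANK_TASK_LOCK,		/* task->alloc_lock (AKA task_lock) */
	RANK_SIGLOCK		/* task->sighand->siglock */
};

/*
 * Return true if every lock in a nested acquisition chain has a strictly
 * higher rank than the lock taken before it.
 */
static bool model_order_ok(const enum model_lock_rank *seq, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++)
		if (seq[i] <= seq[i - 1])
			return false;
	return true;
}
```

For example, the freezer_write() (unfreeze) chain of cgroup_mutex,
freezer->lock, css_set_lock, task->alloc_lock, siglock passes this check,
while a chain that takes css_set_lock before cgroup_mutex, as the stale
listing suggested, does not.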

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Cedric Le Goater <legoater@free.fr>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 kernel/cgroup_freezer.c |   21 +++++++++++++--------
 1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 765e2c1..22fce5d 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -88,10 +88,10 @@ struct cgroup_subsys freezer_subsys;
 
 /* Locks taken and their ordering
  * ------------------------------
- * css_set_lock
  * cgroup_mutex (AKA cgroup_lock)
- * task->alloc_lock (AKA task_lock)
  * freezer->lock
+ * css_set_lock
+ * task->alloc_lock (AKA task_lock)
  * task->sighand->siglock
  *
  * cgroup code forces css_set_lock to be taken before task->alloc_lock
@@ -99,33 +99,38 @@ struct cgroup_subsys freezer_subsys;
  * freezer_create(), freezer_destroy():
  * cgroup_mutex [ by cgroup core ]
  *
- * can_attach():
- * cgroup_mutex
+ * freezer_can_attach():
+ * cgroup_mutex (held by caller of can_attach)
  *
- * cgroup_frozen():
+ * cgroup_freezing_or_frozen():
  * task->alloc_lock (to get task's cgroup)
  *
  * freezer_fork() (preserving fork() performance means can't take cgroup_mutex):
- * task->alloc_lock (to get task's cgroup)
  * freezer->lock
  *  sighand->siglock (if the cgroup is freezing)
  *
  * freezer_read():
  * cgroup_mutex
  *  freezer->lock
+ *   write_lock css_set_lock (cgroup iterator start)
+ *    task->alloc_lock
  *   read_lock css_set_lock (cgroup iterator start)
  *
  * freezer_write() (freeze):
  * cgroup_mutex
  *  freezer->lock
+ *   write_lock css_set_lock (cgroup iterator start)
+ *    task->alloc_lock
  *   read_lock css_set_lock (cgroup iterator start)
- *    sighand->siglock
+ *    sighand->siglock (fake signal delivery inside freeze_task())
  *
  * freezer_write() (unfreeze):
  * cgroup_mutex
  *  freezer->lock
+ *   write_lock css_set_lock (cgroup iterator start)
+ *    task->alloc_lock
  *   read_lock css_set_lock (cgroup iterator start)
- *    task->alloc_lock (to prevent races with freeze_task())
+ *    task->alloc_lock (inside thaw_process(), prevents race with refrigerator())
  *     sighand->siglock
  */
 static struct cgroup_subsys_state *freezer_create(struct cgroup_subsys *ss,
-- 
1.6.0.4

* [PATCH v18 06/80] cgroup freezer: Update stale locking comments
@ 2009-09-23 23:50   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Matt Helsley,
	Oren Laadan, Cedric Le Goater, Paul Menage, Li Zefan

From: Matt Helsley <matthltc@us.ibm.com>

Update stale comments regarding locking order and add a little more detail
so it's easier to follow the locking between the cgroup freezer and the
power management freezer code.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Cedric Le Goater <legoater@free.fr>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 kernel/cgroup_freezer.c |   21 +++++++++++++--------
 1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 765e2c1..22fce5d 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -88,10 +88,10 @@ struct cgroup_subsys freezer_subsys;
 
 /* Locks taken and their ordering
  * ------------------------------
- * css_set_lock
  * cgroup_mutex (AKA cgroup_lock)
- * task->alloc_lock (AKA task_lock)
  * freezer->lock
+ * css_set_lock
+ * task->alloc_lock (AKA task_lock)
  * task->sighand->siglock
  *
  * cgroup code forces css_set_lock to be taken before task->alloc_lock
@@ -99,33 +99,38 @@ struct cgroup_subsys freezer_subsys;
  * freezer_create(), freezer_destroy():
  * cgroup_mutex [ by cgroup core ]
  *
- * can_attach():
- * cgroup_mutex
+ * freezer_can_attach():
+ * cgroup_mutex (held by caller of can_attach)
  *
- * cgroup_frozen():
+ * cgroup_freezing_or_frozen():
  * task->alloc_lock (to get task's cgroup)
  *
  * freezer_fork() (preserving fork() performance means can't take cgroup_mutex):
- * task->alloc_lock (to get task's cgroup)
  * freezer->lock
  *  sighand->siglock (if the cgroup is freezing)
  *
  * freezer_read():
  * cgroup_mutex
  *  freezer->lock
+ *   write_lock css_set_lock (cgroup iterator start)
+ *    task->alloc_lock
  *   read_lock css_set_lock (cgroup iterator start)
  *
  * freezer_write() (freeze):
  * cgroup_mutex
  *  freezer->lock
+ *   write_lock css_set_lock (cgroup iterator start)
+ *    task->alloc_lock
  *   read_lock css_set_lock (cgroup iterator start)
- *    sighand->siglock
+ *    sighand->siglock (fake signal delivery inside freeze_task())
  *
  * freezer_write() (unfreeze):
  * cgroup_mutex
  *  freezer->lock
+ *   write_lock css_set_lock (cgroup iterator start)
+ *    task->alloc_lock
  *   read_lock css_set_lock (cgroup iterator start)
- *    task->alloc_lock (to prevent races with freeze_task())
+ *    task->alloc_lock (inside thaw_process(), prevents race with refrigerator())
  *     sighand->siglock
  */
 static struct cgroup_subsys_state *freezer_create(struct cgroup_subsys *ss,
-- 
1.6.0.4


* [PATCH v18 07/80] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (5 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 06/80] cgroup freezer: Update stale locking comments Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 08/80] cgroup freezer: interface to freeze a cgroup from within the kernel Oren Laadan
                     ` (74 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Paul Menage, Pavel Emelyanov

From: Matt Helsley <matthltc@us.ibm.com>

The CHECKPOINTING state prevents userspace from unfreezing tasks until
sys_checkpoint() has finished. When checkpointing a container, userspace
will do:

	echo FROZEN > /cgroups/my_container/freezer.state
	...
	rc = sys_checkpoint( <pid of container root> );

To ensure a consistent checkpoint image, userspace should not be allowed
to thaw the cgroup (echo THAWED > /cgroups/my_container/freezer.state)
during checkpoint.

"CHECKPOINTING" can only be set on a "FROZEN" cgroup using the checkpoint
system call. Once in the "CHECKPOINTING" state, the cgroup may not leave until
the checkpoint system call is finished and ready to return. Then the
freezer state returns to "FROZEN". Writing any new state to freezer.state while
checkpointing will return EBUSY. These semantics ensure that userspace cannot
unfreeze the cgroup midway through the checkpoint system call.
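
The semantics above can be modelled in a few lines of plain C. This is
only a sketch under stated assumptions: struct freezer_model and the
model_* functions are hypothetical userspace stand-ins, locking is
omitted, and a write of FROZEN is assumed to complete at once (the real
code passes through FREEZING). The transition check mirrors the one in
freezer_checkpointing() from this patch.

```c
#include <assert.h>
#include <errno.h>

/* Same states as the patched enum; this is a userspace model only. */
enum freezer_state {
	CGROUP_THAWED = 0,
	CGROUP_FREEZING,
	CGROUP_FROZEN,
	CGROUP_CHECKPOINTING,
};

/* Hypothetical stand-in for struct freezer; no lock in this model. */
struct freezer_model {
	enum freezer_state state;
};

/* Models a userspace write to freezer.state. */
int model_write_state(struct freezer_model *f, enum freezer_state goal)
{
	/* Assumption: only THAWED and FROZEN are valid goals for a write. */
	if (goal != CGROUP_THAWED && goal != CGROUP_FROZEN)
		return -EINVAL;
	if (f->state == CGROUP_CHECKPOINTING)
		return -EBUSY;
	f->state = goal;	/* simplification: freezing is instantaneous */
	return 0;
}

/* Only FROZEN <-> CHECKPOINTING is allowed; returns the previous state. */
enum freezer_state model_checkpointing(struct freezer_model *f,
				       enum freezer_state next)
{
	enum freezer_state prev = f->state;

	if ((prev == CGROUP_FROZEN && next == CGROUP_CHECKPOINTING) ||
	    (prev == CGROUP_CHECKPOINTING && next == CGROUP_FROZEN))
		f->state = next;
	return prev;
}

int model_begin_checkpoint(struct freezer_model *f)
{
	if (model_checkpointing(f, CGROUP_CHECKPOINTING) != CGROUP_FROZEN)
		return -EBUSY;
	return 0;
}

void model_end_checkpoint(struct freezer_model *f)
{
	enum freezer_state prev = model_checkpointing(f, CGROUP_FROZEN);

	/* The model asserts where the patch uses WARN_ON(). */
	assert(prev == CGROUP_CHECKPOINTING);
	(void)prev;
}
```

In this model, model_begin_checkpoint() on a THAWED group fails with
-EBUSY. Once the group is FROZEN it succeeds, and from then until
model_end_checkpoint() both a write of THAWED and a second
model_begin_checkpoint() fail with -EBUSY.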

cgroup_freezer_begin_checkpoint() and cgroup_freezer_end_checkpoint()
make relatively few assumptions about the task that is passed in. However,
the way they are called in do_checkpoint() assumes that the root of the
container is in the same freezer cgroup as all the other tasks that will be
checkpointed.

Notes:
        As a side effect, this prevents multiple tasks from entering the
        CHECKPOINTING state simultaneously. All but one will get -EBUSY.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Cedric Le Goater <legoater@free.fr>
---
 Documentation/cgroups/freezer-subsystem.txt |   10 ++
 include/linux/freezer.h                     |    8 ++
 kernel/cgroup_freezer.c                     |  166 ++++++++++++++++++++-------
 3 files changed, 142 insertions(+), 42 deletions(-)

diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt
index 41f37fe..92b68e6 100644
--- a/Documentation/cgroups/freezer-subsystem.txt
+++ b/Documentation/cgroups/freezer-subsystem.txt
@@ -100,3 +100,13 @@ things happens:
 		and returns EINVAL)
 	3) The tasks that blocked the cgroup from entering the "FROZEN"
 		state disappear from the cgroup's set of tasks.
+
+When the cgroup freezer is used to guard container checkpoint operations the
+freezer.state may be "CHECKPOINTING". "CHECKPOINTING" can only be set on a
+"FROZEN" cgroup using the checkpoint system call. Once in the "CHECKPOINTING"
+state, the cgroup may not leave until the checkpoint system call returns the
+freezer state to "FROZEN". Writing any new state to freezer.state while
+checkpointing will return EBUSY. These semantics ensure that userspace cannot
+unfreeze the cgroup midway through the checkpoint system call. Note that,
+unlike "FROZEN" and "FREEZING", there is no corresponding "CHECKPOINTED"
+state.
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index da7e52b..3d32641 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -65,11 +65,19 @@ extern void cancel_freezing(struct task_struct *p);
 
 #ifdef CONFIG_CGROUP_FREEZER
 extern int cgroup_freezing_or_frozen(struct task_struct *task);
+extern int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q);
+extern int cgroup_freezer_begin_checkpoint(struct task_struct *task);
+extern void cgroup_freezer_end_checkpoint(struct task_struct *task);
 #else /* !CONFIG_CGROUP_FREEZER */
 static inline int cgroup_freezing_or_frozen(struct task_struct *task)
 {
 	return 0;
 }
+static inline int in_same_cgroup_freezer(struct task_struct *p,
+					 struct task_struct *q)
+{
+	return 0;
+}
 #endif /* !CONFIG_CGROUP_FREEZER */
 
 /*
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 22fce5d..87dfbfb 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -25,6 +25,7 @@ enum freezer_state {
 	CGROUP_THAWED = 0,
 	CGROUP_FREEZING,
 	CGROUP_FROZEN,
+	CGROUP_CHECKPOINTING,
 };
 
 struct freezer {
@@ -63,6 +64,44 @@ int cgroup_freezing_or_frozen(struct task_struct *task)
 	return (state == CGROUP_FREEZING) || (state == CGROUP_FROZEN);
 }
 
+/* Task is frozen or will freeze immediately when next it gets woken */
+static bool is_task_frozen_enough(struct task_struct *task)
+{
+	return frozen(task) ||
+		(task_is_stopped_or_traced(task) && freezing(task));
+}
+
+/*
+ * caller must hold freezer->lock
+ */
+static void update_freezer_state(struct cgroup *cgroup,
+				 struct freezer *freezer)
+{
+	struct cgroup_iter it;
+	struct task_struct *task;
+	unsigned int nfrozen = 0, ntotal = 0;
+
+	cgroup_iter_start(cgroup, &it);
+	while ((task = cgroup_iter_next(cgroup, &it))) {
+		ntotal++;
+		if (is_task_frozen_enough(task))
+			nfrozen++;
+	}
+
+	/*
+	 * Transition to FROZEN when no new tasks can be added ensures
+	 * that we never exist in the FROZEN state while there are unfrozen
+	 * tasks.
+	 */
+	if (nfrozen == ntotal)
+		freezer->state = CGROUP_FROZEN;
+	else if (nfrozen > 0)
+		freezer->state = CGROUP_FREEZING;
+	else
+		freezer->state = CGROUP_THAWED;
+	cgroup_iter_end(cgroup, &it);
+}
+
 /*
  * cgroups_write_string() limits the size of freezer state strings to
  * CGROUP_LOCAL_BUFFER_SIZE
@@ -71,6 +110,7 @@ static const char *freezer_state_strs[] = {
 	"THAWED",
 	"FREEZING",
 	"FROZEN",
+	"CHECKPOINTING",
 };
 
 /*
@@ -78,9 +118,9 @@ static const char *freezer_state_strs[] = {
  * Transitions are caused by userspace writes to the freezer.state file.
  * The values in parenthesis are state labels. The rest are edge labels.
  *
- * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN)
- *    ^ ^                    |                     |
- *    | \_______THAWED_______/                     |
+ * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN) --> (CHECKPOINTING)
+ *    ^ ^                    |                     | ^             |
+ *    | \_______THAWED_______/                     | \_____________/
  *    \__________________________THAWED____________/
  */
 
@@ -153,13 +193,6 @@ static void freezer_destroy(struct cgroup_subsys *ss,
 	kfree(cgroup_freezer(cgroup));
 }
 
-/* Task is frozen or will freeze immediately when next it gets woken */
-static bool is_task_frozen_enough(struct task_struct *task)
-{
-	return frozen(task) ||
-		(task_is_stopped_or_traced(task) && freezing(task));
-}
-
 /*
  * The call to cgroup_lock() in the freezer.state write method prevents
  * a write to that file racing against an attach, and hence the
@@ -216,37 +249,6 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	spin_unlock_irq(&freezer->lock);
 }
 
-/*
- * caller must hold freezer->lock
- */
-static void update_freezer_state(struct cgroup *cgroup,
-				 struct freezer *freezer)
-{
-	struct cgroup_iter it;
-	struct task_struct *task;
-	unsigned int nfrozen = 0, ntotal = 0;
-
-	cgroup_iter_start(cgroup, &it);
-	while ((task = cgroup_iter_next(cgroup, &it))) {
-		ntotal++;
-		if (is_task_frozen_enough(task))
-			nfrozen++;
-	}
-
-	/*
-	 * Transition to FROZEN when no new tasks can be added ensures
-	 * that we never exist in the FROZEN state while there are unfrozen
-	 * tasks.
-	 */
-	if (nfrozen == ntotal)
-		freezer->state = CGROUP_FROZEN;
-	else if (nfrozen > 0)
-		freezer->state = CGROUP_FREEZING;
-	else
-		freezer->state = CGROUP_THAWED;
-	cgroup_iter_end(cgroup, &it);
-}
-
 static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
 			struct seq_file *m)
 {
@@ -317,7 +319,10 @@ static int freezer_change_state(struct cgroup *cgroup,
 	freezer = cgroup_freezer(cgroup);
 
 	spin_lock_irq(&freezer->lock);
-
+	if (freezer->state == CGROUP_CHECKPOINTING) {
+		retval = -EBUSY;
+		goto out;
+	}
 	update_freezer_state(cgroup, freezer);
 	if (goal_state == freezer->state)
 		goto out;
@@ -385,3 +390,80 @@ struct cgroup_subsys freezer_subsys = {
 	.fork		= freezer_fork,
 	.exit		= NULL,
 };
+
+#ifdef CONFIG_CHECKPOINT
+/*
+ * Caller is expected to ensure that neither @p nor @q may change its
+ * freezer cgroup during this test in a way that may affect the result.
+ * E.g., when called from c/r, @p must be in CHECKPOINTING cgroup, so
+ * may not change cgroup, and either @q is also there, or is not there
+ * and may not join.
+ */
+int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q)
+{
+	struct cgroup_subsys_state *p_css, *q_css;
+
+	task_lock(p);
+	p_css = task_subsys_state(p, freezer_subsys_id);
+	task_unlock(p);
+
+	task_lock(q);
+	q_css = task_subsys_state(q, freezer_subsys_id);
+	task_unlock(q);
+
+	return (p_css == q_css);
+}
+
+/*
+ * cgroup freezer state changes made without the aid of the cgroup filesystem
+ * must go through this function to ensure proper locking is observed.
+ */
+static int freezer_checkpointing(struct task_struct *task,
+				 enum freezer_state next_state)
+{
+	struct freezer *freezer;
+	struct cgroup_subsys_state *css;
+	enum freezer_state state;
+
+	task_lock(task);
+	css = task_subsys_state(task, freezer_subsys_id);
+	css_get(css); /* make sure freezer doesn't go away */
+	freezer = container_of(css, struct freezer, css);
+	task_unlock(task);
+
+	if (freezer->state == CGROUP_FREEZING) {
+		/* May be in middle of a lazy FREEZING -> FROZEN transition */
+		if (cgroup_lock_live_group(css->cgroup)) {
+			spin_lock_irq(&freezer->lock);
+			update_freezer_state(css->cgroup, freezer);
+			spin_unlock_irq(&freezer->lock);
+			cgroup_unlock();
+		}
+	}
+
+	spin_lock_irq(&freezer->lock);
+	state = freezer->state;
+	if ((state == CGROUP_FROZEN && next_state == CGROUP_CHECKPOINTING) ||
+	    (state == CGROUP_CHECKPOINTING && next_state == CGROUP_FROZEN))
+		freezer->state = next_state;
+	spin_unlock_irq(&freezer->lock);
+	css_put(css);
+	return state;
+}
+
+int cgroup_freezer_begin_checkpoint(struct task_struct *task)
+{
+	if (freezer_checkpointing(task, CGROUP_CHECKPOINTING) != CGROUP_FROZEN)
+		return -EBUSY;
+	return 0;
+}
+
+void cgroup_freezer_end_checkpoint(struct task_struct *task)
+{
+	/*
+	 * If we weren't in CHECKPOINTING state then userspace could have
+	 * unfrozen a task and given us an inconsistent checkpoint image
+	 */
+	WARN_ON(freezer_checkpointing(task, CGROUP_FROZEN) != CGROUP_CHECKPOINTING);
+}
+#endif /* CONFIG_CHECKPOINT */
-- 
1.6.0.4


* [PATCH v18 07/80] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint
@ 2009-09-23 23:50   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Matt Helsley,
	Oren Laadan, Paul Menage, Li Zefan, Cedric Le Goater

From: Matt Helsley <matthltc@us.ibm.com>

The CHECKPOINTING state prevents userspace from unfreezing tasks until
sys_checkpoint() is finished. When doing container checkpoint userspace
will do:

	echo FROZEN > /cgroups/my_container/freezer.state
	...
	rc = sys_checkpoint( <pid of container root> );

To ensure a consistent checkpoint image userspace should not be allowed
to thaw the cgroup (echo THAWED > /cgroups/my_container/freezer.state)
during checkpoint.

"CHECKPOINTING" can only be set on a "FROZEN" cgroup using the checkpoint
system call. Once in the "CHECKPOINTING" state, the cgroup may not leave until
the checkpoint system call is finished and ready to return. Then the
freezer state returns to "FROZEN". Writing any new state to freezer.state while
checkpointing will return EBUSY. These semantics ensure that userspace cannot
unfreeze the cgroup midway through the checkpoint system call.

The cgroup_freezer_begin_checkpoint() and cgroup_freezer_end_checkpoint()
make relatively few assumptions about the task that is passed in. However the
way they are called in do_checkpoint() assumes that the root of the container
is in the same freezer cgroup as all the other tasks that will be
checkpointed.

Notes:
        As a side-effect this prevents the multiple tasks from entering the
        CHECKPOINTING state simultaneously. All but one will get -EBUSY.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Cedric Le Goater <legoater@free.fr>
---
 Documentation/cgroups/freezer-subsystem.txt |   10 ++
 include/linux/freezer.h                     |    8 ++
 kernel/cgroup_freezer.c                     |  166 ++++++++++++++++++++-------
 3 files changed, 142 insertions(+), 42 deletions(-)

diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt
index 41f37fe..92b68e6 100644
--- a/Documentation/cgroups/freezer-subsystem.txt
+++ b/Documentation/cgroups/freezer-subsystem.txt
@@ -100,3 +100,13 @@ things happens:
 		and returns EINVAL)
 	3) The tasks that blocked the cgroup from entering the "FROZEN"
 		state disappear from the cgroup's set of tasks.
+
+When the cgroup freezer is used to guard container checkpoint operations the
+freezer.state may be "CHECKPOINTING". "CHECKPOINTING" can only be set on a
+"FROZEN" cgroup using the checkpoint system call. Once in the "CHECKPOINTING"
+state, the cgroup may not leave until the checkpoint system call returns the
+freezer state to "FROZEN". Writing any new state to freezer.state while
+checkpointing will return EBUSY. These semantics ensure that userspace cannot
+unfreeze the cgroup midway through the checkpoint system call. Note that,
+unlike "FROZEN" and "FREEZING", there is no corresponding "CHECKPOINTED"
+state.
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index da7e52b..3d32641 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -65,11 +65,19 @@ extern void cancel_freezing(struct task_struct *p);
 
 #ifdef CONFIG_CGROUP_FREEZER
 extern int cgroup_freezing_or_frozen(struct task_struct *task);
+extern int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q);
+extern int cgroup_freezer_begin_checkpoint(struct task_struct *task);
+extern void cgroup_freezer_end_checkpoint(struct task_struct *task);
 #else /* !CONFIG_CGROUP_FREEZER */
 static inline int cgroup_freezing_or_frozen(struct task_struct *task)
 {
 	return 0;
 }
+static inline int in_same_cgroup_freezer(struct task_struct *p,
+					 struct task_struct *q)
+{
+	return 0;
+}
 #endif /* !CONFIG_CGROUP_FREEZER */
 
 /*
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 22fce5d..87dfbfb 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -25,6 +25,7 @@ enum freezer_state {
 	CGROUP_THAWED = 0,
 	CGROUP_FREEZING,
 	CGROUP_FROZEN,
+	CGROUP_CHECKPOINTING,
 };
 
 struct freezer {
@@ -63,6 +64,44 @@ int cgroup_freezing_or_frozen(struct task_struct *task)
 	return (state == CGROUP_FREEZING) || (state == CGROUP_FROZEN);
 }
 
+/* Task is frozen or will freeze immediately when next it gets woken */
+static bool is_task_frozen_enough(struct task_struct *task)
+{
+	return frozen(task) ||
+		(task_is_stopped_or_traced(task) && freezing(task));
+}
+
+/*
+ * caller must hold freezer->lock
+ */
+static void update_freezer_state(struct cgroup *cgroup,
+				 struct freezer *freezer)
+{
+	struct cgroup_iter it;
+	struct task_struct *task;
+	unsigned int nfrozen = 0, ntotal = 0;
+
+	cgroup_iter_start(cgroup, &it);
+	while ((task = cgroup_iter_next(cgroup, &it))) {
+		ntotal++;
+		if (is_task_frozen_enough(task))
+			nfrozen++;
+	}
+
+	/*
+	 * Transition to FROZEN when no new tasks can be added ensures
+	 * that we never exist in the FROZEN state while there are unfrozen
+	 * tasks.
+	 */
+	if (nfrozen == ntotal)
+		freezer->state = CGROUP_FROZEN;
+	else if (nfrozen > 0)
+		freezer->state = CGROUP_FREEZING;
+	else
+		freezer->state = CGROUP_THAWED;
+	cgroup_iter_end(cgroup, &it);
+}
+
 /*
  * cgroups_write_string() limits the size of freezer state strings to
  * CGROUP_LOCAL_BUFFER_SIZE
@@ -71,6 +110,7 @@ static const char *freezer_state_strs[] = {
 	"THAWED",
 	"FREEZING",
 	"FROZEN",
+	"CHECKPOINTING",
 };
 
 /*
@@ -78,9 +118,9 @@ static const char *freezer_state_strs[] = {
  * Transitions are caused by userspace writes to the freezer.state file.
  * The values in parenthesis are state labels. The rest are edge labels.
  *
- * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN)
- *    ^ ^                    |                     |
- *    | \_______THAWED_______/                     |
+ * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN) --> (CHECKPOINTING)
+ *    ^ ^                    |                     | ^             |
+ *    | \_______THAWED_______/                     | \_____________/
  *    \__________________________THAWED____________/
  */
 
@@ -153,13 +193,6 @@ static void freezer_destroy(struct cgroup_subsys *ss,
 	kfree(cgroup_freezer(cgroup));
 }
 
-/* Task is frozen or will freeze immediately when next it gets woken */
-static bool is_task_frozen_enough(struct task_struct *task)
-{
-	return frozen(task) ||
-		(task_is_stopped_or_traced(task) && freezing(task));
-}
-
 /*
  * The call to cgroup_lock() in the freezer.state write method prevents
  * a write to that file racing against an attach, and hence the
@@ -216,37 +249,6 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 	spin_unlock_irq(&freezer->lock);
 }
 
-/*
- * caller must hold freezer->lock
- */
-static void update_freezer_state(struct cgroup *cgroup,
-				 struct freezer *freezer)
-{
-	struct cgroup_iter it;
-	struct task_struct *task;
-	unsigned int nfrozen = 0, ntotal = 0;
-
-	cgroup_iter_start(cgroup, &it);
-	while ((task = cgroup_iter_next(cgroup, &it))) {
-		ntotal++;
-		if (is_task_frozen_enough(task))
-			nfrozen++;
-	}
-
-	/*
-	 * Transition to FROZEN when no new tasks can be added ensures
-	 * that we never exist in the FROZEN state while there are unfrozen
-	 * tasks.
-	 */
-	if (nfrozen == ntotal)
-		freezer->state = CGROUP_FROZEN;
-	else if (nfrozen > 0)
-		freezer->state = CGROUP_FREEZING;
-	else
-		freezer->state = CGROUP_THAWED;
-	cgroup_iter_end(cgroup, &it);
-}
-
 static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
 			struct seq_file *m)
 {
@@ -317,7 +319,10 @@ static int freezer_change_state(struct cgroup *cgroup,
 	freezer = cgroup_freezer(cgroup);
 
 	spin_lock_irq(&freezer->lock);
-
+	if (freezer->state == CGROUP_CHECKPOINTING) {
+		retval = -EBUSY;
+		goto out;
+	}
 	update_freezer_state(cgroup, freezer);
 	if (goal_state == freezer->state)
 		goto out;
@@ -385,3 +390,80 @@ struct cgroup_subsys freezer_subsys = {
 	.fork		= freezer_fork,
 	.exit		= NULL,
 };
+
+#ifdef CONFIG_CHECKPOINT
+/*
+ * Caller is expected to ensure that neither @p nor @q may change its
+ * freezer cgroup during this test in a way that may affect the result.
+ * E.g., when called form c/r, @p must be in CHECKPOINTING cgroup, so
+ * may not change cgroup, and either @q is also there, or is not there
+ * and may not join.
+ */
+int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q)
+{
+	struct cgroup_subsys_state *p_css, *q_css;
+
+	task_lock(p);
+	p_css = task_subsys_state(p, freezer_subsys_id);
+	task_unlock(p);
+
+	task_lock(q);
+	q_css = task_subsys_state(q, freezer_subsys_id);
+	task_unlock(q);
+
+	return (p_css == q_css);
+}
+
+/*
+ * cgroup freezer state changes made without the aid of the cgroup filesystem
+ * must go through this function to ensure proper locking is observed.
+ */
+static int freezer_checkpointing(struct task_struct *task,
+				 enum freezer_state next_state)
+{
+	struct freezer *freezer;
+	struct cgroup_subsys_state *css;
+	enum freezer_state state;
+
+	task_lock(task);
+	css = task_subsys_state(task, freezer_subsys_id);
+	css_get(css); /* make sure freezer doesn't go away */
+	freezer = container_of(css, struct freezer, css);
+	task_unlock(task);
+
+	if (freezer->state == CGROUP_FREEZING) {
+		/* May be in middle of a lazy FREEZING -> FROZEN transition */
+		if (cgroup_lock_live_group(css->cgroup)) {
+			spin_lock_irq(&freezer->lock);
+			update_freezer_state(css->cgroup, freezer);
+			spin_unlock_irq(&freezer->lock);
+			cgroup_unlock();
+		}
+	}
+
+	spin_lock_irq(&freezer->lock);
+	state = freezer->state;
+	if ((state == CGROUP_FROZEN && next_state == CGROUP_CHECKPOINTING) ||
+	    (state == CGROUP_CHECKPOINTING && next_state == CGROUP_FROZEN))
+		freezer->state = next_state;
+	spin_unlock_irq(&freezer->lock);
+	css_put(css);
+	return state;
+}
+
+int cgroup_freezer_begin_checkpoint(struct task_struct *task)
+{
+	if (freezer_checkpointing(task, CGROUP_CHECKPOINTING) != CGROUP_FROZEN)
+		return -EBUSY;
+	return 0;
+}
+
+void cgroup_freezer_end_checkpoint(struct task_struct *task)
+{
+	/*
+	 * If we weren't in CHECKPOINTING state then userspace could have
+	 * unfrozen a task and given us an inconsistent checkpoint image
+	 */
+	WARN_ON(freezer_checkpointing(task, CGROUP_FROZEN) != CGROUP_CHECKPOINTING);
+}
+#endif /* CONFIG_CHECKPOINT */
-- 
1.6.0.4
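
The transition rule in freezer_checkpointing() and the new -EBUSY check in freezer_change_state() can be tried out in user space. The following is only a sketch under assumptions: the mock_* names are hypothetical, the enum values are arbitrary, locking is left out, and mock_change_state() skips the actual freezing and thawing of tasks.

```c
#include <assert.h>
#include <errno.h>

/* Mirrors the states named in the patch; the numeric values are arbitrary. */
enum freezer_state {
	CGROUP_THAWED,
	CGROUP_FREEZING,
	CGROUP_FROZEN,
	CGROUP_CHECKPOINTING,
};

/* Hypothetical stand-in for struct freezer, without the spinlock. */
struct mock_freezer {
	enum freezer_state state;
};

/*
 * Same rule as freezer_checkpointing(): only FROZEN -> CHECKPOINTING and
 * CHECKPOINTING -> FROZEN are applied. Returns the state seen before the
 * attempted change.
 */
static enum freezer_state mock_checkpointing(struct mock_freezer *f,
					     enum freezer_state next)
{
	enum freezer_state old = f->state;

	if ((old == CGROUP_FROZEN && next == CGROUP_CHECKPOINTING) ||
	    (old == CGROUP_CHECKPOINTING && next == CGROUP_FROZEN))
		f->state = next;
	return old;
}

/* Shaped like cgroup_freezer_begin_checkpoint(). */
static int mock_begin_checkpoint(struct mock_freezer *f)
{
	if (mock_checkpointing(f, CGROUP_CHECKPOINTING) != CGROUP_FROZEN)
		return -EBUSY;
	return 0;
}

/* Shaped like the new check in freezer_change_state(); simplified. */
static int mock_change_state(struct mock_freezer *f, enum freezer_state goal)
{
	if (f->state == CGROUP_CHECKPOINTING)
		return -EBUSY;
	f->state = goal;
	return 0;
}
```

Starting a checkpoint on a THAWED group fails with -EBUSY and leaves the state alone. Once the group is CHECKPOINTING, a thaw request is refused until the state is moved back to FROZEN.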


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 08/80] cgroup freezer: interface to freeze a cgroup from within the kernel
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (6 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 07/80] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 09/80] Namespaces submenu Oren Laadan
                     ` (73 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Paul Menage, Pavel Emelyanov

Add a public interface to freeze a cgroup freezer, given a task that
belongs to that cgroup: cgroup_freezer_make_frozen(task).

Freezing the root cgroup is not permitted. Freezing the cgroup to
which the current process belongs is also not permitted.

This will be used for restart(2) to be able to leave the restarted
processes in a frozen state, instead of resuming execution.

This is useful for debugging, if the user would like to attach a
debugger to the restarted task(s).

It is also useful if the restart procedure would like to perform
additional setup once the tasks are restored but before they are
allowed to resume execution.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Cedric Le Goater <legoater@free.fr>
---
 include/linux/freezer.h |    1 +
 kernel/cgroup_freezer.c |   27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 3d32641..0cb22cb 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -68,6 +68,7 @@ extern int cgroup_freezing_or_frozen(struct task_struct *task);
 extern int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q);
 extern int cgroup_freezer_begin_checkpoint(struct task_struct *task);
 extern void cgroup_freezer_end_checkpoint(struct task_struct *task);
+extern int cgroup_freezer_make_frozen(struct task_struct *task);
 #else /* !CONFIG_CGROUP_FREEZER */
 static inline int cgroup_freezing_or_frozen(struct task_struct *task)
 {
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 87dfbfb..7925850 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -466,4 +466,31 @@ void cgroup_freezer_end_checkpoint(struct task_struct *task)
 	 */
 	WARN_ON(freezer_checkpointing(task, CGROUP_FROZEN) != CGROUP_CHECKPOINTING);
 }
+
+int cgroup_freezer_make_frozen(struct task_struct *task)
+{
+	struct freezer *freezer;
+	struct cgroup_subsys_state *css;
+	int ret = -ENODEV;
+
+	task_lock(task);
+	css = task_subsys_state(task, freezer_subsys_id);
+	css_get(css); /* make sure freezer doesn't go away */
+	freezer = container_of(css, struct freezer, css);
+	task_unlock(task);
+
+	/* Never freeze the root cgroup */
+	if (!test_bit(CSS_ROOT, &css->flags) &&
+	    cgroup_lock_live_group(css->cgroup)) {
+		/* do not freeze ourselves, eh ?! */
+		if (css != task_subsys_state(current, freezer_subsys_id))
+			ret = freezer_change_state(css->cgroup, CGROUP_FROZEN);
+		else
+			ret = -EPERM;
+		cgroup_unlock();
+	}
+
+	css_put(css);
+	return ret;
+}
 #endif /* CONFIG_CHECKPOINT */
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 09/80] Namespaces submenu
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (7 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 08/80] cgroup freezer: interface to freeze a cgroup from within the kernel Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 10/80] c/r: make file_pos_read/write() public Oren Laadan
                     ` (72 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

From: Dave Hansen <dave@linux.vnet.ibm.com>

Let's not steal too much space in the 'General Setup' menu.
Take a cue from the cgroups code and create a submenu.

This can go upstream now.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 init/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 3f7e609..46ee2c8 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -664,7 +664,7 @@ config RELAY
 
 	  If unsure, say N.
 
-config NAMESPACES
+menuconfig NAMESPACES
 	bool "Namespaces support" if EMBEDDED
 	default !EMBEDDED
 	help
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 10/80] c/r: make file_pos_read/write() public
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (8 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 09/80] Namespaces submenu Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 11/80] pids 1/7: Factor out code to allocate pidmap page Oren Laadan
                     ` (71 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

These two helpers are used in a later patch in this series when
calling vfs_read() and vfs_write().

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 fs/read_write.c    |   10 ----------
 include/linux/fs.h |   10 ++++++++++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 6c8c55d..d331975 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 
 EXPORT_SYMBOL(vfs_write);
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
-	file->f_pos = pos;
-}
-
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
 	struct file *file;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73e9b64..a21f175 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1548,6 +1548,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 				struct iovec *fast_pointer,
 				struct iovec **ret_pointer);
 
+static inline loff_t file_pos_read(struct file *file)
+{
+	return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+	file->f_pos = pos;
+}
+
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 11/80] pids 1/7: Factor out code to allocate pidmap page
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (9 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 10/80] c/r: make file_pos_read/write() public Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 12/80] pids 2/7: Have alloc_pidmap() return actual error code Oren Laadan
                     ` (70 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Sukadev Bhattiprolu, Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

To support the clone_with_pids() system call, we need to allocate a
pidmap page in more than one place. Move this code to a new function,
alloc_pidmap_page().

Changelog[v2]:
	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
	  -ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/pid.c |   46 ++++++++++++++++++++++++++++++----------------
 1 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 31310b5..f618096 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,9 +122,34 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (map->page)
+		kfree(page);
+	else
+		map->page = page;
+	spin_unlock_irq(&pidmap_lock);
+
+	if (unlikely(!map->page))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, rc, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -134,21 +159,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		rc = alloc_pidmap_page(map);
+		if (rc)
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 12/80] pids 2/7: Have alloc_pidmap() return actual error code
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (10 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 11/80] pids 1/7: Factor out code to allocate pidmap page Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 13/80] pids 3/7: Add target_pid parameter to alloc_pidmap() Oren Laadan
                     ` (69 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Sukadev Bhattiprolu, Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

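alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed.  With support for setting a specific
pid number, alloc_pidmap() can also fail if the given pid number is
invalid or already in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error (-EAGAIN when no pid is free).  When alloc_pidmap()
fails, alloc_pid() now returns ERR_PTR(nr) instead of NULL, and
copy_process() picks up the code with PTR_ERR().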

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |    9 ++++++---
 2 files changed, 9 insertions(+), 5 deletions(-)
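
Note: the error-pointer convention this patch relies on can be tried out
in userspace. The following is only a minimal sketch, not kernel code:
err_ptr(), is_err() and ptr_err() are stand-ins for the kernel's
ERR_PTR(), IS_ERR() and PTR_ERR(), and demo_alloc_pid()/demo_fork() are
hypothetical names made up for the illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the kernel's limit on encodable errno values. */
#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True if the pointer lies in the top MAX_ERRNO bytes of the address space. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Recover the negative errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

struct demo_pid {
	int nr;
};

static struct demo_pid demo_slot;

/* Hypothetical allocator: fails with the requested errno, else returns the slot. */
static struct demo_pid *demo_alloc_pid(int fail_with)
{
	if (fail_with)
		return err_ptr(-fail_with);
	demo_slot.nr = 42;
	return &demo_slot;
}

/* Caller side, in the spirit of the copy_process() hunk: pass the real error on. */
static long demo_fork(int fail_with)
{
	struct demo_pid *pid = demo_alloc_pid(fail_with);

	if (is_err(pid))
		return ptr_err(pid);
	return pid->nr;
}
```

With this shape the caller no longer has to guess -ENOMEM: whatever the
allocator reports (-EAGAIN, -ENOMEM, ...) reaches the caller unchanged.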

diff --git a/kernel/fork.c b/kernel/fork.c
index e6c04d4..851ccd1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1110,10 +1110,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index f618096..9c678ce 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -158,6 +158,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	rc = -EAGAIN;
 	for (i = 0; i <= max_scan; ++i) {
 		rc = alloc_pidmap_page(map);
 		if (rc)
@@ -188,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset))
+			if (unlikely(last == offset)) {
+				rc = -EAGAIN;
 				break;
+			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return rc;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -298,7 +301,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 13/80] pids 3/7: Add target_pid parameter to alloc_pidmap()
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (11 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 12/80] pids 2/7: Have alloc_pidmap() return actual error code Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 14/80] pids 4/7: Add target_pids parameter to alloc_pid() Oren Laadan
                     ` (68 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Sukadev Bhattiprolu, Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

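With support for setting a specific pid number for a process,
alloc_pidmap() will need a 'target_pid' parameter.  A non-zero
target_pid is reserved through the new set_pidmap() helper; zero keeps
the existing search for the next free pid.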

Changelog[v2]:
	- (Serge Hallyn) Check for 'pid < 0' in set_pidmap(). (Code
	  actually checks for 'pid <= 0' for completeness.)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/pid.c |   28 ++++++++++++++++++++++++++--
 1 files changed, 26 insertions(+), 2 deletions(-)
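
Note: the reservation logic of set_pidmap() can be sketched in userspace
as follows. This is an illustration under simplifying assumptions, not
the kernel code: the bitmap is a single static array (so there is no
-ENOMEM path), the bit operation is not atomic, and DEMO_PID_MAX,
demo_test_and_set_bit() and demo_set_pidmap() are hypothetical names.

```c
#include <errno.h>
#include <limits.h>

#define DEMO_PID_MAX 1024	/* assumed small limit for the sketch */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* One bit per pid; a set bit means the pid is in use. */
static unsigned long demo_pidmap[DEMO_PID_MAX / (sizeof(unsigned long) * CHAR_BIT) + 1];

/* Non-atomic stand-in for test_and_set_bit(): sets the bit, returns its old value. */
static int demo_test_and_set_bit(int nr, unsigned long *map)
{
	unsigned long mask = 1UL << (nr % DEMO_BITS_PER_LONG);
	unsigned long *word = &map[nr / DEMO_BITS_PER_LONG];
	int old = (*word & mask) != 0;

	*word |= mask;
	return old;
}

/* Reserve exactly 'pid': -EINVAL if out of range, -EBUSY if already taken. */
static int demo_set_pidmap(int pid)
{
	if (pid <= 0 || pid >= DEMO_PID_MAX)
		return -EINVAL;
	if (demo_test_and_set_bit(pid, demo_pidmap))
		return -EBUSY;
	return pid;
}
```

The two error codes let a caller tell a bad request (-EINVAL) from a
pid that is merely taken right now (-EBUSY).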

diff --git a/kernel/pid.c b/kernel/pid.c
index 9c678ce..29cf119 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,11 +147,35 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+	int offset;
+	struct pidmap *map;
+
+	if (pid <= 0 || pid >= pid_max)
+		return -EINVAL;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+
+	if (alloc_pidmap_page(map))
+		return -ENOMEM;
+
+	if (test_and_set_bit(offset, map->page))
+		return -EBUSY;
+
+	atomic_dec(&map->nr_free);
+	return pid;
+}
+
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
 {
 	int i, rc, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
 
+	if (target_pid)
+		return set_pidmap(pid_ns, target_pid);
+
 	pid = last + 1;
 	if (pid >= pid_max)
 		pid = RESERVED_PIDS;
@@ -270,7 +294,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, 0);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 14/80] pids 4/7: Add target_pids parameter to alloc_pid()
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (12 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 13/80] pids 3/7: Add target_pid parameter to alloc_pidmap() Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 15/80] pids 5/7: Add target_pids parameter to copy_process() Oren Laadan
                     ` (67 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Sukadev Bhattiprolu, Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

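Add a 'target_pids' parameter to alloc_pid().  When it is non-NULL,
target_pids[i] is the pid requested at pid namespace level i.
copy_process() currently passes NULL; the parameter will be used in a
follow-on patch.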

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |   13 ++++++++++---
 3 files changed, 13 insertions(+), 5 deletions(-)
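
Note: the per-level lookup added to alloc_pid() can be sketched in
userspace as below. This is only an illustration: demo_alloc_at_level()
is a hypothetical stand-in for alloc_pidmap() that does no in-use
checking, and the starting counters in demo_next[] are arbitrary
assumptions.

```c
#include <stddef.h>
#include <sys/types.h>

#define DEMO_MAX_LEVELS 4

/* Next free number per namespace level; values are arbitrary for the demo. */
static pid_t demo_next[DEMO_MAX_LEVELS] = { 100, 10, 1, 1 };

/* Hypothetical per-level allocator: honour a request, else take the next number. */
static pid_t demo_alloc_at_level(int level, pid_t target)
{
	if (target)
		return target;
	return demo_next[level]++;
}

/*
 * Walk from the innermost namespace (index 'level') out to the root
 * (index 0), as the loop in alloc_pid() does.  A NULL 'target_pids' or
 * a zero entry means "no preference" at that level.
 */
static void demo_alloc_pid(int level, const pid_t *target_pids, pid_t *out)
{
	int i;

	for (i = level; i >= 0; i--) {
		pid_t tpid = target_pids ? target_pids[i] : 0;

		out[i] = demo_alloc_at_level(i, tpid);
	}
}
```

A caller that only cares about the pid seen inside the innermost
namespace can leave the outer entries at zero.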

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 851ccd1..2811bdb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -940,6 +940,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1110,7 +1111,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 29cf119..10a6b3a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -280,21 +280,28 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	int tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp, 0);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = alloc_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 15/80] pids 5/7: Add target_pids parameter to copy_process()
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (13 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 14/80] pids 4/7: Add target_pids parameter to alloc_pid() Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 16/80] pids 6/7: Define do_fork_with_pids() Oren Laadan
                     ` (66 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Sukadev Bhattiprolu, Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

The new parameter will be used in a follow-on patch when clone_with_pids()
is implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 2811bdb..5156d02 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -935,12 +935,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1319,7 +1319,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1342,6 +1342,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1382,7 +1383,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 15/80] pids 5/7: Add target_pids parameter to copy_process()
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

The new parameter will be used in a follow-on patch when clone_with_pids()
is implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 2811bdb..5156d02 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -935,12 +935,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1319,7 +1319,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1342,6 +1342,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1382,7 +1383,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 15/80] pids 5/7: Add target_pids parameter to copy_process()
@ 2009-09-23 23:50   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

The new parameter will be used in a follow-on patch when clone_with_pids()
is implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 2811bdb..5156d02 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -935,12 +935,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1319,7 +1319,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1342,6 +1342,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1382,7 +1383,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 16/80] pids 6/7: Define do_fork_with_pids()
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (14 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 15/80] pids 5/7: Add target_pids parameter to copy_process() Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 17/80] pids 7/7: Define clone_with_pids syscall Oren Laadan
                     ` (65 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Sukadev Bhattiprolu, Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

do_fork_with_pids() is the same as do_fork(), except that it takes an
additional parameter, 'pid_set'. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-independent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/sched.h |    3 +++
 include/linux/types.h |    5 +++++
 kernel/fork.c         |   16 ++++++++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0f1ea4a..2fa783d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2054,6 +2054,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				struct target_pid_set __user *pid_set);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/include/linux/types.h b/include/linux/types.h
index c42724f..d9efefe 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,11 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct target_pid_set {
+	int num_pids;
+	pid_t *target_pids;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 5156d02..59b21db 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1332,12 +1332,13 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      struct target_pid_set __user *pid_setp)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1440,6 +1441,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 16/80] pids 6/7: Define do_fork_with_pids()
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

do_fork_with_pids() is the same as do_fork(), except that it takes an
additional parameter, 'pid_set'. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-independent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/sched.h |    3 +++
 include/linux/types.h |    5 +++++
 kernel/fork.c         |   16 ++++++++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0f1ea4a..2fa783d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2054,6 +2054,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				struct target_pid_set __user *pid_set);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/include/linux/types.h b/include/linux/types.h
index c42724f..d9efefe 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,11 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct target_pid_set {
+	int num_pids;
+	pid_t *target_pids;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 5156d02..59b21db 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1332,12 +1332,13 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      struct target_pid_set __user *pid_setp)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1440,6 +1441,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 16/80] pids 6/7: Define do_fork_with_pids()
@ 2009-09-23 23:50   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

do_fork_with_pids() is the same as do_fork(), except that it takes an
additional parameter, 'pid_set'. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-independent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/sched.h |    3 +++
 include/linux/types.h |    5 +++++
 kernel/fork.c         |   16 ++++++++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0f1ea4a..2fa783d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2054,6 +2054,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				struct target_pid_set __user *pid_set);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/include/linux/types.h b/include/linux/types.h
index c42724f..d9efefe 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,11 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct target_pid_set {
+	int num_pids;
+	pid_t *target_pids;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 5156d02..59b21db 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1332,12 +1332,13 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      struct target_pid_set __user *pid_setp)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1440,6 +1441,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 17/80] pids 7/7: Define clone_with_pids syscall
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (15 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 16/80] pids 6/7: Define do_fork_with_pids() Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 18/80] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan
                     ` (64 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Sukadev Bhattiprolu, Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone_with_pids(), intended for use during restart, is the same as clone(),
except that it takes a 'target_pid_set' parameter. This parameter lets the
caller choose specific pid numbers for the child process, in the process's
active and ancestor pid namespaces. (Descendant pid namespaces in general
don't matter since processes don't have pids in them anyway, but see the
comments in copy_target_pids() regarding CLONE_NEWPID).

Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
prevent unprivileged processes from misusing this interface.

Call clone_with_pids as follows:

	pid_t pids[] = { 0, 77, 99 };
	struct target_pid_set pid_set;

	pid_set.num_pids = sizeof(pids) / sizeof(pids[0]);
	pid_set.target_pids = pids;

	syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);

If a target-pid is 0, the kernel continues to assign a pid for the process in
that namespace. In the above example, pids[0] is 0, meaning the kernel will
assign the next available pid to the process in init_pid_ns. But the kernel
will assign pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2.
If either 77 or 99 is taken, the system call fails with -EBUSY.
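
The placement of requested pids can be modelled in user space. The sketch
below is only an illustration of the layout described in the
copy_target_pids() comments of this patch; place_target_pids() is a
hypothetical helper, not a kernel function. It assumes the caller lives in
knum_pids pid namespaces and asks for unum_pids pids, which land at the
"young" end of the array, with one zeroed slot appended for CLONE_NEWPID.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/*
 * User-space model of the array layout built by copy_target_pids().
 * req[]     : pids requested by the caller, oldest namespace first
 * unum_pids : number of entries in req[]
 * knum_pids : number of pid namespaces the caller is in (level + 1)
 * out[]     : must have room for knum_pids + 1 entries
 * Returns 0 on success, -1 if unum_pids is out of range.
 */
int place_target_pids(const pid_t *req, int unum_pids,
		      int knum_pids, pid_t *out)
{
	assert(out != NULL);

	if (unum_pids <= 0 || unum_pids > knum_pids)
		return -1;

	/* Unrequested namespaces and the CLONE_NEWPID slot stay 0. */
	memset(out, 0, (size_t)(knum_pids + 1) * sizeof(pid_t));

	/* Requested pids go to the youngest namespaces. */
	memcpy(&out[knum_pids - unum_pids], req,
	       (size_t)unum_pids * sizeof(pid_t));
	return 0;
}
```

With req = { 77, 99 } and knum_pids = 3, the model leaves entry 0 (the
oldest namespace) and the trailing CLONE_NEWPID entry at 0, so the kernel
would pick those pids itself.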

If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
the system call fails with -EINVAL.

It's mostly an exploratory patch seeking feedback on the interface.

NOTE:
	Compared to clone(), clone_with_pids() needs to pass in two more
	pieces of information:

		- number of pids in the set
		- user buffer containing the list of pids.

	But since clone() already takes 5 parameters, use a 'struct
	target_pid_set'.

TODO:
	- Gently tested.
	- May need additional sanity checks in do_fork_with_pids().

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/syscalls.h    |    2 +
 arch/x86/include/asm/unistd_32.h   |    1 +
 arch/x86/kernel/entry_32.S         |    1 +
 arch/x86/kernel/process_32.c       |   21 +++++++
 arch/x86/kernel/syscall_table_32.S |    1 +
 kernel/fork.c                      |  108 +++++++++++++++++++++++++++++++++++-
 6 files changed, 133 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 372b76e..df3c4a8 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -40,6 +40,8 @@ long sys_iopl(struct pt_regs *);
 
 /* kernel/process_32.c */
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
+int sys_vfork(struct pt_regs *);
 int sys_execve(struct pt_regs *);
 
 /* kernel/signal.c */
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 732a307..f65b750 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_clone_with_pids	337
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c097e7d..c7bd1f6 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -718,6 +718,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 59f4524..9965c06 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -443,6 +443,27 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+int sys_clone_with_pids(struct pt_regs *regs)
+{
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr;
+	int __user *child_tidptr;
+	void __user *upid_setp;
+
+	clone_flags = regs->bx;
+	newsp = regs->cx;
+	parent_tidptr = (int __user *)regs->dx;
+	child_tidptr = (int __user *)regs->di;
+	upid_setp = (void __user *)regs->bp;
+
+	if (!newsp)
+		newsp = regs->sp;
+
+	return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
+			child_tidptr, upid_setp);
+}
+
 /*
  * sys_execve() executes a new program.
  */
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d51321d..879e5ec 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long ptregs_clone_with_pids
diff --git a/kernel/fork.c b/kernel/fork.c
index 59b21db..f5a0cef 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1327,6 +1327,97 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(void __user *upid_setp)
+{
+	int j;
+	int rc;
+	int size;
+	int unum_pids;		/* # of pids specified by user */
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+	struct target_pid_set pid_set;
+
+	if (!upid_setp)
+		return NULL;
+
+	rc = copy_from_user(&pid_set, upid_setp, sizeof(pid_set));
+	if (rc)
+		return ERR_PTR(-EFAULT);
+
+	unum_pids = pid_set.num_pids;
+	knum_pids = task_pid(current)->level + 1;
+
+	if (!unum_pids)
+		return NULL;
+
+	if (unum_pids < 0 || unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], pid_set.target_pids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1343,7 +1434,7 @@ long do_fork_with_pids(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1377,6 +1468,17 @@ long do_fork_with_pids(unsigned long clone_flags,
 		}
 	}
 
+	target_pids = copy_target_pids(pid_setp);
+
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1438,6 +1540,10 @@ long do_fork_with_pids(unsigned long clone_flags,
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 17/80] pids 7/7: Define clone_with_pids syscall
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone_with_pids(), intended for use during restart, is the same as clone(),
except that it takes a 'target_pid_set' parameter. This parameter lets the
caller choose specific pid numbers for the child process, in the process's
active and ancestor pid namespaces. (Descendant pid namespaces in general
don't matter since processes don't have pids in them anyway, but see the
comments in copy_target_pids() regarding CLONE_NEWPID).

Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
prevent unprivileged processes from misusing this interface.

Call clone_with_pids as follows:

	pid_t pids[] = { 0, 77, 99 };
	struct target_pid_set pid_set;

	pid_set.num_pids = sizeof(pids) / sizeof(pids[0]);
	pid_set.target_pids = pids;

	syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);

If a target-pid is 0, the kernel continues to assign a pid for the process in
that namespace. In the above example, pids[0] is 0, meaning the kernel will
assign the next available pid to the process in init_pid_ns. But the kernel
will assign pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2.
If either 77 or 99 is taken, the system call fails with -EBUSY.

If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
the system call fails with -EINVAL.
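
The rules for 'num_pids' can be summarised in a few lines. This is a
user-space sketch of the checks made in copy_target_pids() below, not the
kernel code itself; check_num_pids() and the enum are hypothetical names
used only for illustration, and ns_level stands in for
task_pid(current)->level.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum pid_set_action {
	PID_SET_PLAIN_CLONE,	/* no target pids: behave like clone() */
	PID_SET_USE_TARGETS	/* target pids will be honoured */
};

/*
 * Classify a user-supplied num_pids for a caller at pid namespace
 * nesting level ns_level. Returns 0 and sets *action, or -EINVAL.
 */
int check_num_pids(int unum_pids, int ns_level, enum pid_set_action *action)
{
	int knum_pids = ns_level + 1;	/* namespaces the caller is in */

	assert(action != NULL);

	if (unum_pids == 0) {
		*action = PID_SET_PLAIN_CLONE;
		return 0;
	}

	if (unum_pids < 0 || unum_pids > knum_pids)
		return -EINVAL;

	*action = PID_SET_USE_TARGETS;
	return 0;
}
```

Note that the limit is level + 1: a caller at nesting level 2 is in three
pid namespaces, so it may pass up to three pids.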

It's mostly an exploratory patch seeking feedback on the interface.

NOTE:
	Compared to clone(), clone_with_pids() needs to pass in two more
	pieces of information:

		- number of pids in the set
		- user buffer containing the list of pids.

	But since clone() already takes 5 parameters, use a 'struct
	target_pid_set'.
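
	For illustration, a caller could bundle the two pieces of information
	as sketched below. This is an assumption-laden sketch: the previous
	patch defines 'struct target_pid_set' under __KERNEL__ in
	include/linux/types.h, so user space is assumed to carry its own copy
	of the definition, and make_pid_set() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Local copy of the layout introduced by the do_fork_with_pids() patch. */
struct target_pid_set {
	int num_pids;
	pid_t *target_pids;
};

/*
 * Bundle a pid array and its length into the single argument that
 * clone_with_pids() expects. The array is not copied; it must stay
 * valid until the system call returns.
 */
struct target_pid_set make_pid_set(pid_t *pids, size_t n)
{
	struct target_pid_set set;

	assert(pids != NULL || n == 0);

	set.num_pids = (int)n;
	set.target_pids = pids;
	return set;
}
```

	The resulting structure is what would be passed by address as the
	last argument of the syscall() invocation shown earlier.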

TODO:
	- Gently tested.
	- May need additional sanity checks in do_fork_with_pids().

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 arch/x86/include/asm/syscalls.h    |    2 +
 arch/x86/include/asm/unistd_32.h   |    1 +
 arch/x86/kernel/entry_32.S         |    1 +
 arch/x86/kernel/process_32.c       |   21 +++++++
 arch/x86/kernel/syscall_table_32.S |    1 +
 kernel/fork.c                      |  108 +++++++++++++++++++++++++++++++++++-
 6 files changed, 133 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 372b76e..df3c4a8 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -40,6 +40,8 @@ long sys_iopl(struct pt_regs *);
 
 /* kernel/process_32.c */
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
+int sys_vfork(struct pt_regs *);
 int sys_execve(struct pt_regs *);
 
 /* kernel/signal.c */
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 732a307..f65b750 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_clone_with_pids	337
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c097e7d..c7bd1f6 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -718,6 +718,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 59f4524..9965c06 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -443,6 +443,27 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+int sys_clone_with_pids(struct pt_regs *regs)
+{
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr;
+	int __user *child_tidptr;
+	void __user *upid_setp;
+
+	clone_flags = regs->bx;
+	newsp = regs->cx;
+	parent_tidptr = (int __user *)regs->dx;
+	child_tidptr = (int __user *)regs->di;
+	upid_setp = (void __user *)regs->bp;
+
+	if (!newsp)
+		newsp = regs->sp;
+
+	return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
+			child_tidptr, upid_setp);
+}
+
 /*
  * sys_execve() executes a new program.
  */
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d51321d..879e5ec 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long ptregs_clone_with_pids
diff --git a/kernel/fork.c b/kernel/fork.c
index 59b21db..f5a0cef 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1327,6 +1327,97 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(void __user *upid_setp)
+{
+	int j;
+	int rc;
+	int size;
+	int unum_pids;		/* # of pids specified by user */
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+	struct target_pid_set pid_set;
+
+	if (!upid_setp)
+		return NULL;
+
+	rc = copy_from_user(&pid_set, upid_setp, sizeof(pid_set));
+	if (rc)
+		return ERR_PTR(-EFAULT);
+
+	unum_pids = pid_set.num_pids;
+	knum_pids = task_pid(current)->level + 1;
+
+	if (!unum_pids)
+		return NULL;
+
+	if (unum_pids < 0 || unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], pid_set.target_pids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1343,7 +1434,7 @@ long do_fork_with_pids(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1377,6 +1468,17 @@ long do_fork_with_pids(unsigned long clone_flags,
 		}
 	}
 
+	target_pids = copy_target_pids(pid_setp);
+
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1438,6 +1540,10 @@ long do_fork_with_pids(unsigned long clone_flags,
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 17/80] pids 7/7: Define clone_with_pids syscall
@ 2009-09-23 23:50   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Sukadev Bhattiprolu

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone_with_pids(), intended for use during restart, is the same as clone(),
except that it takes a 'target_pid_set' parameter. This parameter lets the
caller choose specific pid numbers for the child process, in the process's
active and ancestor pid namespaces. (Descendant pid namespaces in general
don't matter since processes don't have pids in them anyway, but see comments
in copy_target_pids() regarding CLONE_NEWPID).

Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
prevent unprivileged processes from misusing this interface.

Call clone_with_pids as follows:

	pid_t pids[] = { 0, 77, 99 };
	struct target_pid_set pid_set;

	pid_set.num_pids = sizeof(pids) / sizeof(pids[0]);
	pid_set.target_pids = pids;

	syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);

If a target-pid is 0, the kernel assigns a pid for the process in that
namespace as usual. In the above example, pids[0] is 0, meaning the kernel
will assign the next available pid to the process in init_pid_ns. But the
kernel will assign pid 77 in the child pid namespace 1 and pid 99 in pid
namespace 2. If either 77 or 99 is taken, the system call fails with -EBUSY.

If 'pid_set.num_pids' exceeds the number of pid namespaces in which the
caller has a pid (its nesting level plus one), the system call fails with
-EINVAL.
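
The placement rules above can be modelled in userspace. The following is a
minimal sketch, not the kernel code: layout_target_pids() is a hypothetical
helper that mirrors the validation and back-filling done by
copy_target_pids() in this patch, with the caller's nesting level passed in
explicitly instead of being read from task_pid(current)->level.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/*
 * Userspace model (an illustration, not kernel code) of the slot layout.
 * 'level' is the caller's pid-namespace nesting level (0 for init_pid_ns),
 * so the caller has level + 1 pids. 'out' must have room for level + 2
 * entries: one per existing namespace plus a trailing 0 for CLONE_NEWPID.
 *
 * Note: the kernel treats unum_pids == 0 as a plain clone(); this model
 * simply produces an all-zero layout in that case.
 */
static int layout_target_pids(const pid_t *upids, int unum_pids,
			      int level, pid_t *out)
{
	int knum_pids = level + 1;	/* pids needed in the kernel */
	int j;

	assert(out != NULL);

	if (unum_pids < 0 || unum_pids > knum_pids)
		return -EINVAL;

	/* Zero means "let the kernel pick a pid" in that namespace. */
	memset(out, 0, (size_t)(knum_pids + 1) * sizeof(*out));

	/* Requested pids belong to the youngest namespaces: fill the tail. */
	j = knum_pids - unum_pids;
	if (unum_pids > 0)
		memcpy(&out[j], upids, (size_t)unum_pids * sizeof(*out));

	return 0;
}
```

For the example above (pids = { 0, 77, 99 }) at nesting level 2 this yields
{ 0, 77, 99, 0 }. At nesting level 3 the same request yields
{ 0, 0, 77, 99, 0 }, leaving the extra ancestor namespace to the kernel, and
at nesting level 1 it is rejected with -EINVAL.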

It's mostly an exploratory patch seeking feedback on the interface.

NOTE:
	Compared to clone(), clone_with_pids() needs to pass in two more
	pieces of information:

		- number of pids in the set
		- user buffer containing the list of pids.

	But since clone() already takes 5 parameters, use a 'struct
	target_pid_set'.

TODO:
	- Gently tested.
	- May need additional sanity checks in do_fork_with_pids().

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 arch/x86/include/asm/syscalls.h    |    2 +
 arch/x86/include/asm/unistd_32.h   |    1 +
 arch/x86/kernel/entry_32.S         |    1 +
 arch/x86/kernel/process_32.c       |   21 +++++++
 arch/x86/kernel/syscall_table_32.S |    1 +
 kernel/fork.c                      |  108 +++++++++++++++++++++++++++++++++++-
 6 files changed, 133 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 372b76e..df3c4a8 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -40,6 +40,8 @@ long sys_iopl(struct pt_regs *);
 
 /* kernel/process_32.c */
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
+int sys_vfork(struct pt_regs *);
 int sys_execve(struct pt_regs *);
 
 /* kernel/signal.c */
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 732a307..f65b750 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_clone_with_pids	337
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c097e7d..c7bd1f6 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -718,6 +718,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 59f4524..9965c06 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -443,6 +443,27 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+int sys_clone_with_pids(struct pt_regs *regs)
+{
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr;
+	int __user *child_tidptr;
+	void __user *upid_setp;
+
+	clone_flags = regs->bx;
+	newsp = regs->cx;
+	parent_tidptr = (int __user *)regs->dx;
+	child_tidptr = (int __user *)regs->di;
+	upid_setp = (void __user *)regs->bp;
+
+	if (!newsp)
+		newsp = regs->sp;
+
+	return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
+			child_tidptr, upid_setp);
+}
+
 /*
  * sys_execve() executes a new program.
  */
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d51321d..879e5ec 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long ptregs_clone_with_pids
diff --git a/kernel/fork.c b/kernel/fork.c
index 59b21db..f5a0cef 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1327,6 +1327,97 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(void __user *upid_setp)
+{
+	int j;
+	int rc;
+	int size;
+	int unum_pids;		/* # of pids specified by user */
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+	struct target_pid_set pid_set;
+
+	if (!upid_setp)
+		return NULL;
+
+	rc = copy_from_user(&pid_set, upid_setp, sizeof(pid_set));
+	if (rc)
+		return ERR_PTR(-EFAULT);
+
+	unum_pids = pid_set.num_pids;
+	knum_pids = task_pid(current)->level + 1;
+
+	if (!unum_pids)
+		return NULL;
+
+	if (unum_pids < 0 || unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], pid_set.target_pids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1343,7 +1434,7 @@ long do_fork_with_pids(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1377,6 +1468,17 @@ long do_fork_with_pids(unsigned long clone_flags,
 		}
 	}
 
+	target_pids = copy_target_pids(pid_setp);
+
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1438,6 +1540,10 @@ long do_fork_with_pids(unsigned long clone_flags,
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 18/80] c/r: create syscalls: sys_checkpoint, sys_restart
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (16 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 17/80] pids 7/7: Define clone_with_pids syscall Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 19/80] c/r: documentation Oren Laadan
                     ` (63 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Dave Hansen, Pavel Emelyanov

Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a pid, a file descriptor (for the image file) and
flags as arguments. The pid identifies the top-most (root) task in the
process tree, e.g. the container init: for sys_checkpoint the first
argument identifies the pid of the target container/subtree; for
sys_restart it will identify the pid of the restarting root task.

A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.

By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.
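
As a rough illustration of the fork()-like convention, a caller could sort
the raw return value into three cases. This is only a sketch based on the
kerneldoc in this patch (positive identifier on success, 0 when returning
from restart, negative on error); the enum and classify_checkpoint_return()
are hypothetical names, and the value is assumed to be the raw kernel return
rather than the -1/errno form a libc syscall() wrapper would give.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, used only for this illustration. */
enum cr_outcome {
	CR_ERROR,		/* negative return: checkpoint failed */
	CR_RESTARTED,		/* zero: we are resuming after a restart */
	CR_CHECKPOINTED,	/* positive: checkpoint taken, id returned */
};

/*
 * Classify the raw return value of sys_checkpoint. On CR_ERROR, *detail
 * holds the positive error code; on CR_CHECKPOINTED it holds the
 * checkpoint identifier; on CR_RESTARTED it is 0.
 */
static enum cr_outcome classify_checkpoint_return(long ret, long *detail)
{
	assert(detail != NULL);

	if (ret < 0) {
		*detail = -ret;
		return CR_ERROR;
	}
	if (ret == 0) {
		*detail = 0;
		return CR_RESTARTED;
	}
	*detail = ret;
	return CR_CHECKPOINTED;
}
```

A task that checkpoints itself would therefore branch on the outcome much
like code following fork() branches on parent versus child.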

We don't use copy_from_user()/copy_to_user() because it requires
holding the entire image in user space, and does not make sense for
restart.  Also, we don't use a pipe, pseudo-fs file and the like,
because they work by generating data on demand as the user pulls it
(unless the entire image is buffered in the kernel) and would require
more complex logic.  They also would significantly complicate
checkpoint that includes self.

Changelog[v18]:
  - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
Changelog[v17]:
  - Move checkpoint closer to namespaces (kconfig)
  - Kill "Enable" in c/r config option
Changelog[v16]:
  - Change sys_restart() first argument to be 'pid_t pid'
Changelog[v14]:
  - Change CONFIG_CHEKCPOINT_RESTART to CONFIG_CHECKPOINT (Ingo)
  - Remove line 'def_bool n' (default is already 'n')
  - Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
Changelog[v5]:
  - Config is 'def_bool n' by default

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 Makefile                           |    2 +-
 arch/x86/Kconfig                   |    4 +++
 arch/x86/include/asm/unistd_32.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 checkpoint/Kconfig                 |   14 ++++++++++++
 checkpoint/Makefile                |    5 ++++
 checkpoint/sys.c                   |   41 ++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h           |    2 +
 init/Kconfig                       |    2 +
 kernel/sys_ni.c                    |    4 +++
 10 files changed, 77 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/Kconfig
 create mode 100644 checkpoint/Makefile
 create mode 100644 checkpoint/sys.c

diff --git a/Makefile b/Makefile
index 60de4ef..fe45658 100644
--- a/Makefile
+++ b/Makefile
@@ -401,7 +401,7 @@ endif
 # of make so .config is not included in this case either (for *config).
 
 no-dot-config-targets := clean mrproper distclean \
-			 cscope TAGS tags help %docs check% \
+			 cscope TAGS tags help %docs checkstack \
 			 include/linux/version.h headers_% \
 			 kernelrelease kernelversion
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 13ffa5d..5776452 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -86,6 +86,10 @@ config STACKTRACE_SUPPORT
 config HAVE_LATENCYTOP_SUPPORT
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if X86_32
+
 config FAST_CMPXCHG_LOCAL
 	bool
 	default y
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f65b750..c25971b 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,6 +343,8 @@
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
 #define __NR_clone_with_pids	337
+#define __NR_checkpoint		338
+#define __NR_restart		339
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 879e5ec..4741554 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,5 @@ ENTRY(sys_call_table)
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
 	.long ptregs_clone_with_pids
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..ef7d406
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,14 @@
+# Architectures should define CHECKPOINT_SUPPORT when they have
+# implemented the hooks for processor state etc. needed by the
+# core checkpoint/restart code.
+
+config CHECKPOINT
+	bool "Checkpoint/restart (EXPERIMENTAL)"
+	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..8a32c6f
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..79936cc
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,41 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which dump the checkpoint image
+ * @flags: checkpoint operation flags
+ *
+ * Returns positive identifier on success, 0 when returning from restart
+ * or negative value on error
+ */
+SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
+{
+	return -ENOSYS;
+}
+
+/**
+ * sys_restart - restart a container
+ * @pid: pid of task root (in coordinator's namespace), or 0
+ * @fd: file from which read the checkpoint image
+ * @flags: restart operation flags
+ *
+ * Returns negative value on error, or otherwise returns in the realm
+ * of the original checkpoint
+ */
+SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
+{
+	return -ENOSYS;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 80de700..33bce6e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -754,6 +754,8 @@ asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
 asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
 			  struct timespec __user *, const sigset_t __user *,
 			  size_t);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(pid_t pid, int fd, unsigned long flags);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
diff --git a/init/Kconfig b/init/Kconfig
index 46ee2c8..46329b9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -715,6 +715,8 @@ config NET_NS
 	  Allow user space to create what appear to be multiple instances
 	  of the network stack.
 
+source "checkpoint/Kconfig"
+
 config BLK_DEV_INITRD
 	bool "Initial RAM filesystem and RAM disk (initramfs/initrd) support"
 	depends on BROKEN || !FRV
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 68320f6..32f3f26 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -178,3 +178,7 @@ cond_syscall(sys_eventfd2);
 
 /* performance counters: */
 cond_syscall(sys_perf_counter_open);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 18/80] c/r: create syscalls: sys_checkpoint, sys_restart
@ 2009-09-23 23:50   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan, Dave Hansen

Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a pid, a file descriptor (for the image file) and
flags as arguments. The pid identifies the top-most (root) task in the
process tree, e.g. the container init: for sys_checkpoint the first
argument identifies the pid of the target container/subtree; for
sys_restart it will identify the pid of the restarting root task.

A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.

By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.

We don't use copy_from_user()/copy_to_user() because it requires
holding the entire image in user space, and does not make sense for
restart.  Also, we don't use a pipe, pseudo-fs file and the like,
because they work by generating data on demand as the user pulls it
(unless the entire image is buffered in the kernel) and would require
more complex logic.  They also would significantly complicate
checkpoint that includes self.

Changelog[v18]:
  - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
Changelog[v17]:
  - Move checkpoint closer to namespaces (kconfig)
  - Kill "Enable" in c/r config option
Changelog[v16]:
  - Change sys_restart() first argument to be 'pid_t pid'
Changelog[v14]:
  - Change CONFIG_CHEKCPOINT_RESTART to CONFIG_CHECKPOINT (Ingo)
  - Remove line 'def_bool n' (default is already 'n')
  - Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
Changelog[v5]:
  - Config is 'def_bool n' by default

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 Makefile                           |    2 +-
 arch/x86/Kconfig                   |    4 +++
 arch/x86/include/asm/unistd_32.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 checkpoint/Kconfig                 |   14 ++++++++++++
 checkpoint/Makefile                |    5 ++++
 checkpoint/sys.c                   |   41 ++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h           |    2 +
 init/Kconfig                       |    2 +
 kernel/sys_ni.c                    |    4 +++
 10 files changed, 77 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/Kconfig
 create mode 100644 checkpoint/Makefile
 create mode 100644 checkpoint/sys.c

diff --git a/Makefile b/Makefile
index 60de4ef..fe45658 100644
--- a/Makefile
+++ b/Makefile
@@ -401,7 +401,7 @@ endif
 # of make so .config is not included in this case either (for *config).
 
 no-dot-config-targets := clean mrproper distclean \
-			 cscope TAGS tags help %docs check% \
+			 cscope TAGS tags help %docs checkstack \
 			 include/linux/version.h headers_% \
 			 kernelrelease kernelversion
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 13ffa5d..5776452 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -86,6 +86,10 @@ config STACKTRACE_SUPPORT
 config HAVE_LATENCYTOP_SUPPORT
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if X86_32
+
 config FAST_CMPXCHG_LOCAL
 	bool
 	default y
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f65b750..c25971b 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,6 +343,8 @@
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
 #define __NR_clone_with_pids	337
+#define __NR_checkpoint		338
+#define __NR_restart		339
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 879e5ec..4741554 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,5 @@ ENTRY(sys_call_table)
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
 	.long ptregs_clone_with_pids
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..ef7d406
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,14 @@
+# Architectures should define CHECKPOINT_SUPPORT when they have
+# implemented the hooks for processor state etc. needed by the
+# core checkpoint/restart code.
+
+config CHECKPOINT
+	bool "Checkpoint/restart (EXPERIMENTAL)"
+	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..8a32c6f
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..79936cc
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,41 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which the checkpoint image is dumped
+ * @flags: checkpoint operation flags
+ *
+ * Returns positive identifier on success, 0 when returning from restart
+ * or negative value on error
+ */
+SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
+{
+	return -ENOSYS;
+}
+
+/**
+ * sys_restart - restart a container
+ * @pid: pid of task root (in coordinator's namespace), or 0
+ * @fd: file from which the checkpoint image is read
+ * @flags: restart operation flags
+ *
+ * Returns negative value on error, or otherwise returns in the realm
+ * of the original checkpoint
+ */
+SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
+{
+	return -ENOSYS;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 80de700..33bce6e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -754,6 +754,8 @@ asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
 asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
 			  struct timespec __user *, const sigset_t __user *,
 			  size_t);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(pid_t pid, int fd, unsigned long flags);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
diff --git a/init/Kconfig b/init/Kconfig
index 46ee2c8..46329b9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -715,6 +715,8 @@ config NET_NS
 	  Allow user space to create what appear to be multiple instances
 	  of the network stack.
 
+source "checkpoint/Kconfig"
+
 config BLK_DEV_INITRD
 	bool "Initial RAM filesystem and RAM disk (initramfs/initrd) support"
 	depends on BROKEN || !FRV
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 68320f6..32f3f26 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -178,3 +178,7 @@ cond_syscall(sys_eventfd2);
 
 /* performance counters: */
 cond_syscall(sys_perf_counter_open);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 19/80] c/r: documentation
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (17 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 18/80] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 20/80] c/r: basic infrastructure for checkpoint/restart Oren Laadan
                     ` (62 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Dave Hansen, Pavel Emelyanov

Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.
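
As a rough illustration of why the record format in readme.txt is easy
for userspace tools to skip through, the sketch below walks a buffer of
records using only the 'struct ckpt_hdr' pre-header (@len counts the
pre-header itself). It is a sketch, not the actual tool code: the
function name is hypothetical, fixed-width userspace types stand in for
__s32, and it assumes the image is parsed on the same architecture
(same byte order) that produced it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the pre-header described in readme.txt. */
struct ckpt_hdr {
	int32_t type;
	int32_t len;	/* record length in bytes, pre-header included */
};

/*
 * Hypothetical helper: count the records in buf[0..size).
 * Returns the number of records, or -1 if a record is truncated or
 * claims a length smaller than its own pre-header.
 */
static long ckpt_count_records(const unsigned char *buf, size_t size)
{
	size_t off = 0;
	long n = 0;

	while (off < size) {
		struct ckpt_hdr h;

		if (size - off < sizeof(h))
			return -1;
		/* memcpy avoids unaligned access into the byte buffer */
		memcpy(&h, buf + off, sizeof(h));
		if (h.len < (int32_t)sizeof(h) || (size_t)h.len > size - off)
			return -1;
		off += (size_t)h.len;	/* skip payload without parsing it */
		n++;
	}
	return n;
}
```

The walker never interprets a payload, which is the property the
format is meant to provide: a tool can skip record types it does not
know.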

Changelog[v16]:
  - Update documentation
  - Unify into readme.txt and usage.txt
Changelog[v14]:
  - Discard the 'h.parent' field
  - New image format (shared objects appear before they are referenced
    unless they are compound)
Changelog[v8]:
  - Split into multiple files in Documentation/checkpoint/...
  - Extend documentation, fix typos and comments from feedback

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 Documentation/checkpoint/checkpoint.c      |   32 +++
 Documentation/checkpoint/checkpoint_self.c |   57 +++++
 Documentation/checkpoint/readme.txt        |  360 ++++++++++++++++++++++++++++
 Documentation/checkpoint/restart_self.c    |   20 ++
 Documentation/checkpoint/test.c            |   48 ++++
 Documentation/checkpoint/usage.txt         |  229 ++++++++++++++++++
 6 files changed, 746 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint/checkpoint.c
 create mode 100644 Documentation/checkpoint/checkpoint_self.c
 create mode 100644 Documentation/checkpoint/readme.txt
 create mode 100644 Documentation/checkpoint/restart_self.c
 create mode 100644 Documentation/checkpoint/test.c
 create mode 100644 Documentation/checkpoint/usage.txt

diff --git a/Documentation/checkpoint/checkpoint.c b/Documentation/checkpoint/checkpoint.c
new file mode 100644
index 0000000..094408c
--- /dev/null
+++ b/Documentation/checkpoint/checkpoint.c
@@ -0,0 +1,32 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int ret;
+
+	if (argc != 2) {
+		printf("usage: ckpt PID\n");
+		exit(1);
+	}
+
+	pid = atoi(argv[1]);
+	if (pid <= 0) {
+		printf("invalid pid\n");
+		exit(1);
+	}
+
+	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+
+	if (ret < 0)
+		perror("checkpoint");
+	else
+		printf("checkpoint id %d\n", ret);
+
+	return (ret > 0 ? 0 : 1);
+}
+
diff --git a/Documentation/checkpoint/checkpoint_self.c b/Documentation/checkpoint/checkpoint_self.c
new file mode 100644
index 0000000..febb888
--- /dev/null
+++ b/Documentation/checkpoint/checkpoint_self.c
@@ -0,0 +1,57 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	FILE *file;
+	int i, ret;
+	float a;
+
+	close(0);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+
+		if (i == 2) {
+			ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+			if (ret < 0) {
+				fprintf(file, "ckpt: %s\n", strerror(errno));
+				exit(2);
+			}
+			fprintf(file, "checkpoint ret: %d\n", ret);
+			fflush(file);
+		}
+	}
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
new file mode 100644
index 0000000..571c469
--- /dev/null
+++ b/Documentation/checkpoint/readme.txt
@@ -0,0 +1,360 @@
+
+	      Checkpoint-Restart support in the Linux kernel
+	==========================================================
+
+Copyright (C) 2008-2009 Oren Laadan
+
+Author:		Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+
+Contributors:	Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
+		Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
+		Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
+		Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
+		Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
+		Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
+		Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
+
+
+Introduction
+============
+
+Application checkpoint/restart [C/R] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. C/R can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+Compared to hypervisor approaches, application C/R is more lightweight
+since it need only save the state associated with applications, while
+operating system data structures (e.g. buffer cache, drivers state
+and the like) are uninteresting.
+
+
+Overall design
+==============
+
+Checkpoint and restart are done in the kernel as much as possible.
+Two new system calls are introduced to provide C/R: sys_checkpoint()
+and sys_restart(). They both operate on a process tree (hierarchy),
+either a whole container or a subtree of a container.
+
+Checkpointing entire containers ensures that there are no dependencies
+on anything outside the container, which guarantees that a matching
+restart will succeed (assuming that the file system state remains
+consistent). However, it requires that users will always run the tasks
+that they wish to checkpoint inside containers. This is ideal for,
+e.g., private virtual servers and the like.
+
+In contrast, when checkpointing a subtree of a container it is up to
+the user to ensure that dependencies either don't exist or can be
+safely ignored. This is useful, for instance, for HPC scenarios or
+even a user that would like to periodically checkpoint a long-running
+batch job.
+
+An additional system call, a la madvise(), is planned, so that tasks
+can advise the kernel how to handle specific resources. For instance,
+a task could ask to skip a memory area at checkpoint to save space,
+or to use a preset file descriptor at restart instead of restoring it
+from the checkpoint image. It will provide the flexibility that is
+particularly useful to address the needs of a diverse crowd of users
+and use-cases.
+
+Syscall sys_checkpoint() is given a pid that indicates the top of the
+hierarchy, a file descriptor to store the image, and flags. The code
+serializes internal user- and kernel-state and writes it out to the
+file descriptor. The resulting image is stream-able. The processes are
+expected to be frozen for the duration of the checkpoint.
+
+In general, a checkpoint consists of 5 steps:
+1. Pre-dump
+2. Freeze the container/subtree
+3. Save tasks' and kernel state		<-- sys_checkpoint()
+4. Thaw (or kill) the container/subtree
+5. Post-dump
+
+Step 3 is done by calling sys_checkpoint(). Steps 1 and 5 are an
+optimization to reduce application downtime. In particular, "pre-dump"
+works before freezing the container, e.g. the pre-copy for live
+migration, and "post-dump" works after the container resumes
+execution, e.g. write-back the data to secondary storage.
+
+The kernel exports a relatively opaque 'blob' of data to userspace
+which can then be handed to the new kernel at restart time.  The
+'blob' contains data and state of select portions of kernel structures
+such as VMAs and mm_structs, as well as copies of the actual memory
+that the tasks use. Any changes in this blob's format between kernel
+revisions can be handled by an in-userspace conversion program.
+
+To restart, userspace first creates a process hierarchy that matches
+that of the checkpoint, and each task calls sys_restart(). The syscall
+reads the saved kernel state from a file descriptor, and re-creates
+the resources that the tasks need to resume execution. The restart
+code is executed by each task that is restored in the new hierarchy to
+reconstruct its own state.
+
+In general, a restart consists of 3 steps:
+1. Create hierarchy
+2. Restore tasks' and kernel state	<-- sys_restart()
+3. Resume userspace (or freeze tasks)
+
+Because the process hierarchy is created in userspace during restart,
+the restarting tasks have the flexibility to prepare before calling
+sys_restart().
+
+
+Checkpoint image format
+=======================
+
+The checkpoint image format is built of records that consist of a
+pre-header identifying its contents, followed by a payload. This
+format allows userspace tools to easily parse and skip through the
+image without requiring intimate knowledge of the data. It will also
+be handy to enable parallel checkpointing in the future where multiple
+threads interleave data from multiple processes into a single stream.
+
+The pre-header is defined by 'struct ckpt_hdr' as follows: @type
+identifies the type of the payload, @len tells its length in bytes
+including the pre-header.
+
+struct ckpt_hdr {
+	__s32 type;
+	__s32 len;
+};
+
+The pre-header must be the first component in all other headers. For
+instance, the task data is saved in 'struct ckpt_hdr_task', which
+looks something like this:
+
+struct ckpt_hdr_task {
+	struct ckpt_hdr h;
+	__u32 pid;
+	...
+};
+
+THE IMAGE FORMAT IS EXPECTED TO CHANGE over time as more features are
+supported, or as existing features change in the kernel and require
+their representation to be adjusted. Any such changes will be handled
+by in-userspace conversion tools.
+
+The general format of the checkpoint image is as follows:
+1. Image header
+2. Task hierarchy
+3. Tasks' state
+4. Image trailer
+
+The image always begins with a general header that holds a magic
+number, an architecture identifier (little endian format), a format
+version number (@rev), followed by information about the kernel
+(currently version and UTS data). It also holds the time of the
+checkpoint and the flags given to sys_checkpoint(). This header is
+followed by an arch-specific header.
+
+The task hierarchy comes next so that userspace tools can read it
+early (even from a stream) and re-create the restarting tasks. This is
+basically an array of all checkpointed tasks, and their relationships
+(parent, siblings, threads, etc).
+
+Then the state of all tasks is saved, in the order that they appear in
+the tasks array above. For each state, we save data like task_struct,
+namespaces, open files, memory layout, memory contents, cpu state,
+signals and signal handlers, etc. For resources that are shared among
+multiple processes, we first checkpoint said resource (and only once),
+and in the task data we give a reference to it. More about shared
+resources below.
+
+Finally, the image always ends with a trailer that holds a (different)
+magic number, serving for sanity check.
+
+
+Shared objects
+==============
+
+Many resources may be shared by multiple tasks (e.g. file descriptors,
+memory address space, etc), or even have multiple references from
+other resources (e.g. a single inode that represents two ends of a
+pipe).
+
+Shared objects are tracked using a hash table (objhash) to ensure that
+they are only checkpointed or restored once. To handle a shared
+object, it is first looked up in the hash table, to determine if it is
+the first encounter or a recurring appearance.  The hash table itself
+is not saved as part of the checkpoint image: it is constructed
+dynamically during both checkpoint and restart, and discarded at the
+end of the operation.
+
+During checkpoint, when a shared object is encountered for the first
+time, it is inserted to the hash table, indexed by its kernel address.
+It is assigned an identifier (@objref) in order of appearance, and
+then its state is saved. Subsequent lookups of that object in the hash
+will yield that entry, in which case only the @objref is saved, as
+opposed to the entire state of the object.
+
+During restart, shared objects are indexed by their @objref as given
+during the checkpoint. On the first appearance of each shared object,
+a new resource will be created and its state restored from the image.
+Then the object is added to the hash table. Subsequent lookups of the
+same unique identifier in the hash table will yield that entry, and
+then the existing object instance is reused instead of creating
+a new one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+Shared objects are thus saved when they are first seen, and _before_
+the parent object that uses them. Therefore by the time the parent
+object needs them, they should already be in the objhash. The one
+exception is when more than a single shared resource will be restarted
+at once (e.g. the two ends of a pipe, or all the namespaces in an
+nsproxy). In this case the parent object is dumped first, followed by
+the individual sub-resources.
+
+The checkpoint image is stream-able, meaning that restarting from it
+may not require lseek(). This is enforced at checkpoint time, by
+carefully selecting the order of shared objects, to respect the rule
+that an object is always saved before the objects that refer to it.
+
+
+Memory contents format
+======================
+
+The memory contents of a given memory address space (->mm) is dumped
+as a sequence of vma objects, represented by 'struct ckpt_hdr_vma'.
+This header details the vma properties, and a reference to a file
+(if file backed) or an inode (or shared memory) object.
+
+The vma header is followed by the actual contents - but only those
+pages that need to be saved, i.e. dirty pages. They are written in
+chunks of data, where each chunk contains a header that indicates
+the number of pages in the chunk, followed by an array of virtual
+addresses and then an array of actual page contents. The last chunk
+holds zero pages.
+
+To illustrate this, consider a single simple task with two vmas: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The memory dump will look like this:
+
+	ckpt_hdr + ckpt_hdr_vma
+		ckpt_hdr_pgarr (nr_pages = 2)
+			addr1, addr2
+			page1, page2
+		ckpt_hdr_pgarr (nr_pages = 0)
+	ckpt_hdr + ckpt_hdr_vma
+		ckpt_hdr_pgarr (nr_pages = 3)
+			addr3, addr4, addr5
+			page3, page4, page5
+		ckpt_hdr_pgarr (nr_pages = 0)
+
+
+Error handling
+==============
+
+Both checkpoint and restart operations may fail due to a variety of
+reasons. Using a simple, single return value from the system call is
+insufficient to report the reason of a failure.
+
+Checkpoint - to provide informative status report upon failure, the
+checkpoint image may contain one (or more) error objects, 'struct
+ckpt_hdr_err'.  An error object consists of a mandatory pre-header
+followed by a null character ('\0'), and then a string that describes
+the error. By default, if an error occurs, this will be the last
+object written to the checkpoint image.
+
+Upon failure, the caller can examine the image (e.g. with 'ckptinfo')
+and extract the detailed error message. The leading '\0' is useful if
+one wants to seek back from the end of the checkpoint image, instead
+of parsing the entire image separately.
+
+Restart - to be defined.
+
+
+Security
+========
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it. Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+access mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image.  As credentials are restored too,
+the ability of a task that calls sys_restart() to setresuid/setresgid
+to those values must be checked.
+
+Keeping the restart procedure to operate within the limits of the
+caller's credentials means that there are various scenarios that cannot
+be supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+However, this can be controlled with a sysctl-variable.
+
+
+Kernel interfaces
+=================
+
+* To checkpoint a vma, the 'struct vm_operations_struct' needs to
+  provide a method ->checkpoint:
+    int checkpoint(struct ckpt_ctx *, struct vm_area_struct *)
+  Restart requires a matching (exported) restore:
+    int restore(struct ckpt_ctx *, struct mm_struct *, struct ckpt_hdr_vma *)
+
+* To checkpoint a file, the 'struct file_operations' needs to provide
+  the methods ->checkpoint and ->collect:
+    int checkpoint(struct ckpt_ctx *, struct file *)
+    int collect(struct ckpt_ctx *, struct file *)
+  Restart requires a matching (exported) restore:
+    int restore(struct ckpt_ctx *, struct ckpt_hdr_file *)
+  For most file systems, generic_file_{checkpoint,restore}() can be
+  used.
+
+* To checkpoint a socket, the 'struct proto_ops' needs to provide
+  the methods ->checkpoint, ->collect and ->restore:
+    int checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
+    int collect(struct ckpt_ctx *ctx, struct socket *sock);
+    int restore(struct ckpt_ctx *, struct socket *sock, struct ckpt_hdr_socket *h)
+
diff --git a/Documentation/checkpoint/restart_self.c b/Documentation/checkpoint/restart_self.c
new file mode 100644
index 0000000..288209d
--- /dev/null
+++ b/Documentation/checkpoint/restart_self.c
@@ -0,0 +1,20 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+	if (ret < 0)
+		perror("restart");
+
+	printf("should not reach here !\n");
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/test.c b/Documentation/checkpoint/test.c
new file mode 100644
index 0000000..1183655
--- /dev/null
+++ b/Documentation/checkpoint/test.c
@@ -0,0 +1,48 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	FILE *file;
+	float a;
+	int i;
+
+	close(0);
+	close(1);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+	}
+
+	fprintf(file, "world, hello (%.2f) !\n", a);
+	fflush(file);
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..4c75441
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,229 @@
+
+	      How to use Checkpoint-Restart
+	=========================================
+
+
+API
+===
+
+The API consists of three new system calls:
+
+* long checkpoint(pid_t pid, int fd, unsigned long flags);
+
+ Checkpoint a (sub-)container whose root task is identified by @pid,
+ to the open file indicated by @fd. @flags may be one or more of:
+   - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
+ (other values are not allowed).
+
+ Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
+ it returns from a restart, and -1 if an error occurs. The ckptid will
+ uniquely identify a checkpoint image, for as long as the checkpoint
+ is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
+ partial checkpoint, residing in kernel memory).
+
+* long restart(pid_t pid, int fd, unsigned long flags);
+
+ Restart a process hierarchy from a checkpoint image that is read from
+ the blob stored in the file indicated by @fd. @pid indicates the root
+ of the hierarchy as seen in the coordinator's pid-namespace, and is
+ expected to be a child of the coordinator. @flags may be one or more
+ of:
+   - RESTART_TASKSELF : (self) restart of a single process
+   - RESTART_FROZEN : processes remain frozen once restart completes
+ (Note that this argument may mean 'ckptid' to identify an in-kernel
+ checkpoint image, with some @flags in the future).
+
+ Returns: -1 if an error occurs, 0 on success when restarting from a
+ "self" checkpoint, and return value of system call at the time of the
+ checkpoint when restarting from an "external" checkpoint.
+
+ (If a process was frozen for checkpoint while in userspace, it will
+ resume running in userspace exactly where it was interrupted. If it
+ was frozen while in kernel doing a syscall, it will return what the
+ syscall returned when interrupted/completed, and proceed from there
+ as if it had only been frozen and then thawed. Finally, if it did a
+ self-checkpoint, it will resume to the first instruction after the
+ call to checkpoint(2), having returned 0, to indicate whether the
+ return is from the checkpoint or a restart).
+
+* int clone_with_pids(unsigned long clone_flags, void *news,
+		      int *parent_tidptr, int *child_tidptr,
+		      struct target_pid_set *pid_set)
+
+  struct target_pid_set {
+	 int num_pids;
+	 pid_t *target_pids;
+  };
+
+ Container restart requires that a task have the same pid it had when
+ it was checkpointed. When containers are nested the tasks within the
+ containers exist in multiple pid namespaces and hence have multiple
+ pids to specify during restart.
+
+ clone_with_pids(), intended for use during restart, is similar to
+ clone(), except that it takes a 'target_pid_set' parameter. This
+ parameter lets caller choose specific pid numbers for the child
+ process, in the process's active and ancestor pid namespaces.
+
+ Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for
+ now, to prevent unprivileged processes from misusing this interface.
+
+ If a target-pid is 0, the kernel continues to assign a pid for the
+ process in that namespace. If a requested pid is taken, the system
+ call fails with -EBUSY. If 'pid_set.num_pids' exceeds the current
+ nesting level of pid namespaces, the system call fails with -EINVAL.
+
+
+Sysctl/proc
+===========
+
+/proc/sys/kernel/ckpt_unpriv_allowed		[default = 1]
+  controls whether c/r operation is allowed for unprivileged users
+
+
+Operation
+=========
+
+The granularity of a checkpoint usually is a process hierarchy. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 in the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+Unless the CHECKPOINT_SUBTREE flag is set, if the caller passes a pid
+which does not refer to a container's init task, then sys_checkpoint()
+would return -EINVAL.
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases, if
+there are other tasks possibly sharing state with the container, they
+must not modify it during the operation. It is the responsibility of
+the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g. not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+User tools
+==========
+
+* checkpoint(1): a tool to perform a checkpoint of a container/subtree
+* restart(1): a tool to restart a container/subtree
+* ckptinfo: a tool to examine a checkpoint image
+
+It is best to use the dedicated user tools for checkpoint and restart.
+
+If you insist, then here is a code snippet that illustrates how a
+checkpoint is initiated by a process inside a container - the logic is
+similar to fork():
+	...
+	ckptid = checkpoint(0, ...);
+	switch (ckptid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CKPTID=%d\n", ckptid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(pid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+Note that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships of
+the task with other tasks, or any shared resources. It is useful for
+applications that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+
+You may find the following sample programs useful:
+
+* checkpoint.c: accepts a 'pid' argument and checkpoints that task to stdout
+* checkpoint_self.c: a simple test program doing self-checkpoint
+* restart_self.c: restarts a (self-) checkpoint image from stdin
+* test.c: a simple test program to checkpoint
+
+
+"External" checkpoint
+=====================
+
+To do "external" checkpoint, you need to first freeze that other task
+using the freezer cgroup.
+
+Restart does not preserve the original PID yet (because we haven't yet
+solved the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new namespace, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+	$ ./test &
+	[1] 3493
+
+	$ echo 3493 > /cgroup/0/tasks
+	$ echo FROZEN > /cgroup/0/freezer.state
+	$ ./checkpoint 3493 > ckpt.image
+
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ echo THAWED > /cgroup/0/freezer.state
+
+	$ ./restart_self < ckpt.image
+Now compare the output of the two output files.
+
+
+"Self" checkpoint
+=================
+
+To do "self" checkpoint, you can incorporate the code from checkpoint_self.c into
+your application.
+
+Here is how to test the "self" checkpoint:
+	$ ./checkpoint_self > self.image &
+	[1] 3512
+
+	$ sleep 3
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ cat /tmp/cr-test.out
+	hello, world (85.46)!
+	count 0 (85.46)!
+	count 1 (85.56)!
+	count 2 (85.76)!
+	count 3 (86.46)!
+
+	$ sed -i 's/count/xxxx/g' /tmp/cr-test.out
+
+	$ ./restart_self < self.image &
+Now compare the output of the two output files.
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "rstr" changed its name
+to "test" or "self", as expected.
-- 
1.6.0.4


* [PATCH v18 19/80] c/r: documentation
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:50   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan, Dave Hansen

Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.

Changelog[v16]:
  - Update documentation
  - Unify into readme.txt and usage.txt
Changelog[v14]:
  - Discard the 'h.parent' field
  - New image format (shared objects appear before they are referenced
    unless they are compound)
Changelog[v8]:
  - Split into multiple files in Documentation/checkpoint/...
  - Extend documentation, fix typos and comments from feedback

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 Documentation/checkpoint/checkpoint.c      |   32 +++
 Documentation/checkpoint/checkpoint_self.c |   57 +++++
 Documentation/checkpoint/readme.txt        |  360 ++++++++++++++++++++++++++++
 Documentation/checkpoint/restart_self.c    |   20 ++
 Documentation/checkpoint/test.c            |   48 ++++
 Documentation/checkpoint/usage.txt         |  229 ++++++++++++++++++
 6 files changed, 746 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint/checkpoint.c
 create mode 100644 Documentation/checkpoint/checkpoint_self.c
 create mode 100644 Documentation/checkpoint/readme.txt
 create mode 100644 Documentation/checkpoint/restart_self.c
 create mode 100644 Documentation/checkpoint/test.c
 create mode 100644 Documentation/checkpoint/usage.txt

diff --git a/Documentation/checkpoint/checkpoint.c b/Documentation/checkpoint/checkpoint.c
new file mode 100644
index 0000000..094408c
--- /dev/null
+++ b/Documentation/checkpoint/checkpoint.c
@@ -0,0 +1,32 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int ret;
+
+	if (argc != 2) {
+		fprintf(stderr, "usage: ckpt PID\n");
+		exit(1);
+	}
+
+	pid = atoi(argv[1]);
+	if (pid <= 0) {
+		fprintf(stderr, "invalid pid\n");
+		exit(1);
+	}
+
+	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+
+	if (ret < 0)
+		perror("checkpoint");
+	else	/* stdout carries the image, so report on stderr */
+		fprintf(stderr, "checkpoint id %d\n", ret);
+
+	return (ret > 0 ? 0 : 1);
+}
+
diff --git a/Documentation/checkpoint/checkpoint_self.c b/Documentation/checkpoint/checkpoint_self.c
new file mode 100644
index 0000000..febb888
--- /dev/null
+++ b/Documentation/checkpoint/checkpoint_self.c
@@ -0,0 +1,57 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	FILE *file;
+	int i, ret;
+	float a;
+
+	close(0);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+
+		if (i == 2) {
+			ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+			if (ret < 0) {
+				fprintf(file, "ckpt: %s\n", strerror(errno));
+				exit(2);
+			}
+			fprintf(file, "checkpoint ret: %d\n", ret);
+			fflush(file);
+		}
+	}
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
new file mode 100644
index 0000000..571c469
--- /dev/null
+++ b/Documentation/checkpoint/readme.txt
@@ -0,0 +1,360 @@
+
+	      Checkpoint-Restart support in the Linux kernel
+	==========================================================
+
+Copyright (C) 2008-2009 Oren Laadan
+
+Author:		Oren Laadan <orenl@cs.columbia.edu>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+
+Contributors:	Oren Laadan <orenl@cs.columbia.edu>
+		Serge Hallyn <serue@us.ibm.com>
+		Dan Smith <danms@us.ibm.com>
+		Matt Helsley <matthltc@us.ibm.com>
+		Nathan Lynch <ntl@pobox.com>
+		Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
+		Dave Hansen <dave@linux.vnet.ibm.com>
+
+
+Introduction
+============
+
+Application checkpoint/restart [C/R] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. C/R can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+Compared to hypervisor approaches, application C/R is more lightweight
+since it need only save the state associated with applications, while
+operating system data structures (e.g. buffer cache, driver state
+and the like) are uninteresting.
+
+
+Overall design
+==============
+
+Checkpoint and restart are done in the kernel as much as possible.
+Two new system calls are introduced to provide C/R: sys_checkpoint()
+and sys_restart(). They both operate on a process tree (hierarchy),
+either a whole container or a subtree of a container.
+
+Checkpointing entire containers ensures that there are no dependencies
+on anything outside the container, which guarantees that a matching
+restart will succeed (assuming that the file system state remains
+consistent). However, it requires that users will always run the tasks
+that they wish to checkpoint inside containers. This is ideal for,
+e.g., private virtual servers and the like.
+
+In contrast, when checkpointing a subtree of a container it is up to
+the user to ensure that dependencies either don't exist or can be
+safely ignored. This is useful, for instance, for HPC scenarios or
+even a user that would like to periodically checkpoint a long-running
+batch job.
+
+An additional system call, a la madvise(), is planned, so that tasks
+can advise the kernel how to handle specific resources. For instance,
+a task could ask to skip a memory area at checkpoint to save space,
+or to use a preset file descriptor at restart instead of restoring it
+from the checkpoint image. It will provide the flexibility that is
+particularly useful to address the needs of a diverse crowd of users
+and use-cases.
+
+Syscall sys_checkpoint() is given a pid that indicates the top of the
+hierarchy, a file descriptor to store the image, and flags. The code
+serializes internal user- and kernel-state and writes it out to the
+file descriptor. The resulting image is stream-able. The processes are
+expected to be frozen for the duration of the checkpoint.
+
+In general, a checkpoint consists of 5 steps:
+1. Pre-dump
+2. Freeze the container/subtree
+3. Save tasks' and kernel state		<-- sys_checkpoint()
+4. Thaw (or kill) the container/subtree
+5. Post-dump
+
+Step 3 is done by calling sys_checkpoint(). Steps 1 and 5 are an
+optimization to reduce application downtime. In particular, "pre-dump"
+works before freezing the container, e.g. the pre-copy for live
+migration, and "post-dump" works after the container resumes
+execution, e.g. write-back the data to secondary storage.
+
+The kernel exports a relatively opaque 'blob' of data to userspace
+which can then be handed to the new kernel at restart time.  The
+'blob' contains data and state of select portions of kernel structures
+such as VMAs and mm_structs, as well as copies of the actual memory
+that the tasks use. Any changes in this blob's format between kernel
+revisions can be handled by an in-userspace conversion program.
+
+To restart, userspace first creates a process hierarchy that matches
+that of the checkpoint, and each task calls sys_restart(). The syscall
+reads the saved kernel state from a file descriptor, and re-creates
+the resources that the tasks need to resume execution. The restart
+code is executed by each task that is restored in the new hierarchy to
+reconstruct its own state.
+
+In general, a restart consists of 3 steps:
+1. Create hierarchy
+2. Restore tasks' and kernel state	<-- sys_restart()
+3. Resume userspace (or freeze tasks)
+
+Because the process hierarchy is created in userspace during restart,
+the restarting tasks have the flexibility to prepare before calling
+sys_restart().
+
+
+Checkpoint image format
+=======================
+
+The checkpoint image format is built of records that consist of a
+pre-header identifying its contents, followed by a payload. This
+format allows userspace tools to easily parse and skip through the
+image without requiring intimate knowledge of the data. It will also
+be handy to enable parallel checkpointing in the future where multiple
+threads interleave data from multiple processes into a single stream.
+
+The pre-header is defined by 'struct ckpt_hdr' as follows: @type
+identifies the type of the payload, @len tells its length in bytes
+including the pre-header.
+
+struct ckpt_hdr {
+	__s32 type;
+	__s32 len;
+};
+
+The pre-header must be the first component in all other headers. For
+instance, the task data is saved in 'struct ckpt_hdr_task', which
+looks something like this:
+
+struct ckpt_hdr_task {
+	struct ckpt_hdr h;
+	__u32 pid;
+	...
+};
+
+THE IMAGE FORMAT IS EXPECTED TO CHANGE over time as more features are
+supported, or as existing features change in the kernel and require
+adjustments to their representation. Any such changes will be handled by
+in-userspace conversion tools.
+
+The general format of the checkpoint image is as follows:
+1. Image header
+2. Task hierarchy
+3. Tasks' state
+4. Image trailer
+
+The image always begins with a general header that holds a magic
+number, an architecture identifier (little endian format), a format
+version number (@rev), followed by information about the kernel
+(currently version and UTS data). It also holds the time of the
+checkpoint and the flags given to sys_checkpoint(). This header is
+followed by an arch-specific header.
+
+The task hierarchy comes next so that userspace tools can read it
+early (even from a stream) and re-create the restarting tasks. This is
+basically an array of all checkpointed tasks, and their relationships
+(parent, siblings, threads, etc).
+
+Then the state of all tasks is saved, in the order that they appear in
+the tasks array above. For each state, we save data like task_struct,
+namespaces, open files, memory layout, memory contents, cpu state,
+signals and signal handlers, etc. For resources that are shared among
+multiple processes, we first checkpoint said resource (and only once),
+and in the task data we give a reference to it. More about shared
+resources below.
+
+Finally, the image always ends with a trailer that holds a (different)
+magic number, serving for sanity check.
+
+
+Shared objects
+==============
+
+Many resources may be shared by multiple tasks (e.g. file descriptors,
+memory address space, etc), or even have multiple references from
+other resources (e.g. a single inode that represents two ends of a
+pipe).
+
+Shared objects are tracked using a hash table (objhash) to ensure that
+they are only checkpointed or restored once. To handle a shared
+object, it is first looked up in the hash table, to determine if it is
+the first encounter or a recurring appearance.  The hash table itself
+is not saved as part of the checkpoint image: it is constructed
+dynamically during both checkpoint and restart, and discarded at the
+end of the operation.
+
+During checkpoint, when a shared object is encountered for the first
+time, it is inserted to the hash table, indexed by its kernel address.
+It is assigned an identifier (@objref) in order of appearance, and
+then its state is saved. Subsequent lookups of that object in the hash
+will yield that entry, in which case only the @objref is saved, as
+opposed to the entire state of the object.
+
+During restart, shared objects are indexed by their @objref as given
+during the checkpoint. On the first appearance of each shared object,
+a new resource will be created and its state restored from the image.
+Then the object is added to the hash table. Subsequent lookups of the
+same unique identifier in the hash table will yield that entry, and
+then the existing object instance is reused instead of creating
+a new one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+Shared objects are thus saved when they are first seen, and _before_
+the parent object that uses them. Therefore by the time the parent
+object needs them, they should already be in the objhash. The one
+exception is when more than a single shared resource will be restarted
+at once (e.g. the two ends of a pipe, or all the namespaces in an
+nsproxy). In this case the parent object is dumped first followed by
+the individual sub-resources.
+
+The checkpoint image is stream-able, meaning that restarting from it
+may not require lseek(). This is enforced at checkpoint time, by
+carefully selecting the order of shared objects, to respect the rule
+that an object is always saved before the objects that refer to it.
+
+
+Memory contents format
+======================
+
+The memory contents of a given memory address space (->mm) are dumped
+as a sequence of vma objects, represented by 'struct ckpt_hdr_vma'.
+This header details the vma properties, and a reference to a file
+(if file backed) or an inode (or shared memory) object.
+
+The vma header is followed by the actual contents - but only those
+pages that need to be saved, i.e. dirty pages. They are written in
+chunks of data, where each chunk contains a header that indicates
+the number of pages in the chunk, followed by an array of virtual
+addresses and then an array of actual page contents. The last chunk
+holds zero pages.
+
+To illustrate this, consider a single simple task with two vmas: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The memory dump will look like this:
+
+	ckpt_hdr + ckpt_hdr_vma
+		ckpt_hdr_pgarr (nr_pages = 2)
+			addr1, addr2
+			page1, page2
+		ckpt_hdr_pgarr (nr_pages = 0)
+	ckpt_hdr + ckpt_hdr_vma
+		ckpt_hdr_pgarr (nr_pages = 3)
+			addr3, addr4, addr5
+			page3, page4, page5
+		ckpt_hdr_pgarr (nr_pages = 0)
+
+
+Error handling
+==============
+
+Both checkpoint and restart operations may fail due to a variety of
+reasons. Using a simple, single return value from the system call is
+insufficient to report the reason of a failure.
+
+Checkpoint - to provide informative status report upon failure, the
+checkpoint image may contain one (or more) error objects, 'struct
+ckpt_hdr_err'.  An error object consists of a mandatory pre-header
+followed by a null character ('\0'), and then a string that describes
+the error. By default, if an error occurs, this will be the last
+object written to the checkpoint image.
+
+Upon failure, the caller can examine the image (e.g. with 'ckptinfo')
+and extract the detailed error message. The leading '\0' is useful if
+one wants to seek back from the end of the checkpoint image, instead
+of parsing the entire image separately.
+
+Restart - to be defined.
+
+
+Security
+========
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it. Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+access mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image.  As credentials are restored too,
+the ability of a task that calls sys_restart() to setresuid/setresgid
+to those values must be checked.
+
+Keeping the restart procedure to operate within the limits of the
+caller's credentials means that there are various scenarios that cannot
+be supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+However, this can be controlled with a sysctl-variable.
+
+
+Kernel interfaces
+=================
+
+* To checkpoint a vma, the 'struct vm_operations_struct' needs to
+  provide a method ->checkpoint:
+    int checkpoint(struct ckpt_ctx *, struct vm_area_struct *)
+  Restart requires a matching (exported) restore:
+    int restore(struct ckpt_ctx *, struct mm_struct *, struct ckpt_hdr_vma *)
+
+* To checkpoint a file, the 'struct file_operations' needs to provide
+  the methods ->checkpoint and ->collect:
+    int checkpoint(struct ckpt_ctx *, struct file *)
+    int collect(struct ckpt_ctx *, struct file *)
+  Restart requires a matching (exported) restore:
+    int restore(struct ckpt_ctx *, struct ckpt_hdr_file *)
+  For most file systems, generic_file_{checkpoint,restore}() can be
+  used.
+
+* To checkpoint a socket, the 'struct proto_ops' needs to provide
+  the methods ->checkpoint, ->collect and ->restore:
+    int checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
+    int collect(struct ckpt_ctx *ctx, struct socket *sock);
+    int restore(struct ckpt_ctx *, struct socket *sock, struct ckpt_hdr_socket *h)
+
diff --git a/Documentation/checkpoint/restart_self.c b/Documentation/checkpoint/restart_self.c
new file mode 100644
index 0000000..288209d
--- /dev/null
+++ b/Documentation/checkpoint/restart_self.c
@@ -0,0 +1,20 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+	if (ret < 0)
+		perror("restart");
+
+	fprintf(stderr, "should not reach here !\n");
+
+	return 1;
+}
+
diff --git a/Documentation/checkpoint/test.c b/Documentation/checkpoint/test.c
new file mode 100644
index 0000000..1183655
--- /dev/null
+++ b/Documentation/checkpoint/test.c
@@ -0,0 +1,48 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	FILE *file;
+	float a;
+	int i;
+
+	close(0);
+	close(1);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+	}
+
+	fprintf(file, "world, hello (%.2f) !\n", a);
+	fflush(file);
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..4c75441
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,229 @@
+
+	      How to use Checkpoint-Restart
+	=========================================
+
+
+API
+===
+
+The API consists of three new system calls:
+
+* long checkpoint(pid_t pid, int fd, unsigned long flags);
+
+ Checkpoint a (sub-)container whose root task is identified by @pid,
+ to the open file indicated by @fd. @flags may be one or more of:
+   - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
+ (other values are not allowed).
+
+ Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
+ it returns from a restart, and -1 if an error occurs. The ckptid will
+ uniquely identify a checkpoint image, for as long as the checkpoint
+ is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
+ partial checkpoint, residing in kernel memory).
+
+* long sys_restart(pid_t pid, int fd, unsigned long flags);
+
+ Restart a process hierarchy from a checkpoint image that is read from
+ the blob stored in the file indicated by @fd. The @flags will have
+ future meaning (must be 0 for now). @pid indicates the root of the
+ hierarchy as seen in the coordinator's pid-namespace, and is expected
+ to be a child of the coordinator. @flags may be one or more of:
+   - RESTART_TASKSELF : (self) restart of a single process
+   - RESTART_FROZEN : processes remain frozen once restart completes
+ (Note that, with some future @flags, @pid may instead carry a 'ckptid'
+ identifying an in-kernel checkpoint image).
+
+ Returns: -1 if an error occurs, 0 on success when restarting from a
+ "self" checkpoint, and return value of system call at the time of the
+ checkpoint when restarting from an "external" checkpoint.
+
+ (If a process was frozen for checkpoint while in userspace, it will
+ resume running in userspace exactly where it was interrupted. If it
+ was frozen while in kernel doing a syscall, it will return what the
+ syscall returned when interrupted/completed, and proceed from there
+ as if it had only been frozen and then thawed. Finally, if it did a
+ self-checkpoint, it will resume to the first instruction after the
+ call to checkpoint(2), having returned 0, to indicate whether the
+ return is from the checkpoint or a restart).
+
+* int clone_with_pids(unsigned long clone_flags, void *news,
+		      int *parent_tidptr, int *child_tidptr,
+		      struct target_pid_set *pid_set);
+
+  struct target_pid_set {
+	 int num_pids;
+	 pid_t *target_pids;
+  };
+
+ Container restart requires that a task have the same pid it had when
+ it was checkpointed. When containers are nested the tasks within the
+ containers exist in multiple pid namespaces and hence have multiple
+ pids to specify during restart.
+
+ clone_with_pids(), intended for use during restart, is similar to
+ clone(), except that it takes a 'target_pid_set' parameter. This
+ parameter lets the caller choose specific pid numbers for the child
+ process, in the process's active and ancestor pid namespaces.
+
+ Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for
+ now, to prevent unprivileged processes from misusing this interface.
+
+ If a target-pid is 0, the kernel continues to assign a pid for the
+ process in that namespace. If a requested pid is taken, the system
+ call fails with -EBUSY. If 'pid_set.num_pids' exceeds the current
+ nesting level of pid namespaces, the system call fails with -EINVAL.
+
+
+Sysctl/proc
+===========
+
+/proc/sys/kernel/ckpt_unpriv_allowed		[default = 1]
+  controls whether c/r operation is allowed for unprivileged users
+
+
+Operation
+=========
+
+The granularity of a checkpoint usually is a process hierarchy. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 in the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+Unless the CHECKPOINT_SUBTREE flag is set, if the caller passes a pid
+which does not refer to a container's init task, then sys_checkpoint()
+would return -EINVAL.
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases, if
+there are other tasks possibly sharing state with the container, they
+must not modify it during the operation. It is the responsibility of
+the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g. not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+User tools
+==========
+
+* checkpoint(1): a tool to perform a checkpoint of a container/subtree
+* restart(1): a tool to restart a container/subtree
+* ckptinfo: a tool to examine a checkpoint image
+
+It is best to use the dedicated user tools for checkpoint and restart.
+
+If you insist, then here is a code snippet that illustrates how a
+checkpoint is initiated by a process inside a container - the logic is
+similar to fork():
+	...
+	ckptid = checkpoint(0, ...);
+	switch (ckptid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CKPTID=%d\n", ckptid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(pid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+Note that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships of
+the task with other tasks, or any shared resources. It is useful for
+applications that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+
+You may find the following sample programs useful:
+
+* checkpoint.c: accepts a 'pid' argument and checkpoints that task to stdout
+* checkpoint_self.c: a simple test program doing self-checkpoint
+* restart_self.c: restarts a (self-) checkpoint image from stdin
+* test.c: a simple test program to checkpoint
+
+
+"External" checkpoint
+=====================
+
+To do "external" checkpoint, you need to first freeze that other task
+using the freezer cgroup.
+
+Restart does not preserve the original PID yet (because we haven't
+yet solved the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new namespace, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+	$ ./test &
+	[1] 3493
+
+	$ echo 3493 > /cgroup/0/tasks
+	$ echo FROZEN > /cgroup/0/freezer.state
+	$ ./checkpoint 3493 > ckpt.image
+
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ echo THAWED > /cgroup/0/freezer.state
+
+	$ ./restart_self < ckpt.image
+Now compare the output of the two output files.
+
+
+"Self" checkpoint
+=================
+
+To do "self" checkpoint, you can incorporate the code from ckpt.c into
+your application.
+
+Here is how to test the "self" checkpoint:
+	$ ./checkpoint_self > self.image &
+	[1] 3512
+
+	$ sleep 3
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ cat /tmp/cr-test.out
+	hello, world (85.46)!
+	count 0 (85.46)!
+	count 1 (85.56)!
+	count 2 (85.76)!
+	count 3 (86.46)!
+
+	$ sed -i 's/count/xxxx/g' /tmp/cr-test.out
+
+	$ ./restart_self < self.image &
+Now compare the output of the two output files.
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "rstr" changed its name
+to "test" or "self", as expected.
-- 
1.6.0.4
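
For illustration only, the freeze and thaw steps of the "External" checkpoint walk-through in usage.txt above (the two 'echo ... > freezer.state' commands) could be done from C along these lines. This is a minimal sketch: set_freezer_state() is a hypothetical helper, not part of the patch set, and the /cgroup/0 mount point in the usage note is simply the one used in that walk-through.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write 'state' ("FROZEN" or "THAWED") to a
 * freezer.state file. Returns 0 on success, -1 on failure.
 */
static int set_freezer_state(const char *path, const char *state)
{
	FILE *f;
	int ok;

	assert(path != NULL && state != NULL);

	/* only the two states used in the walk-through are accepted */
	if (strcmp(state, "FROZEN") != 0 && strcmp(state, "THAWED") != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(state, f) >= 0;
	/* fclose() flushes, so a rejected write is reported here */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would freeze with set_freezer_state("/cgroup/0/freezer.state", "FROZEN"), run the checkpoint, and then thaw with "THAWED". The sketch does not wait for the cgroup to actually reach the frozen state.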


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 19/80] c/r: documentation
@ 2009-09-23 23:50   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan, Dave Hansen

Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.

Changelog[v16]:
  - Update documentation
  - Unify into readme.txt and usage.txt
Changelog[v14]:
  - Discard the 'h.parent' field
  - New image format (shared objects appear before they are referenced
    unless they are compound)
Changelog[v8]:
  - Split into multiple files in Documentation/checkpoint/...
  - Extend documentation, fix typos and comments from feedback

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
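
For illustration only: the record layout described under "Checkpoint image format" in readme.txt (a 'struct ckpt_hdr' pre-header whose @len covers the whole record) is what lets a tool skip through an image without decoding payloads. The following is a minimal sketch, assuming the image is already in memory and in host byte order. count_records() is a hypothetical helper, and int32_t stands in for __s32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors 'struct ckpt_hdr' from readme.txt (int32_t in place of __s32). */
struct ckpt_hdr {
	int32_t type;
	int32_t len;	/* record length in bytes, pre-header included */
};

/*
 * Hypothetical helper: count the complete records in buf[0..size).
 * Returns the number of records, or -1 if a record is malformed
 * (len smaller than the pre-header, or running past the buffer end).
 */
static int count_records(const unsigned char *buf, size_t size)
{
	size_t off = 0;
	int n = 0;

	assert(buf != NULL || size == 0);

	while (off < size) {
		struct ckpt_hdr h;

		if (size - off < sizeof(h))
			return -1;
		memcpy(&h, buf + off, sizeof(h));	/* avoid unaligned reads */
		if (h.len < (int32_t)sizeof(h) || (size_t)h.len > size - off)
			return -1;
		off += (size_t)h.len;	/* skip the payload undecoded */
		n++;
	}
	return n;
}
```

A real parser would dispatch on h.type rather than only counting, and would read from a stream instead of a buffer. The bounds checks are the part worth keeping.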
 Documentation/checkpoint/checkpoint.c      |   32 +++
 Documentation/checkpoint/checkpoint_self.c |   57 +++++
 Documentation/checkpoint/readme.txt        |  360 ++++++++++++++++++++++++++++
 Documentation/checkpoint/restart_self.c    |   20 ++
 Documentation/checkpoint/test.c            |   48 ++++
 Documentation/checkpoint/usage.txt         |  229 ++++++++++++++++++
 6 files changed, 746 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint/checkpoint.c
 create mode 100644 Documentation/checkpoint/checkpoint_self.c
 create mode 100644 Documentation/checkpoint/readme.txt
 create mode 100644 Documentation/checkpoint/restart_self.c
 create mode 100644 Documentation/checkpoint/test.c
 create mode 100644 Documentation/checkpoint/usage.txt

diff --git a/Documentation/checkpoint/checkpoint.c b/Documentation/checkpoint/checkpoint.c
new file mode 100644
index 0000000..094408c
--- /dev/null
+++ b/Documentation/checkpoint/checkpoint.c
@@ -0,0 +1,32 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int ret;
+
+	if (argc != 2) {
+		fprintf(stderr, "usage: ckpt PID\n");
+		exit(1);
+	}
+
+	pid = atoi(argv[1]);
+	if (pid <= 0) {
+		fprintf(stderr, "invalid pid\n");
+		exit(1);
+	}
+
+	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+
+	if (ret < 0)
+		perror("checkpoint");
+	else
+		fprintf(stderr, "checkpoint id %d\n", ret);
+
+	return (ret > 0 ? 0 : 1);
+}
+
diff --git a/Documentation/checkpoint/checkpoint_self.c b/Documentation/checkpoint/checkpoint_self.c
new file mode 100644
index 0000000..febb888
--- /dev/null
+++ b/Documentation/checkpoint/checkpoint_self.c
@@ -0,0 +1,57 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	FILE *file;
+	int i, ret;
+	float a;
+
+	close(0);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+
+		if (i == 2) {
+			ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+			if (ret < 0) {
+				fprintf(file, "ckpt: %s\n", strerror(errno));
+				exit(2);
+			}
+			fprintf(file, "checkpoint ret: %d\n", ret);
+			fflush(file);
+		}
+	}
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
new file mode 100644
index 0000000..571c469
--- /dev/null
+++ b/Documentation/checkpoint/readme.txt
@@ -0,0 +1,360 @@
+
+	      Checkpoint-Restart support in the Linux kernel
+	==========================================================
+
+Copyright (C) 2008-2009 Oren Laadan
+
+Author:		Oren Laadan <orenl@cs.columbia.edu>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+
+Contributors:	Oren Laadan <orenl@cs.columbia.edu>
+		Serge Hallyn <serue@us.ibm.com>
+		Dan Smith <danms@us.ibm.com>
+		Matt Helsley <matthltc@us.ibm.com>
+		Nathan Lynch <ntl@pobox.com>
+		Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
+		Dave Hansen <dave@linux.vnet.ibm.com>
+
+
+Introduction
+============
+
+Application checkpoint/restart [C/R] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. C/R can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+Compared to hypervisor approaches, application C/R is more lightweight
+since it need only save the state associated with applications, while
+operating system data structures (e.g. buffer cache, drivers state
+and the like) are uninteresting.
+
+
+Overall design
+==============
+
+Checkpoint and restart are done in the kernel as much as possible.
+Two new system calls are introduced to provide C/R: sys_checkpoint()
+and sys_restart(). They both operate on a process tree (hierarchy),
+either a whole container or a subtree of a container.
+
+Checkpointing entire containers ensures that there are no dependencies
+on anything outside the container, which guarantees that a matching
+restart will succeed (assuming that the file system state remains
+consistent). However, it requires that users will always run the tasks
+that they wish to checkpoint inside containers. This is ideal for,
+e.g., private virtual servers and the like.
+
+In contrast, when checkpointing a subtree of a container it is up to
+the user to ensure that dependencies either don't exist or can be
+safely ignored. This is useful, for instance, for HPC scenarios or
+even a user that would like to periodically checkpoint a long-running
+batch job.
+
+An additional system call, a la madvise(), is planned, so that tasks
+can advise the kernel how to handle specific resources. For instance,
+a task could ask to skip a memory area at checkpoint to save space,
+or to use a preset file descriptor at restart instead of restoring it
+from the checkpoint image. It will provide the flexibility that is
+particularly useful to address the needs of a diverse crowd of users
+and use-cases.
+
+Syscall sys_checkpoint() is given a pid that indicates the top of the
+hierarchy, a file descriptor to store the image, and flags. The code
+serializes internal user- and kernel-state and writes it out to the
+file descriptor. The resulting image is stream-able. The processes are
+expected to be frozen for the duration of the checkpoint.
+
+In general, a checkpoint consists of 5 steps:
+1. Pre-dump
+2. Freeze the container/subtree
+3. Save tasks' and kernel state		<-- sys_checkpoint()
+4. Thaw (or kill) the container/subtree
+5. Post-dump
+
+Step 3 is done by calling sys_checkpoint(). Steps 1 and 5 are an
+optimization to reduce application downtime. In particular, "pre-dump"
+works before freezing the container, e.g. the pre-copy for live
+migration, and "post-dump" works after the container resumes
+execution, e.g. write-back the data to secondary storage.
+
+The kernel exports a relatively opaque 'blob' of data to userspace
+which can then be handed to the new kernel at restart time.  The
+'blob' contains data and state of select portions of kernel structures
+such as VMAs and mm_structs, as well as copies of the actual memory
+that the tasks use. Any changes in this blob's format between kernel
+revisions can be handled by an in-userspace conversion program.
+
+To restart, userspace first creates a process hierarchy that matches
+that of the checkpoint, and each task calls sys_restart(). The syscall
+reads the saved kernel state from a file descriptor, and re-creates
+the resources that the tasks need to resume execution. The restart
+code is executed by each task that is restored in the new hierarchy to
+reconstruct its own state.
+
+In general, a restart consists of 3 steps:
+1. Create hierarchy
+2. Restore tasks' and kernel state	<-- sys_restart()
+3. Resume userspace (or freeze tasks)
+
+Because the process hierarchy is created in userspace during restart,
+the restarting tasks have the flexibility to prepare before calling
+sys_restart().
+
+
+Checkpoint image format
+=======================
+
+The checkpoint image format is built of records that consist of a
+pre-header identifying its contents, followed by a payload. This
+format allows userspace tools to easily parse and skip through the
+image without requiring intimate knowledge of the data. It will also
+be handy to enable parallel checkpointing in the future where multiple
+threads interleave data from multiple processes into a single stream.
+
+The pre-header is defined by 'struct ckpt_hdr' as follows: @type
+identifies the type of the payload, @len tells its length in bytes
+including the pre-header.
+
+struct ckpt_hdr {
+	__s32 type;
+	__s32 len;
+};
+
+The pre-header must be the first component in all other headers. For
+instance, the task data is saved in 'struct ckpt_hdr_task', which
+looks something like this:
+
+struct ckpt_hdr_task {
+	struct ckpt_hdr h;
+	__u32 pid;
+	...
+};
+
+THE IMAGE FORMAT IS EXPECTED TO CHANGE over time as more features are
+supported, or as existing features change in the kernel and require
+their representation to be adjusted. Any such changes will be handled by
+in-userspace conversion tools.
+
+The general format of the checkpoint image is as follows:
+1. Image header
+2. Task hierarchy
+3. Tasks' state
+4. Image trailer
+
+The image always begins with a general header that holds a magic
+number, an architecture identifier (little endian format), a format
+version number (@rev), followed by information about the kernel
+(currently version and UTS data). It also holds the time of the
+checkpoint and the flags given to sys_checkpoint(). This header is
+followed by an arch-specific header.
+
+The task hierarchy comes next so that userspace tools can read it
+early (even from a stream) and re-create the restarting tasks. This is
+basically an array of all checkpointed tasks, and their relationships
+(parent, siblings, threads, etc).
+
+Then the state of all tasks is saved, in the order that they appear in
+the tasks array above. For each state, we save data like task_struct,
+namespaces, open files, memory layout, memory contents, cpu state,
+signals and signal handlers, etc. For resources that are shared among
+multiple processes, we first checkpoint said resource (and only once),
+and in the task data we give a reference to it. More about shared
+resources below.
+
+Finally, the image always ends with a trailer that holds a (different)
+magic number, serving for sanity check.
+
+
+Shared objects
+==============
+
+Many resources may be shared by multiple tasks (e.g. file descriptors,
+memory address space, etc), or even have multiple references from
+other resources (e.g. a single inode that represents two ends of a
+pipe).
+
+Shared objects are tracked using a hash table (objhash) to ensure that
+they are only checkpointed or restored once. To handle a shared
+object, it is first looked up in the hash table, to determine if it is
+the first encounter or a recurring appearance.  The hash table itself
+is not saved as part of the checkpoint image: it is constructed
+dynamically during both checkpoint and restart, and discarded at the
+end of the operation.
+
+During checkpoint, when a shared object is encountered for the first
+time, it is inserted to the hash table, indexed by its kernel address.
+It is assigned an identifier (@objref) in order of appearance, and
+then its state is saved. Subsequent lookups of that object in the hash
+will yield that entry, in which case only the @objref is saved, as
+opposed to the entire state of the object.
+
+During restart, shared objects are indexed by their @objref as given
+during the checkpoint. On the first appearance of each shared object,
+a new resource will be created and its state restored from the image.
+Then the object is added to the hash table. Subsequent lookups of the
+same unique identifier in the hash table will yield that entry, and
+then the existing object instance is reused instead of creating
+a new one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+Shared objects are thus saved when they are first seen, and _before_
+the parent object that uses them. Therefore by the time the parent
+object needs them, they should already be in the objhash. The one
+exception is when more than a single shared resource will be restarted
+at once (e.g. the two ends of a pipe, or all the namespaces in an
+nsproxy). In this case the parent object is dumped first, followed by
+the individual sub-resources.
+
+The checkpoint image is stream-able, meaning that restarting from it
+may not require lseek(). This is enforced at checkpoint time, by
+carefully selecting the order of shared objects, to respect the rule
+that an object is always saved before the objects that refer to it.
+
+
+Memory contents format
+======================
+
+The memory contents of a given memory address space (->mm) is dumped
+as a sequence of vma objects, represented by 'struct ckpt_hdr_vma'.
+This header details the vma properties, and a reference to a file
+(if file backed) or an inode (or shared memory) object.
+
+The vma header is followed by the actual contents - but only those
+pages that need to be saved, i.e. dirty pages. They are written in
+chunks of data, where each chunk contains a header that indicates
+the number of pages in the chunk, followed by an array of virtual
+addresses and then an array of actual page contents. The last chunk
+holds zero pages.
+
+To illustrate this, consider a single simple task with two vmas: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The memory dump will look like this:
+
+	ckpt_hdr + ckpt_hdr_vma
+		ckpt_hdr_pgarr (nr_pages = 2)
+			addr1, addr2
+			page1, page2
+		ckpt_hdr_pgarr (nr_pages = 0)
+	ckpt_hdr + ckpt_hdr_vma
+		ckpt_hdr_pgarr (nr_pages = 3)
+			addr3, addr4, addr5
+			page3, page4, page5
+		ckpt_hdr_pgarr (nr_pages = 0)
+
+
+Error handling
+==============
+
+Both checkpoint and restart operations may fail due to a variety of
+reasons. Using a simple, single return value from the system call is
+insufficient to report the reason of a failure.
+
+Checkpoint - to provide informative status report upon failure, the
+checkpoint image may contain one (or more) error objects, 'struct
+ckpt_hdr_err'.  An error object consists of a mandatory pre-header
+followed by a null character ('\0'), and then a string that describes
+the error. By default, if an error occurs, this will be the last
+object written to the checkpoint image.
+
+Upon failure, the caller can examine the image (e.g. with 'ckptinfo')
+and extract the detailed error message. The leading '\0' is useful if
+one wants to seek back from the end of the checkpoint image, instead
+of parsing the entire image separately.
+
+Restart - to be defined.
+
+
+Security
+========
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it. Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+access mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image.  As credentials are restored too,
+the ability of a task that calls sys_restart() to setresuid/setresgid
+to those values must be checked.
+
+Keeping the restart procedure to operate within the limits of the
+caller's credentials means that there are various scenarios that cannot
+be supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+However, this can be controlled with a sysctl-variable.
+
+
+Kernel interfaces
+=================
+
+* To checkpoint a vma, the 'struct vm_operations_struct' needs to
+  provide a method ->checkpoint:
+    int checkpoint(struct ckpt_ctx *, struct vm_area_struct *)
+  Restart requires a matching (exported) restore:
+    int restore(struct ckpt_ctx *, struct mm_struct *, struct ckpt_hdr_vma *)
+
+* To checkpoint a file, the 'struct file_operations' needs to provide
+  the methods ->checkpoint and ->collect:
+    int checkpoint(struct ckpt_ctx *, struct file *)
+    int collect(struct ckpt_ctx *, struct file *)
+  Restart requires a matching (exported) restore:
+    int restore(struct ckpt_ctx *, struct ckpt_hdr_file *)
+  For most file systems, generic_file_{checkpoint,restore}() can be
+  used.
+
+* To checkpoint a socket, the 'struct proto_ops' needs to provide
+  the methods ->checkpoint, ->collect and ->restore:
+    int checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
+    int collect(struct ckpt_ctx *ctx, struct socket *sock);
+    int restore(struct ckpt_ctx *, struct socket *sock, struct ckpt_hdr_socket *h)
+
diff --git a/Documentation/checkpoint/restart_self.c b/Documentation/checkpoint/restart_self.c
new file mode 100644
index 0000000..288209d
--- /dev/null
+++ b/Documentation/checkpoint/restart_self.c
@@ -0,0 +1,20 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+	if (ret < 0)
+		perror("restart");
+
+	printf("should not reach here !\n");
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/test.c b/Documentation/checkpoint/test.c
new file mode 100644
index 0000000..1183655
--- /dev/null
+++ b/Documentation/checkpoint/test.c
@@ -0,0 +1,48 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	FILE *file;
+	float a;
+	int i;
+
+	close(0);
+	close(1);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+	}
+
+	fprintf(file, "world, hello (%.2f) !\n", a);
+	fflush(file);
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..4c75441
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,229 @@
+
+	      How to use Checkpoint-Restart
+	=========================================
+
+
+API
+===
+
+The API consists of three new system calls:
+
+* long checkpoint(pid_t pid, int fd, unsigned long flags);
+
+ Checkpoint a (sub-)container whose root task is identified by @pid,
+ to the open file indicated by @fd. @flags may be one or more of:
+   - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
+ (other values are not allowed).
+
+ Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
+ it returns from a restart, and -1 if an error occurs. The ckptid will
+ uniquely identify a checkpoint image, for as long as the checkpoint
+ is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
+ partial checkpoint, residing in kernel memory).
+
+* long sys_restart(pid_t pid, int fd, unsigned long flags);
+
+ Restart a process hierarchy from a checkpoint image that is read from
+ the blob stored in the file indicated by @fd. @pid indicates the root
+ of the hierarchy as seen in the coordinator's pid-namespace, and is
+ expected to be a child of the coordinator. @flags may be zero or more
+ of:
+   - RESTART_TASKSELF : (self) restart of a single process
+   - RESTART_FROZEN : processes remain frozen once restart completes
+ (Note that @pid may mean 'ckptid' to identify an in-kernel
+ checkpoint image, with some @flags in the future).
+
+ Returns: -1 if an error occurs, 0 on success when restarting from a
+ "self" checkpoint, and return value of system call at the time of the
+ checkpoint when restarting from an "external" checkpoint.
+
+ (If a process was frozen for checkpoint while in userspace, it will
+ resume running in userspace exactly where it was interrupted. If it
+ was frozen while in kernel doing a syscall, it will return what the
+ syscall returned when interrupted/completed, and proceed from there
+ as if it had only been frozen and then thawed. Finally, if it did a
+ self-checkpoint, it will resume to the first instruction after the
+ call to checkpoint(2), having returned 0, to indicate whether the
+ return is from the checkpoint or a restart).
+
+* int clone_with_pid(unsigned long clone_flags, void *news,
+		     int *parent_tidptr, int *child_tidptr,
+		     struct target_pid_set *pid_set)
+
+  struct target_pid_set {
+	 int num_pids;
+	 pid_t *target_pids;
+  };
+
+ Container restart requires that a task have the same pid it had when
+ it was checkpointed. When containers are nested the tasks within the
+ containers exist in multiple pid namespaces and hence have multiple
+ pids to specify during restart.
+
+ clone_with_pids(), intended for use during restart, is similar to
+ clone(), except that it takes a 'target_pid_set' parameter. This
+ parameter lets the caller choose specific pid numbers for the child
+ process, in the process's active and ancestor pid namespaces.
+
+ Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for
+ now, to prevent unprivileged processes from misusing this interface.
+
+ If a target-pid is 0, the kernel continues to assign a pid for the
+ process in that namespace. If a requested pid is taken, the system
+ call fails with -EBUSY. If 'pid_set.num_pids' exceeds the current
+ nesting level of pid namespaces, the system call fails with -EINVAL.
+
+
+Sysctl/proc
+===========
+
+/proc/sys/kernel/ckpt_unpriv_allowed		[default = 1]
+  controls whether c/r operation is allowed for unprivileged users
+
+
+Operation
+=========
+
+The granularity of a checkpoint usually is a process hierarchy. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 in the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+Unless the CHECKPOINT_SUBTREE flag is set, if the caller passes a pid
+which does not refer to a container's init task, then sys_checkpoint()
+would return -EINVAL.
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases, if
+there are other tasks possibly sharing state with the container, they
+must not modify it during the operation. It is the responsibility of
+the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g. not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+User tools
+==========
+
+* checkpoint(1): a tool to perform a checkpoint of a container/subtree
+* restart(1): a tool to restart a container/subtree
+* ckptinfo: a tool to examine a checkpoint image
+
+It is best to use the dedicated user tools for checkpoint and restart.
+
+If you insist, then here is a code snippet that illustrates how a
+checkpoint is initiated by a process inside a container - the logic is
+similar to fork():
+	...
+	ckptid = checkpoint(0, ...);
+	switch (ckptid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, ckptid=%d\n", ckptid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(pid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+Note that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships of
+the task with other tasks, or any shared resources. It is useful for
+applications that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+
+You may find the following sample programs useful:
+
+* checkpoint.c: accepts a 'pid' argument and checkpoints that task to stdout
+* checkpoint_self.c: a simple test program doing self-checkpoint
+* restart_self.c: restarts a (self-) checkpoint image from stdin
+* test.c: a simple test program to checkpoint
+
+
+"External" checkpoint
+=====================
+
+To do "external" checkpoint, you need to first freeze that other task
+using the freezer cgroup.
+
+Restart does not preserve the original PID yet (because we haven't
+yet solved the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new namespace, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+	$ ./test &
+	[1] 3493
+
+	$ echo 3493 > /cgroup/0/tasks
+	$ echo FROZEN > /cgroup/0/freezer.state
+	$ ./checkpoint 3493 > ckpt.image
+
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ echo THAWED > /cgroup/0/freezer.state
+
+	$ ./restart_self < ckpt.image
+Now compare the output of the two output files.
+
+
+"Self" checkpoint
+=================
+
+To do "self" checkpoint, you can incorporate the code from ckpt.c into
+your application.
+
+Here is how to test the "self" checkpoint:
+	$ ./checkpoint_self > self.image &
+	[1] 3512
+
+	$ sleep 3
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ cat /tmp/cr-test.out
+	hello, world (85.46)!
+	count 0 (85.46)!
+	count 1 (85.56)!
+	count 2 (85.76)!
+	count 3 (86.46)!
+
+	$ sed -i 's/count/xxxx/g' /tmp/cr-test.out
+
+	$ ./restart_self < self.image &
+Now compare the contents of the two output files.
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "rstr" changed its name
+to "test" or "self", as expected.
-- 
1.6.0.4


* [PATCH v18 20/80] c/r: basic infrastructure for checkpoint/restart
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (18 preceding siblings ...)
  2009-09-23 23:50   ` [PATCH v18 19/80] c/r: documentation Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 21/80] c/r: x86_32 support " Oren Laadan
                     ` (61 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

Add the basic checkpoint/restart interfaces (sys_checkpoint and
sys_restart), as well as the helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
  c/r context (a per-checkpoint data structure for housekeeping)

checkpoint/checkpoint.c - output wrappers and basic checkpoint handling

checkpoint/restart.c - input wrappers and basic restart handling

checkpoint/process.c - c/r of task data

For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.

Per-architecture support and the actual memory checkpoint work follow
in subsequent patches.
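
As a rough userspace illustration of the file format these helpers manage
(not code from this patch): every object in the image is a header carrying
a type and a total length that includes the header itself, followed by the
payload. The field widths of the header and the type value used below are
assumptions made for this sketch; the real definitions are in
include/linux/checkpoint_hdr.h. All demo_* names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout, for illustration only */
struct demo_hdr {
	uint32_t type;
	uint32_t len;	/* header + payload, as in ckpt_write_obj_type() */
};

#define DEMO_TYPE_STRING 3	/* hypothetical type value */

/* Append one framed object to buf; returns the number of bytes written */
static size_t demo_write_obj(unsigned char *buf, const void *payload,
			     uint32_t plen, uint32_t type)
{
	struct demo_hdr h;

	h.type = type;
	h.len = (uint32_t)(plen + sizeof(h));

	memcpy(buf, &h, sizeof(h));
	if (payload)	/* NULL payload: header only, payload to follow */
		memcpy(buf + sizeof(h), payload, plen);
	return sizeof(h) + (payload ? plen : 0);
}

/*
 * Read one framed object of the expected type. Returns the payload
 * length, or -1 on a malformed header, a type mismatch, or a payload
 * larger than max.
 */
static int demo_read_obj(const unsigned char *buf, size_t avail,
			 void *out, uint32_t max, uint32_t type)
{
	struct demo_hdr h;
	uint32_t plen;

	if (avail < sizeof(h))
		return -1;
	memcpy(&h, buf, sizeof(h));
	if (h.len < sizeof(h) || h.len > avail)
		return -1;
	plen = h.len - (uint32_t)sizeof(h);
	if (h.type != type || plen > max)
		return -1;
	memcpy(out, buf + sizeof(h), plen);
	return (int)plen;
}

/* Write a string object and read it back; returns 0 on success */
int demo_roundtrip(void)
{
	unsigned char image[64];
	char out[16];
	size_t n = demo_write_obj(image, "hello", 6, DEMO_TYPE_STRING);
	int plen = demo_read_obj(image, n, out, sizeof(out), DEMO_TYPE_STRING);

	if (plen != 6)
		return -1;
	out[plen - 1] = '\0';	/* force termination, as ckpt_read_string() does */
	return strcmp(out, "hello");
}
```

The reader rejects a header whose length is smaller than the header
itself, which mirrors the "h->len < sizeof(*h)" check in _ckpt_read_obj().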


Changelog[v18]:
  - Detect error-headers in input data on restart, and abort.
  - Standard format for checkpoint error strings (and documentation)
  - [Matt Helsley] Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - [Dan Smith] Add an errno validation function
  - Add ckpt_read_payload(): read a variable-length object (no header)
  - Add ckpt_read_string(): same for strings (ensures null-terminated)
  - Add ckpt_read_consume(): consumes next object without processing
Changelog[v17]:
  - Fix compilation for architectures that don't support checkpoint
  - Save/restore t->{set,clear}_child_tid
  - Restart(2) isn't idempotent: must return -EINTR if interrupted
  - ckpt_debug does not depend on DYNAMIC_DEBUG, on by default
  - Export generic checkpoint headers to userspace
  - Fix comment for prototype of sys_restart
  - Have ckpt_debug() print global-pid and __LINE__
  - Only save and test kernel constants once (in header)
Changelog[v16]:
  - Split ctx->flags to ->uflags (user flags) and ->kflags (kernel flags)
  - Introduce __ckpt_write_err() and ckpt_write_err() to report errors
  - Allow @ptr == NULL to write (or read) header only without payload
  - Introduce _ckpt_read_obj_type()
Changelog[v15]:
  - Replace header buffer in ckpt_ctx (hbuf,hpos) with kmalloc/kfree()
Changelog[v14]:
  - Cleanup interface to get/put hdr buffers
  - Merge checkpoint and restart code into a single file (per subsystem)
  - Take uts_sem around access to uts->{release,version,machine}
  - Embed ckpt_hdr in all ckpt_hdr_...., cleanup read/write helpers
  - Define sys_checkpoint(0,...) as asking for a self-checkpoint (Serge)
  - Revert use of 'pr_fmt' to avoid tainting whoever includes us (Nathan Lynch)
  - Explicitly indicate length of UTS fields in header
  - Discard field 'h->parent' from ckpt_hdr
Changelog[v12]:
  - ckpt_kwrite/ckpt_kread() again use vfs_read(), vfs_write() (safer)
  - Split ckpt_write/ckpt_read() to two parts: _ckpt_write/read() helper
  - Befriend sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' instead of using special ckpt_debug()
Changelog[v10]:
  - add ckpt_write_buffer(), ckpt_read_buffer() and ckpt_read_buf_type()
  - force end-of-string in ckpt_read_string() (fix possible DoS)
Changelog[v9]:
  - ckpt_kwrite/ckpt_kread() use file->f_op->write() directly
  - Drop ckpt_uwrite/ckpt_uread() since they aren't used anywhere
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (although it's not really needed)
Changelog[v5]:
  - Rename header files s/ckpt/checkpoint/
Changelog[v2]:
  - Added utsname->{release,version,machine} to checkpoint header
  - Pad header structures to 64 bits to ensure compatibility
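
The standard error-string format listed in the v18 changelog can be
sketched in userspace roughly as follows. This is an illustration, not
the kernel code: it copies the key table from prefmt_array in
checkpoint/checkpoint.c, leaves out the 'T' (task) key, and rejects
unknown keys instead of substituting a placeholder. All demo_* names are
hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Key table copied from prefmt_array in checkpoint/checkpoint.c */
static const struct {
	char key;
	const char *fmt;
} demo_prefmt[] = {
	{ 'E', "err %d" },
	{ 'O', "obj %d" },
	{ 'P', "ptr %p" },
	{ 'V', "sym %pS" },	/* %pS is kernel-only; only copied, never printed here */
	{ 'S', "str %s" },
};

/*
 * Build "[<snippets>]: <fmt>" into out, or just "<fmt>" when prefmt is
 * empty. Returns 0 on success, -1 if the result does not fit or prefmt
 * contains an unknown key.
 */
static int demo_generate_fmt(char *out, size_t size,
			     const char *prefmt, const char *fmt)
{
	const size_t nkeys = sizeof(demo_prefmt) / sizeof(demo_prefmt[0]);
	size_t len = 0, i, j;
	int n;

	if (size == 0)
		return -1;
	out[0] = '\0';

	for (i = 0; prefmt[i]; i++) {
		for (j = 0; j < nkeys; j++)
			if (demo_prefmt[j].key == prefmt[i])
				break;
		if (j == nkeys)
			return -1;
		/* "[" before the first snippet, " " between snippets */
		n = snprintf(out + len, size - len, "%s%s",
			     i ? " " : "[", demo_prefmt[j].fmt);
		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += (size_t)n;
	}

	/* close the bracket only if at least one snippet was emitted */
	n = snprintf(out + len, size - len, "%s%s", len ? "]: " : "", fmt);
	if (n < 0 || (size_t)n >= size - len)
		return -1;
	return 0;
}
```

With prefmt "EO" and fmt "FILE flags %d", this sketch produces the
format "[err %d obj %d]: FILE flags %d", so the caller supplies err and
objref first, followed by the arguments of its own message.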

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 Makefile                         |    2 +-
 checkpoint/Makefile              |    6 +-
 checkpoint/checkpoint.c          |  369 +++++++++++++++++++++++++++++++++
 checkpoint/process.c             |  102 +++++++++
 checkpoint/restart.c             |  422 ++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c                 |  247 ++++++++++++++++++++++-
 include/linux/Kbuild             |    3 +
 include/linux/checkpoint.h       |  109 ++++++++++
 include/linux/checkpoint_hdr.h   |  111 ++++++++++
 include/linux/checkpoint_types.h |   35 +++
 include/linux/magic.h            |    4 +
 lib/Kconfig.debug                |   13 ++
 12 files changed, 1419 insertions(+), 4 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/process.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h
 create mode 100644 include/linux/checkpoint_types.h

diff --git a/Makefile b/Makefile
index fe45658..2e44d0f 100644
--- a/Makefile
+++ b/Makefile
@@ -639,7 +639,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 8a32c6f..99364cc 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,8 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT) += sys.o
+obj-$(CONFIG_CHECKPOINT) += \
+	sys.o \
+	checkpoint.o \
+	restart.o \
+	process.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..57eb7d8
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,369 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/version.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?) */
+static atomic_t ctx_count = ATOMIC_INIT(0);
+
+/**
+ * ckpt_write_obj - write an object
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ */
+int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
+{
+	_ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+	return ckpt_kwrite(ctx, h, h->len);
+}
+
+/**
+ * ckpt_write_obj_type - write an object (from a pointer)
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ * @type: desired type
+ *
+ * If @ptr is NULL, then write only the header (payload to follow)
+ */
+int ckpt_write_obj_type(struct ckpt_ctx *ctx, void *ptr, int len, int type)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get(ctx, sizeof(*h));
+	if (!h)
+		return -ENOMEM;
+
+	h->type = type;
+	h->len = len + sizeof(*h);
+
+	_ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+	ret = ckpt_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		goto out;
+	if (ptr)
+		ret = ckpt_kwrite(ctx, ptr, len);
+ out:
+	_ckpt_hdr_put(ctx, h, sizeof(*h));
+	return ret;
+}
+
+/**
+ * ckpt_write_buffer - write an object of type buffer
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ */
+int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	return ckpt_write_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+
+/**
+ * ckpt_write_string - write an object of type string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len)
+{
+	return ckpt_write_obj_type(ctx, str, len, CKPT_HDR_STRING);
+}
+
+/*
+ * __ckpt_generate_fmt - generate standard checkpoint error message
+ * @ctx: checkpoint context
+ * @prefmt: pre-format string
+ * @fmt: message format
+ *
+ * This generates a unified format of checkpoint error messages, to
+ * ease (after the failure) inspection by userspace tools. It converts
+ * the (printf) message @fmt into a new format: "[PREFMT]: fmt".
+ *
+ * PREFMT is constructed from @prefmt by substituting format snippets
+ * according to the contents of @prefmt.  The format characters in
+ * @prefmt can be E (error), O (objref), P (pointer), S (string) and
+ * V (variable/symbol). For example, E will generate a "err %d" in
+ * PREFMT (see prefmt_array below).
+ *
+ * If @prefmt begins with T, PREFMT will begin with "pid %d tsk %s"
+ * with the pid and the tsk->comm of the currently checkpointed task.
+ * The latter is taken from ctx->tsk, and it is the responsibility of
+ * the caller to have a valid pointer there (in particular, functions
+ * that iterate on the processes: collect_objects, checkpoint_task,
+ * and tree_count_tasks).
+ *
+ * The caller of ckpt_write_err() and __ckpt_write_err() must provide
+ * the additional variables, in order, to match the @prefmt (except for
+ * the T key), e.g.:
+ *
+ *   ckpt_write_err(ctx, "TEO", "FILE flags %d", err, objref, flags);
+ *
+ * Here, T is simply passed, E expects an integer (err), O expects an
+ * integer (objref), and the last argument matches the format string.
+ */
+static char *__ckpt_generate_fmt(struct ckpt_ctx *ctx, char *prefmt, char *fmt)
+{
+	static int warn_notask = 0;
+	static int warn_prefmt = 0;
+	char *format;
+	int i, j, len = 0;
+
+	static struct {
+		char key;
+		char *fmt;
+	} prefmt_array[] = {
+		{ 'E', "err %d" },
+		{ 'O', "obj %d" },
+		{ 'P', "ptr %p" },
+		{ 'V', "sym %pS" },
+		{ 'S', "str %s" },
+		{ 0, "??? %pS" },
+	};
+
+	/*
+	 * 17 for "pid %d" (plus space)
+	 * 21 for "tsk %s" (tsk->comm)
+	 * up to 8 per prefmt entry
+	 */
+	format = kzalloc(37 + 8 * strlen(prefmt) + strlen(fmt), GFP_KERNEL);
+	if (!format)
+		return NULL;
+
+	format[len++] = '[';
+
+	if (prefmt[0] == 'T') {
+		if (ctx->tsk)
+			len += sprintf(&format[len], "pid %d tsk %s ",
+				       task_pid_vnr(ctx->tsk), ctx->tsk->comm);
+		else if (warn_notask++ < 5)
+			printk(KERN_ERR "c/r: no target task set\n");
+		prefmt++;
+	}
+
+	for (i = 0; i < strlen(prefmt); i++) {
+		for (j = 0; prefmt_array[j].key; j++)
+			if (prefmt_array[j].key == prefmt[i])
+				break;
+		if (!prefmt_array[j].key && warn_prefmt++ < 5)
+			printk(KERN_ERR "c/r: unknown prefmt %c\n", prefmt[i]);
+		len += sprintf(&format[len], "%s ", prefmt_array[j].fmt);
+	}
+
+	if (len > 1)
+		sprintf(&format[len-1], "]: %s", fmt);  /* erase last space */
+	else
+		sprintf(format, "%s", fmt);
+
+	return format;
+}
+
+/* see __ckpt_generate_fmt for information on @prefmt */
+static void __ckpt_generate_err(struct ckpt_ctx *ctx, char *prefmt,
+				char *fmt, va_list ap)
+{
+	va_list aq;
+	char *format;
+	char *str;
+	int len;
+
+	format = __ckpt_generate_fmt(ctx, prefmt, fmt);
+	va_copy(aq, ap);
+
+	/*
+	 * prefix the error string with a '\0' to facilitate easy
+	 * backtrace to the beginning of the error message without
+	 * needing to parse the entire checkpoint image.
+	 */
+	ctx->err_string[0] = '\0';
+	str = &ctx->err_string[1];
+	len = vsnprintf(str, 255, format ? : fmt, ap) + 2;
+
+	if (len > 256) {
+		printk(KERN_NOTICE "c/r: error string truncated: ");
+		vprintk(format ? : fmt, aq);
+	}
+
+	va_end(aq);
+	kfree(format);
+
+	ckpt_debug("c/r: checkpoint error: %s\n", str);
+}
+
+/**
+ * __ckpt_write_err - save an error string on the ctx->err_string
+ * @ctx: checkpoint context
+ * @prefmt: error pre-format
+ * @fmt: message format
+ * @...: arguments
+ *
+ * See __ckpt_generate_fmt for information on @prefmt.
+ * Use this during checkpoint to report an error while holding a spinlock
+ */
+void __ckpt_write_err(struct ckpt_ctx *ctx, char *prefmt, char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	__ckpt_generate_err(ctx, prefmt, fmt, ap);
+	va_end(ap);
+}
+
+/**
+ * ckpt_write_err - write an object describing an error
+ * @ctx: checkpoint context
+ * @pre: string pre-format
+ * @fmt: error string format
+ * @...: error string arguments
+ *
+ * See __ckpt_generate_fmt for information on @pre.
+ * If @fmt is null, the string in the ctx->err_string will be used (and cleared)
+ */
+int ckpt_write_err(struct ckpt_ctx *ctx, char *pre, char *fmt, ...)
+{
+	va_list ap;
+	char *str;
+	int len, ret = 0;
+
+	if (fmt) {
+		va_start(ap, fmt);
+		__ckpt_generate_err(ctx, pre, fmt, ap);
+		va_end(ap);
+	}
+
+	str = ctx->err_string;
+	len = strlen(str + 1) + 2;	/* leading and trailing '\0' */
+
+	if (len == 2)	/* empty error string */
+		return 0;
+
+	ret = ckpt_write_obj_type(ctx, NULL, 0, CKPT_HDR_ERROR);
+	if (!ret)
+		ret = ckpt_write_string(ctx, str, len);
+	if (ret < 0)
+		printk(KERN_NOTICE "c/r: error string unsaved (%d): %s\n",
+		       ret, str + 1);
+
+	str[1] = '\0';
+	return ret;
+}
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+static void fill_kernel_const(struct ckpt_const *h)
+{
+	struct task_struct *tsk;
+	struct new_utsname *uts;
+
+	/* task */
+	h->task_comm_len = sizeof(tsk->comm);
+	/* uts */
+	h->uts_release_len = sizeof(uts->release);
+	h->uts_version_len = sizeof(uts->version);
+	h->uts_machine_len = sizeof(uts->machine);
+}
+
+/* write the checkpoint header */
+static int checkpoint_write_header(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header *h;
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+	if (!h)
+		return -ENOMEM;
+
+	do_gettimeofday(&ktv);
+	uts = utsname();
+
+	h->magic = CHECKPOINT_MAGIC_HEAD;
+	h->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	h->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	h->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	h->rev = CHECKPOINT_VERSION;
+
+	h->uflags = ctx->uflags;
+	h->time = ktv.tv_sec;
+
+	fill_kernel_const(&h->constants);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	down_read(&uts_sem);
+	ret = ckpt_write_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret < 0)
+		goto up;
+	ret = ckpt_write_buffer(ctx, uts->version, sizeof(uts->version));
+	if (ret < 0)
+		goto up;
+	ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine));
+ up:
+	up_read(&uts_sem);
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int checkpoint_write_tail(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tail *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+	if (!h)
+		return -ENOMEM;
+
+	h->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
+{
+	long ret;
+
+	ret = checkpoint_write_header(ctx);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, return (unique) checkpoint identifier */
+	ctx->crid = atomic_inc_return(&ctx_count);
+	ret = ctx->crid;
+ out:
+	return ret;
+}
diff --git a/checkpoint/process.c b/checkpoint/process.c
new file mode 100644
index 0000000..d221c2a
--- /dev/null
+++ b/checkpoint/process.c
@@ -0,0 +1,102 @@
+/*
+ *  Checkpoint task structure
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/sched.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+/* dump the task_struct of a given task */
+static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+	if (!h)
+		return -ENOMEM;
+
+	h->state = t->state;
+	h->exit_state = t->exit_state;
+	h->exit_code = t->exit_code;
+	h->exit_signal = t->exit_signal;
+
+	h->set_child_tid = (unsigned long) t->set_child_tid;
+	h->clear_child_tid = (unsigned long) t->clear_child_tid;
+
+	/* FIXME: save remaining relevant task_struct fields */
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	int ret;
+
+	ctx->tsk = t;
+
+	ret = checkpoint_task_struct(ctx, t);
+	ckpt_debug("task %d\n", ret);
+
+	ctx->tsk = NULL;
+	return ret;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+/* read the task_struct into the current task */
+static int restore_task_struct(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task *h;
+	struct task_struct *t = current;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	memset(t->comm, 0, TASK_COMM_LEN);
+	ret = _ckpt_read_string(ctx, t->comm, TASK_COMM_LEN);
+	if (ret < 0)
+		goto out;
+
+	t->set_child_tid = (int __user *) (unsigned long) h->set_child_tid;
+	t->clear_child_tid = (int __user *) (unsigned long) h->clear_child_tid;
+
+	/* FIXME: restore remaining relevant task_struct fields */
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* read the entire state of the current task */
+int restore_task(struct ckpt_ctx *ctx)
+{
+	int ret;
+
+	ret = restore_task_struct(ctx);
+	ckpt_debug("task %d\n", ret);
+
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..9f08f4d
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,422 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/utsname.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+static int _ckpt_read_err(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
+{
+	char *ptr;
+	int len, ret;
+
+	len = h->len - sizeof(*h);
+	ptr = kzalloc(len + 1, GFP_KERNEL);
+	if (!ptr) {
+		ckpt_debug("insufficient memory to report image error\n");
+		return -ENOMEM;
+	}
+
+	ret = ckpt_kread(ctx, ptr, len);
+	if (ret >= 0) {
+		ckpt_debug("%s\n", &ptr[1]);
+		ret = -EIO;
+	}
+
+	kfree(ptr);
+	return ret;
+}
+
+/**
+ * _ckpt_read_obj - read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: desired ckpt_hdr
+ * @ptr: desired buffer
+ * @len: desired payload length (if 0, flexible)
+ * @max: maximum payload length (if 0, flexible)
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
+			  void *ptr, int len, int max)
+{
+	int ret;
+
+ again:
+	ret = ckpt_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	_ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
+		    h->type, h->len, len, max);
+	if (h->len < sizeof(*h))
+		return -EINVAL;
+
+	if (h->type == CKPT_HDR_ERROR) {
+		ret = _ckpt_read_err(ctx, h);
+		if (ret < 0)
+			return ret;
+		goto again;
+	}
+
+	/* if len specified, enforce, else if maximum specified, enforce */
+	if ((len && h->len != len) || (!len && max && h->len > max))
+		return -EINVAL;
+
+	if (ptr)
+		ret = ckpt_kread(ctx, ptr, h->len - sizeof(struct ckpt_hdr));
+	return ret;
+}
+
+/**
+ * _ckpt_read_obj_type - read an object of some type
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ * @type: buffer type
+ *
+ * If @ptr is NULL, then read only the header (payload to follow).
+ * @len specifies the expected buffer length (ignored if set to 0).
+ * Returns: actual _payload_ length
+ */
+int _ckpt_read_obj_type(struct ckpt_ctx *ctx, void *ptr, int len, int type)
+{
+	struct ckpt_hdr h;
+	int ret;
+
+	if (len)
+		len += sizeof(struct ckpt_hdr);
+	ret = _ckpt_read_obj(ctx, &h, ptr, len, len);
+	if (ret < 0)
+		return ret;
+	if (h.type != type)
+		return -EINVAL;
+	return h.len - sizeof(h);
+}
+
+/**
+ * _ckpt_read_buffer - read an object of type buffer (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ *
+ * If @ptr is NULL, then read only the header (payload to follow).
+ * @len specifies the expected buffer length (ignored if set to 0).
+ * Returns: _payload_ length.
+ */
+int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	BUG_ON(!len);
+	return _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+
+/**
+ * _ckpt_read_string - read an object of type string (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: string length (including '\0')
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	int ret;
+
+	BUG_ON(!len);
+	ret = _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_STRING);
+	if (ret < 0)
+		return ret;
+	if (ptr)
+		((char *) ptr)[len - 1] = '\0';	/* always play it safe */
+	return 0;
+}
+
+/**
+ * ckpt_read_obj - allocate and read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ * @len: desired payload length (if 0, flexible)
+ * @max: maximum payload length
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
+{
+	struct ckpt_hdr hh;
+	struct ckpt_hdr *h;
+	int ret;
+
+	ret = ckpt_kread(ctx, &hh, sizeof(hh));
+	if (ret < 0)
+		return ERR_PTR(ret);
+	_ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
+		    hh.type, hh.len, len, max);
+	if (hh.len < sizeof(*h))
+		return ERR_PTR(-EINVAL);
+	/* if len specified, enforce, else if maximum specified, enforce */
+	if ((len && hh.len != len) || (!len && max && hh.len > max))
+		return ERR_PTR(-EINVAL);
+
+	h = ckpt_hdr_get(ctx, hh.len);
+	if (!h)
+		return ERR_PTR(-ENOMEM);
+
+	*h = hh;	/* yay ! */
+
+	ret = ckpt_kread(ctx, (h + 1), hh.len - sizeof(struct ckpt_hdr));
+	if (ret < 0) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(ret);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_obj_type - allocate and read an object of some type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	BUG_ON(!len);
+
+	h = ckpt_read_obj(ctx, len, len);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->type != type) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_buf_type - allocate and read an object of some type (flexible)
+ * @ctx: checkpoint context
+ * @max: maximum object length
+ * @type: desired object type
+ *
+ * This differs from ckpt_read_obj_type() in that the length of the
+ * incoming object is flexible (up to the maximum specified by @max;
+ * unlimited if @max is 0), as determined by the ckpt_hdr data.
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int max, int type)
+{
+	struct ckpt_hdr *h;
+
+	h = ckpt_read_obj(ctx, 0, max);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->type != type) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_payload - allocate and read the payload of an object
+ * @ctx: checkpoint context
+ * @max: maximum payload length
+ * @ptr: pointer to buffer to be allocated (caller must free)
+ * @type: desired object type
+ *
+ * This can be used to read a variable-length _payload_ from the checkpoint
+ * stream. @max limits the size of the resulting buffer.
+ *
+ * Return: actual _payload_ length
+ */
+int ckpt_read_payload(struct ckpt_ctx *ctx, void **ptr, int max, int type)
+{
+	int len, ret;
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, type);
+	if (len < 0)
+		return len;
+	else if (len > max)
+		return -EINVAL;
+
+	*ptr = kmalloc(len, GFP_KERNEL);
+	if (!*ptr)
+		return -ENOMEM;
+
+	ret = ckpt_kread(ctx, *ptr, len);
+	if (ret < 0) {
+		kfree(*ptr);
+		return ret;
+	}
+
+	return len;
+}
+
+/**
+ * ckpt_read_string - allocate and read a string (variable length)
+ * @ctx: checkpoint context
+ * @max: maximum acceptable length
+ *
+ * Return: allocated string or error pointer
+ */
+char *ckpt_read_string(struct ckpt_ctx *ctx, int max)
+{
+	char *str;
+	int len;
+
+	len = ckpt_read_payload(ctx, (void **)&str, max, CKPT_HDR_STRING);
+	if (len < 0)
+		return ERR_PTR(len);
+	str[len - 1] = '\0';  	/* always play it safe */
+	return str;
+}
+
+/**
+ * ckpt_read_consume - consume the next object of expected type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * This can be used to skip an object in the input stream when the
+ * data is unnecessary for the restart. @len indicates the length of
+ * the object; if @len is zero the length is unconstrained.
+ */
+int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+	int ret = 0;
+
+	h = ckpt_read_obj(ctx, len, 0);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	if (h->type != type)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+static int check_kernel_const(struct ckpt_const *h)
+{
+	struct task_struct *tsk;
+	struct new_utsname *uts;
+
+	/* task */
+	if (h->task_comm_len != sizeof(tsk->comm))
+		return -EINVAL;
+	/* uts */
+	if (h->uts_release_len != sizeof(uts->release))
+		return -EINVAL;
+	if (h->uts_version_len != sizeof(uts->version))
+		return -EINVAL;
+	if (h->uts_machine_len != sizeof(uts->machine))
+		return -EINVAL;
+
+	return 0;
+}
+
+/* read the checkpoint header */
+static int restore_read_header(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header *h;
+	struct new_utsname *uts = NULL;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->magic != CHECKPOINT_MAGIC_HEAD ||
+	    h->rev != CHECKPOINT_VERSION ||
+	    h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    h->patch != ((LINUX_VERSION_CODE) & 0xff))
+		goto out;
+	if (h->uflags)
+		goto out;
+
+	ret = check_kernel_const(&h->constants);
+	if (ret < 0)
+		goto out;
+
+	ret = -ENOMEM;
+	uts = kmalloc(sizeof(*uts), GFP_KERNEL);
+	if (!uts)
+		goto out;
+
+	ctx->oflags = h->uflags;
+
+	/* FIXME: verify compatibility of release, version and machine */
+	ret = _ckpt_read_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret < 0)
+		goto out;
+	ret = _ckpt_read_buffer(ctx, uts->version, sizeof(uts->version));
+	if (ret < 0)
+		goto out;
+	ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine));
+ out:
+	kfree(uts);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* read the checkpoint trailer */
+static int restore_read_tail(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tail *h;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	if (h->magic != CHECKPOINT_MAGIC_TAIL)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+long do_restart(struct ckpt_ctx *ctx, pid_t pid)
+{
+	long ret;
+
+	ret = restore_read_header(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_task(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_read_tail(ctx);
+
+	/* on success, adjust the return value if needed [TODO] */
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 79936cc..7f6f71e 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -8,9 +8,192 @@
  *  distribution for more details.
  */
 
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
 #include <linux/sched.h>
 #include <linux/kernel.h>
 #include <linux/syscalls.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   ckpt_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		uaddr += nwrite;
+	}
+	return 0;
+}
+
+int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kwrite(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+static inline int _ckpt_kread(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE;		/* unexpected EOF */
+			return nread;
+		}
+		uaddr += nread;
+	}
+	return 0;
+}
+
+int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kread(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+/**
+ * ckpt_hdr_get - get a hdr of certain size
+ * @ctx: checkpoint context
+ * @len: desired length
+ *
+ * Returns pointer to header
+ */
+void *ckpt_hdr_get(struct ckpt_ctx *ctx, int len)
+{
+	return kzalloc(len, GFP_KERNEL);
+}
+
+/**
+ * _ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ * @len: header length
+ *
+ * (requiring 'ptr' makes it easily interchangeable with kmalloc/kfree)
+ */
+void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	kfree(ptr);
+}
+
+/**
+ * ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ *
+ * It is assumed that @ptr begins with a 'struct ckpt_hdr'.
+ */
+void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct ckpt_hdr *h = (struct ckpt_hdr *) ptr;
+	_ckpt_hdr_put(ctx, ptr, h->len);
+}
+
+/**
+ * ckpt_hdr_get_type - get a hdr of certain size and type
+ * @ctx: checkpoint context
+ * @len: number of bytes to reserve
+ *
+ * Returns pointer to a zeroed header with type and len set, or NULL
+ */
+void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	h = ckpt_hdr_get(ctx, len);
+	if (!h)
+		return NULL;
+
+	h->type = type;
+	h->len = len;
+	return h;
+}
+
+
+/*
+ * Helpers to manage c/r contexts: a context is allocated for each
+ * checkpoint or restart operation, and persists until it is completed.
+ */
+
+static void ckpt_ctx_free(struct ckpt_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+	kfree(ctx);
+}
+
+static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
+				       unsigned long kflags)
+{
+	struct ckpt_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->uflags = uflags;
+	ctx->kflags = kflags;
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+
+	return ctx;
+ err:
+	ckpt_ctx_free(ctx);
+	return ERR_PTR(err);
+}
 
 /**
  * sys_checkpoint - checkpoint a container
@@ -23,7 +206,26 @@
  */
 SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
 {
-	return -ENOSYS;
+	struct ckpt_ctx *ctx;
+	long ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	if (pid == 0)
+		pid = task_pid_vnr(current);
+	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_CHECKPOINT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx, pid);
+
+	if (!ret)
+		ret = ctx->crid;
+
+	ckpt_ctx_free(ctx);
+	return ret;
 }
 
 /**
@@ -37,5 +239,46 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
  */
 SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 {
-	return -ENOSYS;
+	struct ckpt_ctx *ctx = NULL;
+	long ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_restart(ctx, pid);
+
+	/* restart(2) isn't idempotent: can't restart syscall */
+	if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+	    ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
+		ret = -EINTR;
+
+	ckpt_ctx_free(ctx);
+	return ret;
+}
+
+
+/* 'ckpt_debug_level' controls the verbosity level of c/r code */
+#ifdef CONFIG_CHECKPOINT_DEBUG
+
+/* FIX: allow to change during runtime */
+unsigned long __read_mostly ckpt_debug_level = CKPT_DDEFAULT;
+
+static __init int ckpt_debug_setup(char *s)
+{
+	long val, ret;
+
+	ret = strict_strtoul(s, 10, &val);
+	if (ret < 0)
+		return ret;
+	ckpt_debug_level = val;
+	return 0;
 }
+
+__setup("ckpt_debug=", ckpt_debug_setup);
+
+#endif /* CONFIG_CHECKPOINT_DEBUG */
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 334a359..3e8bd18 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -44,6 +44,9 @@ header-y += bpqether.h
 header-y += bsg.h
 header-y += can.h
 header-y += cdk.h
+header-y += checkpoint.h
+header-y += checkpoint_hdr.h
+header-y += checkpoint_types.h
 header-y += chio.h
 header-y += coda_psdev.h
 header-y += coff.h
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..be0ba4b
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,109 @@
+#ifndef _LINUX_CHECKPOINT_H_
+#define _LINUX_CHECKPOINT_H_
+/*
+ *  Generic checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define CHECKPOINT_VERSION  2
+
+#ifdef __KERNEL__
+#ifdef CONFIG_CHECKPOINT
+
+#include <linux/checkpoint_types.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/err.h>
+
+/* ckpt_ctx: kflags */
+#define CKPT_CTX_CHECKPOINT_BIT		0
+#define CKPT_CTX_RESTART_BIT		1
+
+#define CKPT_CTX_CHECKPOINT	(1 << CKPT_CTX_CHECKPOINT_BIT)
+#define CKPT_CTX_RESTART	(1 << CKPT_CTX_RESTART_BIT)
+
+
+extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
+extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
+
+extern void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int n);
+extern void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr);
+extern void *ckpt_hdr_get(struct ckpt_ctx *ctx, int n);
+extern void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int n, int type);
+
+extern int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h);
+extern int ckpt_write_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, int len, int type);
+extern int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len);
+extern void __ckpt_write_err(struct ckpt_ctx *ctx, char *ptr, char *fmt, ...);
+extern int ckpt_write_err(struct ckpt_ctx *ctx, char *ptr, char *fmt, ...);
+
+extern int _ckpt_read_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, int len, int type);
+extern int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len);
+extern void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type);
+extern void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type);
+extern int ckpt_read_payload(struct ckpt_ctx *ctx,
+			     void **ptr, int max, int type);
+extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
+extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
+
+extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
+extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
+
+/* task */
+extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_task(struct ckpt_ctx *ctx);
+
+static inline int ckpt_validate_errno(int errno)
+{
+	return (errno >= 0) && (errno < MAX_ERRNO);
+}
+
+/* debugging flags */
+#define CKPT_DBASE	0x1		/* anything */
+#define CKPT_DSYS	0x2		/* generic (system) */
+#define CKPT_DRW	0x4		/* image read/write */
+
+#define CKPT_DDEFAULT	0xffff		/* default debug level */
+
+#ifndef CKPT_DFLAG
+#define CKPT_DFLAG	0xffff		/* everything */
+#endif
+
+#ifdef CONFIG_CHECKPOINT_DEBUG
+extern unsigned long ckpt_debug_level;
+
+/* use this to select a specific debug level */
+#define _ckpt_debug(level, fmt, args...)				\
+	do {								\
+		if (ckpt_debug_level & (level))				\
+			printk(KERN_DEBUG "[%d:%d:c/r:%s:%d] " fmt,	\
+				current->pid, task_pid_vnr(current),	\
+				__func__, __LINE__, ## args);		\
+	} while (0)
+
+/*
+ * CKPT_DBASE is the base flags, doesn't change
+ * CKPT_DFLAG is to be redefined in each source file
+ */
+#define ckpt_debug(fmt, args...)  \
+	_ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)
+
+#else
+
+#define _ckpt_debug(level, fmt, args...)	do { } while (0)
+#define ckpt_debug(fmt, args...)		do { } while (0)
+
+#endif /* CONFIG_CHECKPOINT_DEBUG */
+
+#endif /* CONFIG_CHECKPOINT */
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_CHECKPOINT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..22dadbd
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,111 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/*
+ * header format: 'struct ckpt_hdr' must prefix all other headers. Therefore
+ * when a header is passed around, the information about it (type, size)
+ * is readily available. Structs that include a struct ckpt_hdr are named
+ * struct ckpt_hdr_* by convention (usually the struct ckpt_hdr is the first
+ * member).
+ */
+struct ckpt_hdr {
+	__u32 type;
+	__u32 len;
+} __attribute__((aligned(8)));
+
+/* header types */
+enum {
+	CKPT_HDR_HEADER = 1,
+	CKPT_HDR_BUFFER,
+	CKPT_HDR_STRING,
+
+	CKPT_HDR_TASK = 101,
+
+	CKPT_HDR_TAIL = 9001,
+
+	CKPT_HDR_ERROR = 9999,
+};
+
+/* kernel constants */
+struct ckpt_const {
+	/* task */
+	__u16 task_comm_len;
+	/* uts */
+	__u16 uts_release_len;
+	__u16 uts_version_len;
+	__u16 uts_machine_len;
+} __attribute__((aligned(8)));
+
+/* checkpoint image header */
+struct ckpt_hdr_header {
+	struct ckpt_hdr h;
+	__u64 magic;
+
+	__u16 _padding;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	struct ckpt_const constants;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 uflags;	/* uflags from checkpoint */
+
+	/*
+	 * the header is followed by three strings:
+	 *   char release[const.uts_release_len];
+	 *   char version[const.uts_version_len];
+	 *   char machine[const.uts_machine_len];
+	 */
+} __attribute__((aligned(8)));
+
+
+/* checkpoint image trailer */
+struct ckpt_hdr_tail {
+	struct ckpt_hdr h;
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+
+/* task data */
+struct ckpt_hdr_task {
+	struct ckpt_hdr h;
+	__u32 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+
+	__u64 set_child_tid;
+	__u64 clear_child_tid;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
new file mode 100644
index 0000000..585cb7b
--- /dev/null
+++ b/include/linux/checkpoint_types.h
@@ -0,0 +1,35 @@
+#ifndef _LINUX_CHECKPOINT_TYPES_H_
+#define _LINUX_CHECKPOINT_TYPES_H_
+/*
+ *  Generic checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+
+struct ckpt_ctx {
+	int crid;		/* unique checkpoint id */
+
+	pid_t root_pid;		/* container identifier */
+
+	unsigned long kflags;	/* kernel flags */
+	unsigned long uflags;	/* user flags */
+	unsigned long oflags;	/* restart: uflags from checkpoint */
+
+	struct file *file;	/* input/output file */
+	int total;		/* total read/written */
+
+	struct task_struct *tsk;/* checkpoint: current target task */
+	char err_string[256];	/* checkpoint: error string */
+};
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_CHECKPOINT_TYPES_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 1923327..ff17a59 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -53,4 +53,8 @@
 #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
 
 #define STACK_END_MAGIC		0x57AC6E9D
+
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 12327b2..e1ae6e6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1006,6 +1006,19 @@ config DMA_API_DEBUG
 	  This option causes a performance degredation.  Use only if you want
 	  to debug device drivers. If unsure, say N.
 
+config CHECKPOINT_DEBUG
+	bool "Checkpoint/restart debugging (EXPERIMENTAL)"
+	depends on CHECKPOINT
+	default y
+	help
+	  This option turns on the debugging output of checkpoint/restart.
+	  The level of verbosity is controlled by 'ckpt_debug_level' and can
+	  be set at boot time with "ckpt_debug=" option.
+
+	  Turning this option off will reduce the size of the c/r code. If
+	  turned on, it is unlikely to incur visible overhead if the debug
+	  level is set to zero.
+
 source "samples/Kconfig"
 
 source "lib/Kconfig.kgdb"
-- 
1.6.0.4

* [PATCH v18 20/80] c/r: basic infrastructure for checkpoint/restart
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Add the basic checkpoint/restart interfaces, as well as the helpers
needed to easily manage the file format. The code is roughly broken out
as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
  c/r context (a per-checkpoint data structure for housekeeping)

checkpoint/checkpoint.c - output wrappers and basic checkpoint handling

checkpoint/restart.c - input wrappers and basic restart handling

checkpoint/process.c - c/r of task data
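
To illustrate the file format these helpers manage, here is a minimal
userspace sketch of the record framing: every object starts with a
{type, len} header, and 'len' counts the header plus the payload (as in
ckpt_write_obj_type() and _ckpt_read_obj()). The names rec_hdr, image,
rec_write() and rec_read() are hypothetical and exist only for this
example; the in-memory buffer stands in for the image file, and this is
not the kernel code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Mirrors struct ckpt_hdr: 'len' counts header plus payload. */
struct rec_hdr {
	uint32_t type;
	uint32_t len;
};

enum { REC_STRING = 3 };	/* same value as CKPT_HDR_STRING */

/* Hypothetical in-memory image standing in for the checkpoint file. */
struct image {
	unsigned char buf[256];
	size_t wpos;		/* next write offset */
	size_t rpos;		/* next read offset */
};

/* Append a header followed by 'len' payload bytes. */
static int rec_write(struct image *img, uint32_t type,
		     const void *ptr, uint32_t len)
{
	struct rec_hdr h = { .type = type, .len = len + sizeof(h) };

	assert(ptr || !len);	/* a payload needs a buffer */
	if (img->wpos + h.len > sizeof(img->buf))
		return -ENOSPC;
	memcpy(img->buf + img->wpos, &h, sizeof(h));
	if (len)
		memcpy(img->buf + img->wpos + sizeof(h), ptr, len);
	img->wpos += h.len;
	return 0;
}

/* Read one record of the expected type; return payload length or -errno. */
static int rec_read(struct image *img, uint32_t type, void *ptr, uint32_t max)
{
	struct rec_hdr h;
	uint32_t payload;

	if (img->wpos - img->rpos < sizeof(h))
		return -EPIPE;	/* unexpected end of image */
	memcpy(&h, img->buf + img->rpos, sizeof(h));
	if (h.len < sizeof(h) || h.len > img->wpos - img->rpos)
		return -EINVAL;	/* header shorter than itself, or truncated */
	payload = h.len - sizeof(h);
	if (h.type != type || payload > max)
		return -EINVAL;	/* wrong object type, or too large for caller */
	if (payload)
		memcpy(ptr, img->buf + img->rpos + sizeof(h), payload);
	img->rpos += h.len;
	return (int)payload;
}
```

Writing a 16-byte task name with rec_write(&img, REC_STRING, comm, 16)
and reading it back with rec_read(&img, REC_STRING, out, 16) returns 16.
A reader can skip objects it does not understand by advancing 'len'
bytes, which is why the length covers the header as well.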

For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.


Changelog[v18]:
  - Detect error-headers in input data on restart, and abort.
  - Standard format for checkpoint error strings (and documentation)
  - [Matt Helsley] Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - [Dan Smith] Add an errno validation function
  - Add ckpt_read_payload(): read a variable-length object (no header)
  - Add ckpt_read_string(): same for strings (ensures null-terminated)
  - Add ckpt_read_consume(): consumes next object without processing
Changelog[v17]:
  - Fix compilation for architectures that don't support checkpoint
  - Save/restore t->{set,clear}_child_tid
  - Restart(2) isn't idempotent: must return -EINTR if interrupted
  - ckpt_debug does not depend on DYNAMIC_DEBUG, on by default
  - Export generic checkpoint headers to userspace
  - Fix comment for prototype of sys_restart
  - Have ckpt_debug() print global-pid and __LINE__
  - Only save and test kernel constants once (in header)
Changelog[v16]:
  - Split ctx->flags to ->uflags (user flags) and ->kflags (kernel flags)
  - Introduce __ckpt_write_err() and ckpt_write_err() to report errors
  - Allow @ptr == NULL to write (or read) header only without payload
  - Introduce _ckpt_read_obj_type()
Changelog[v15]:
  - Replace header buffer in ckpt_ctx (hbuf,hpos) with kmalloc/kfree()
Changelog[v14]:
  - Cleanup interface to get/put hdr buffers
  - Merge checkpoint and restart code into a single file (per subsystem)
  - Take uts_sem around access to uts->{release,version,machine}
  - Embed ckpt_hdr in all ckpt_hdr_...., cleanup read/write helpers
  - Define sys_checkpoint(0,...) as asking for a self-checkpoint (Serge)
  - Revert use of 'pr_fmt' to avoid tainting whoever includes us (Nathan Lynch)
  - Explicitly indicate length of UTS fields in header
  - Discard field 'h->parent' from ckpt_hdr
Changelog[v12]:
  - ckpt_kwrite/ckpt_kread() again use vfs_read(), vfs_write() (safer)
  - Split ckpt_write/ckpt_read() to two parts: _ckpt_write/read() helper
  - Befriend sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' instead of using special ckpt_debug()
Changelog[v10]:
  - add ckpt_write_buffer(), ckpt_read_buffer() and ckpt_read_buf_type()
  - force end-of-string in ckpt_read_string() (fix possible DoS)
Changelog[v9]:
  - ckpt_kwrite/ckpt_kread() use file->f_op->write() directly
  - Drop ckpt_uwrite/ckpt_uread() since they aren't used anywhere
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (although it's not really needed)
Changelog[v5]:
  - Rename headers files s/ckpt/checkpoint/
Changelog[v2]:
  - Added utsname->{release,version,machine} to checkpoint header
  - Pad header structures to 64 bits to ensure compatibility

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Makefile                         |    2 +-
 checkpoint/Makefile              |    6 +-
 checkpoint/checkpoint.c          |  369 +++++++++++++++++++++++++++++++++
 checkpoint/process.c             |  102 +++++++++
 checkpoint/restart.c             |  422 ++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c                 |  247 ++++++++++++++++++++++-
 include/linux/Kbuild             |    3 +
 include/linux/checkpoint.h       |  109 ++++++++++
 include/linux/checkpoint_hdr.h   |  111 ++++++++++
 include/linux/checkpoint_types.h |   35 +++
 include/linux/magic.h            |    4 +
 lib/Kconfig.debug                |   13 ++
 12 files changed, 1419 insertions(+), 4 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/process.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h
 create mode 100644 include/linux/checkpoint_types.h

diff --git a/Makefile b/Makefile
index fe45658..2e44d0f 100644
--- a/Makefile
+++ b/Makefile
@@ -639,7 +639,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 8a32c6f..99364cc 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,8 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT) += sys.o
+obj-$(CONFIG_CHECKPOINT) += \
+	sys.o \
+	checkpoint.o \
+	restart.o \
+	process.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..57eb7d8
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,369 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/version.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?) */
+static atomic_t ctx_count = ATOMIC_INIT(0);
+
+/**
+ * ckpt_write_obj - write an object
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ */
+int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
+{
+	_ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+	return ckpt_kwrite(ctx, h, h->len);
+}
+
+/**
+ * ckpt_write_obj_type - write an object (from a pointer)
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ * @type: desired type
+ *
+ * If @ptr is NULL, then write only the header (payload to follow)
+ */
+int ckpt_write_obj_type(struct ckpt_ctx *ctx, void *ptr, int len, int type)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get(ctx, sizeof(*h));
+	if (!h)
+		return -ENOMEM;
+
+	h->type = type;
+	h->len = len + sizeof(*h);
+
+	_ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+	ret = ckpt_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		goto out;
+	if (ptr)
+		ret = ckpt_kwrite(ctx, ptr, len);
+ out:
+	_ckpt_hdr_put(ctx, h, sizeof(*h));
+	return ret;
+}
+
+/**
+ * ckpt_write_buffer - write an object of type buffer
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ */
+int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	return ckpt_write_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+
+/**
+ * ckpt_write_string - write an object of type string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len)
+{
+	return ckpt_write_obj_type(ctx, str, len, CKPT_HDR_STRING);
+}
+
+/*
+ * __ckpt_generate_fmt - generate standard checkpoint error message
+ * @ctx: checkpoint context
+ * @prefmt: pre-format string
+ * @fmt: message format
+ *
+ * This generates a unified format of checkpoint error messages, to
+ * ease (after the failure) inspection by userspace tools. It converts
+ * the (printf) message @fmt into a new format: "[PREFMT]: fmt".
+ *
+ * PREFMT is constructed from @prefmt by substituting format snippets
+ * according to the contents of @prefmt.  The format characters in
+ * @prefmt can be E (error), O (objref), P (pointer), S (string) and
+ * V (variable/symbol). For example, E will generate a "err %d" in
+ * PREFMT (see prefmt_array below).
+ *
+ * If @prefmt begins with T, PREFMT will begin with "pid %d tsk %s"
+ * with the pid and the tsk->comm of the currently checkpointed task.
+ * The latter is taken from ctx->tsk, and it is the responsibility of
+ * the caller to have a valid pointer there (in particular, functions
+ * that iterate on the processes: collect_objects, checkpoint_task,
+ * and tree_count_tasks).
+ *
+ * The caller of ckpt_write_err() and __ckpt_write_err() must provide
+ * the additional variables, in order, to match the @prefmt (except for
+ * the T key), e.g.:
+ *
+ *   ckpt_write_err(ctx, "TEO", "FILE flags %d", err, objref, flags);
+ *
+ * Here, T is simply passed, E expects an integer (err), O expects an
+ * integer (objref), and the last argument matches the format string.
+ */
+static char *__ckpt_generate_fmt(struct ckpt_ctx *ctx, char *prefmt, char *fmt)
+{
+	static int warn_notask = 0;
+	static int warn_prefmt = 0;
+	char *format;
+	int i, j, len = 0;
+
+	static struct {
+		char key;
+		char *fmt;
+	} prefmt_array[] = {
+		{ 'E', "err %d" },
+		{ 'O', "obj %d" },
+		{ 'P', "ptr %p" },
+		{ 'V', "sym %pS" },
+		{ 'S', "str %s" },
+		{ 0, "??? %pS" },
+	};
+
+	/*
+	 * 17 for "pid %d" (plus space)
+	 * 21 for "tsk %s" (tsk->comm)
+	 * up to 8 per varfmt entry
+	 */
+	format = kzalloc(37 + 8 * strlen(prefmt) + strlen(fmt), GFP_KERNEL);
+	if (!format)
+		return NULL;
+
+	format[len++] = '[';
+
+	if (prefmt[0] == 'T') {
+		if (ctx->tsk)
+			len += sprintf(&format[len], "pid %d tsk %s ",
+				      task_pid_vnr(ctx->tsk), ctx->tsk->comm);
+		else if (warn_notask++ < 5)
+			printk(KERN_ERR "c/r: no target task set\n");
+		prefmt++;
+	}
+
+	for (i = 0; i < strlen(prefmt); i++) {
+		for (j = 0; prefmt_array[j].key; j++)
+			if (prefmt_array[j].key == prefmt[i])
+				break;
+		if (!prefmt_array[j].key && warn_prefmt++ < 5)
+			printk(KERN_ERR "c/r: unknown prefmt %c\n", prefmt[i]);
+		len += sprintf(&format[len], "%s ", prefmt_array[j].fmt);
+	}
+
+	if (len > 1)
+		sprintf(&format[len-1], "]: %s", fmt);  /* erase last space */
+	else
+		sprintf(format, "%s", fmt);
+
+	return format;
+}
+
+/* see __ckpt_generate_fmt for information on @prefmt */
+static void __ckpt_generate_err(struct ckpt_ctx *ctx, char *prefmt,
+				char *fmt, va_list ap)
+{
+	va_list aq;
+	char *format;
+	char *str;
+	int len;
+
+	format = __ckpt_generate_fmt(ctx, prefmt, fmt);
+	va_copy(aq, ap);
+
+	/*
+	 * prefix the error string with a '\0' to facilitate easy
+	 * backtrace to the beginning of the error message without
+	 * needing to parse the entire checkpoint image.
+	 */
+	ctx->err_string[0] = '\0';
+	str = &ctx->err_string[1];
+	len = vsnprintf(str, 255, format ? : fmt, ap) + 2;
+
+	if (len > 256) {
+		printk(KERN_NOTICE "c/r: error string truncated: ");
+		vprintk(format ? : fmt, aq);
+	}
+
+	va_end(aq);
+	kfree(format);
+
+	ckpt_debug("c/r: checkpoint error: %s\n", str);
+}
+
+/**
+ * __ckpt_write_err - save an error string on the ctx->err_string
+ * @ctx: checkpoint context
+ * @prefmt: error pre-format
+ * @fmt: message format
+ * @...: arguments
+ *
+ * See __ckpt_generate_fmt for information on @prefmt.
+ * Use this during checkpoint to report while holding a spinlock
+ */
+void __ckpt_write_err(struct ckpt_ctx *ctx, char *prefmt, char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	__ckpt_generate_err(ctx, prefmt, fmt, ap);
+	va_end(ap);
+}
+
+/**
+ * ckpt_write_err - write an object describing an error
+ * @ctx: checkpoint context
+ * @pre: string pre-format
+ * @fmt: error string format
+ * @...: error string arguments
+ *
+ * See __ckpt_generate_fmt for information on @pre.
+ * If @fmt is null, the string in the ctx->err_string will be used (and freed)
+ */
+int ckpt_write_err(struct ckpt_ctx *ctx, char *pre, char *fmt, ...)
+{
+	va_list ap;
+	char *str;
+	int len, ret = 0;
+
+	if (fmt) {
+		va_start(ap, fmt);
+		__ckpt_generate_err(ctx, pre, fmt, ap);
+		va_end(ap);
+	}
+
+	str = ctx->err_string;
+	len = strlen(str + 1) + 2;	/* leading and trailing '\0' */
+
+	if (len == 2)	/* empty error string */
+		return 0;
+
+	ret = ckpt_write_obj_type(ctx, NULL, 0, CKPT_HDR_ERROR);
+	if (!ret)
+		ret = ckpt_write_string(ctx, str, len);
+	if (ret < 0)
+		printk(KERN_NOTICE "c/r: error string unsaved (%d): %s\n",
+		       ret, str + 1);
+
+	str[1] = '\0';
+	return ret;
+}
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+static void fill_kernel_const(struct ckpt_const *h)
+{
+	struct task_struct *tsk;
+	struct new_utsname *uts;
+
+	/* task */
+	h->task_comm_len = sizeof(tsk->comm);
+	/* uts */
+	h->uts_release_len = sizeof(uts->release);
+	h->uts_version_len = sizeof(uts->version);
+	h->uts_machine_len = sizeof(uts->machine);
+}
+
+/* write the checkpoint header */
+static int checkpoint_write_header(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header *h;
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+	if (!h)
+		return -ENOMEM;
+
+	do_gettimeofday(&ktv);
+	uts = utsname();
+
+	h->magic = CHECKPOINT_MAGIC_HEAD;
+	h->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	h->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	h->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	h->rev = CHECKPOINT_VERSION;
+
+	h->uflags = ctx->uflags;
+	h->time = ktv.tv_sec;
+
+	fill_kernel_const(&h->constants);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	down_read(&uts_sem);
+	ret = ckpt_write_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret < 0)
+		goto up;
+	ret = ckpt_write_buffer(ctx, uts->version, sizeof(uts->version));
+	if (ret < 0)
+		goto up;
+	ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine));
+ up:
+	up_read(&uts_sem);
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int checkpoint_write_tail(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tail *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+	if (!h)
+		return -ENOMEM;
+
+	h->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
+{
+	long ret;
+
+	ret = checkpoint_write_header(ctx);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, return (unique) checkpoint identifier */
+	ctx->crid = atomic_inc_return(&ctx_count);
+	ret = ctx->crid;
+ out:
+	return ret;
+}
diff --git a/checkpoint/process.c b/checkpoint/process.c
new file mode 100644
index 0000000..d221c2a
--- /dev/null
+++ b/checkpoint/process.c
@@ -0,0 +1,102 @@
+/*
+ *  Checkpoint task structure
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/sched.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+/* dump the task_struct of a given task */
+static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+	if (!h)
+		return -ENOMEM;
+
+	h->state = t->state;
+	h->exit_state = t->exit_state;
+	h->exit_code = t->exit_code;
+	h->exit_signal = t->exit_signal;
+
+	h->set_child_tid = (unsigned long) t->set_child_tid;
+	h->clear_child_tid = (unsigned long) t->clear_child_tid;
+
+	/* FIXME: save remaining relevant task_struct fields */
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	int ret;
+
+	ctx->tsk = t;
+
+	ret = checkpoint_task_struct(ctx, t);
+	ckpt_debug("task %d\n", ret);
+
+	ctx->tsk = NULL;
+	return ret;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+/* read the task_struct into the current task */
+static int restore_task_struct(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task *h;
+	struct task_struct *t = current;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	memset(t->comm, 0, TASK_COMM_LEN);
+	ret = _ckpt_read_string(ctx, t->comm, TASK_COMM_LEN);
+	if (ret < 0)
+		goto out;
+
+	t->set_child_tid = (int __user *) (unsigned long) h->set_child_tid;
+	t->clear_child_tid = (int __user *) (unsigned long) h->clear_child_tid;
+
+	/* FIXME: restore remaining relevant task_struct fields */
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* read the entire state of the current task */
+int restore_task(struct ckpt_ctx *ctx)
+{
+	int ret;
+
+	ret = restore_task_struct(ctx);
+	ckpt_debug("task %d\n", ret);
+
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..9f08f4d
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,422 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/utsname.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+static int _ckpt_read_err(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
+{
+	char *ptr;
+	int len, ret;
+
+	len = h->len - sizeof(*h);
+	ptr = kzalloc(len + 1, GFP_KERNEL);
+	if (!ptr) {
+		ckpt_debug("insufficient memory to report image error\n");
+		return -ENOMEM;
+	}
+
+	ret = ckpt_kread(ctx, ptr, len);
+	if (ret >= 0) {
+		ckpt_debug("%s\n", &ptr[1]);
+		ret = -EIO;
+	}
+
+	kfree(ptr);
+	return ret;
+}
+
+/**
+ * _ckpt_read_obj - read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: desired ckpt_hdr
+ * @ptr: desired buffer
+ * @len: desired payload length (if 0, flexible)
+ * @max: maximum payload length (if 0, flexible)
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
+			  void *ptr, int len, int max)
+{
+	int ret;
+
+ again:
+	ret = ckpt_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	_ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
+		    h->type, h->len, len, max);
+	if (h->len < sizeof(*h))
+		return -EINVAL;
+
+	if (h->type == CKPT_HDR_ERROR) {
+		ret = _ckpt_read_err(ctx, h);
+		if (ret < 0)
+			return ret;
+		goto again;
+	}
+
+	/* if len specified, enforce, else if maximum specified, enforce */
+	if ((len && h->len != len) || (!len && max && h->len > max))
+		return -EINVAL;
+
+	if (ptr)
+		ret = ckpt_kread(ctx, ptr, h->len - sizeof(struct ckpt_hdr));
+	return ret;
+}
+
+/**
+ * _ckpt_read_obj_type - read an object of some type
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ * @type: buffer type
+ *
+ * If @ptr is NULL, then read only the header (payload to follow).
+ * @len specifies the expected buffer length (ignored if set to 0).
+ * Returns: actual _payload_ length
+ */
+int _ckpt_read_obj_type(struct ckpt_ctx *ctx, void *ptr, int len, int type)
+{
+	struct ckpt_hdr h;
+	int ret;
+
+	if (len)
+		len += sizeof(struct ckpt_hdr);
+	ret = _ckpt_read_obj(ctx, &h, ptr, len, len);
+	if (ret < 0)
+		return ret;
+	if (h.type != type)
+		return -EINVAL;
+	return h.len - sizeof(h);
+}
+
+/**
+ * _ckpt_read_buffer - read an object of type buffer (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ *
+ * If @ptr is NULL, then read only the header (payload to follow).
+ * @len specifies the expected buffer length (ignored if set to 0).
+ * Returns: _payload_ length.
+ */
+int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	BUG_ON(!len);
+	return _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+
+/**
+ * _ckpt_read_string - read an object of type string (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: string length (including '\0')
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	int ret;
+
+	BUG_ON(!len);
+	ret = _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_STRING);
+	if (ret < 0)
+		return ret;
+	if (ptr)
+		((char *) ptr)[len - 1] = '\0';	/* always play it safe */
+	return 0;
+}
+
+/**
+ * ckpt_read_obj - allocate and read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ * @len: desired payload length (if 0, flexible)
+ * @max: maximum payload length
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
+{
+	struct ckpt_hdr hh;
+	struct ckpt_hdr *h;
+	int ret;
+
+	ret = ckpt_kread(ctx, &hh, sizeof(hh));
+	if (ret < 0)
+		return ERR_PTR(ret);
+	_ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
+		    hh.type, hh.len, len, max);
+	if (hh.len < sizeof(*h))
+		return ERR_PTR(-EINVAL);
+	/* if len specified, enforce, else if maximum specified, enforce */
+	if ((len && hh.len != len) || (!len && max && hh.len > max))
+		return ERR_PTR(-EINVAL);
+
+	h = ckpt_hdr_get(ctx, hh.len);
+	if (!h)
+		return ERR_PTR(-ENOMEM);
+
+	*h = hh;	/* yay ! */
+
+	ret = ckpt_kread(ctx, (h + 1), hh.len - sizeof(struct ckpt_hdr));
+	if (ret < 0) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(ret);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_obj_type - allocate and read an object of some type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	BUG_ON(!len);
+
+	h = ckpt_read_obj(ctx, len, len);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->type != type) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_buf_type - allocate and read an object of some type (flexible)
+ * @ctx: checkpoint context
+ * @max: maximum object length
+ * @type: desired object type
+ *
+ * This differs from ckpt_read_obj_type() in that the length of the
+ * incoming object is flexible (up to the maximum specified by @max;
+ * unlimited if @max is 0), as determined by the ckpt_hdr data.
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int max, int type)
+{
+	struct ckpt_hdr *h;
+
+	h = ckpt_read_obj(ctx, 0, max);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->type != type) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_payload - allocate and read the payload of an object
+ * @ctx: checkpoint context
+ * @max: maximum payload length
+ * @ptr: pointer to buffer to be allocated (caller must free)
+ * @type: desired object type
+ *
+ * This can be used to read a variable-length _payload_ from the checkpoint
+ * stream. @max limits the size of the resulting buffer.
+ *
+ * Return: actual _payload_ length
+ */
+int ckpt_read_payload(struct ckpt_ctx *ctx, void **ptr, int max, int type)
+{
+	int len, ret;
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, type);
+	if (len < 0)
+		return len;
+	else if (len > max)
+		return -EINVAL;
+
+	*ptr = kmalloc(len, GFP_KERNEL);
+	if (!*ptr)
+		return -ENOMEM;
+
+	ret = ckpt_kread(ctx, *ptr, len);
+	if (ret < 0) {
+		kfree(*ptr);
+		return ret;
+	}
+
+	return len;
+}
+
+/**
+ * ckpt_read_string - allocate and read a string (variable length)
+ * @ctx: checkpoint context
+ * @max: maximum acceptable length
+ *
+ * Return: allocated string or error pointer
+ */
+char *ckpt_read_string(struct ckpt_ctx *ctx, int max)
+{
+	char *str;
+	int len;
+
+	len = ckpt_read_payload(ctx, (void **)&str, max, CKPT_HDR_STRING);
+	if (len < 0)
+		return ERR_PTR(len);
+	str[len - 1] = '\0';  	/* always play it safe */
+	return str;
+}
+
+/**
+ * ckpt_read_consume - consume the next object of expected type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * This can be used to skip an object in the input stream when the
+ * data is unnecessary for the restart. @len indicates the length of
+ * the object; if @len is zero the length is unconstrained.
+ */
+int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+	int ret = 0;
+
+	h = ckpt_read_obj(ctx, len, 0);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	if (h->type != type)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+static int check_kernel_const(struct ckpt_const *h)
+{
+	struct task_struct *tsk;
+	struct new_utsname *uts;
+
+	/* task */
+	if (h->task_comm_len != sizeof(tsk->comm))
+		return -EINVAL;
+	/* uts */
+	if (h->uts_release_len != sizeof(uts->release))
+		return -EINVAL;
+	if (h->uts_version_len != sizeof(uts->version))
+		return -EINVAL;
+	if (h->uts_machine_len != sizeof(uts->machine))
+		return -EINVAL;
+
+	return 0;
+}
+
+/* read the checkpoint header */
+static int restore_read_header(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header *h;
+	struct new_utsname *uts = NULL;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->magic != CHECKPOINT_MAGIC_HEAD ||
+	    h->rev != CHECKPOINT_VERSION ||
+	    h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    h->patch != ((LINUX_VERSION_CODE) & 0xff))
+		goto out;
+	if (h->uflags)
+		goto out;
+
+	ret = check_kernel_const(&h->constants);
+	if (ret < 0)
+		goto out;
+
+	ret = -ENOMEM;
+	uts = kmalloc(sizeof(*uts), GFP_KERNEL);
+	if (!uts)
+		goto out;
+
+	ctx->oflags = h->uflags;
+
+	/* FIX: verify compatibility of release, version and machine */
+	ret = _ckpt_read_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret < 0)
+		goto out;
+	ret = _ckpt_read_buffer(ctx, uts->version, sizeof(uts->version));
+	if (ret < 0)
+		goto out;
+	ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine));
+ out:
+	kfree(uts);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* read the checkpoint trailer */
+static int restore_read_tail(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tail *h;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	if (h->magic != CHECKPOINT_MAGIC_TAIL)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+long do_restart(struct ckpt_ctx *ctx, pid_t pid)
+{
+	long ret;
+
+	ret = restore_read_header(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_task(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_read_tail(ctx);
+
+	/* on success, adjust the return value if needed [TODO] */
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 79936cc..7f6f71e 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -8,9 +8,192 @@
  *  distribution for more details.
  */
 
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
 #include <linux/sched.h>
 #include <linux/kernel.h>
 #include <linux/syscalls.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   ckpt_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		uaddr += nwrite;
+	}
+	return 0;
+}
+
+int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kwrite(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+static inline int _ckpt_kread(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE;		/* unexpected EOF */
+			return nread;
+		}
+		uaddr += nread;
+	}
+	return 0;
+}
+
+int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kread(ctx->file , addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+/**
+ * ckpt_hdr_get - get a hdr of certain size
+ * @ctx: checkpoint context
+ * @len: desired length
+ *
+ * Returns pointer to header
+ */
+void *ckpt_hdr_get(struct ckpt_ctx *ctx, int len)
+{
+	return kzalloc(len, GFP_KERNEL);
+}
+
+/**
+ * _ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ * @len: header length
+ *
+ * (requiring 'ptr' makes it easily interchangeable with kmalloc/kfree)
+ */
+void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	kfree(ptr);
+}
+
+/**
+ * ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ *
+ * It is assumed that @ptr begins with a 'struct ckpt_hdr'.
+ */
+void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct ckpt_hdr *h = (struct ckpt_hdr *) ptr;
+	_ckpt_hdr_put(ctx, ptr, h->len);
+}
+
+/**
+ * ckpt_hdr_get_type - get a hdr of certain size and type
+ * @ctx: checkpoint context
+ * @len: number of bytes to allocate (also stored in the header)
+ *
+ * Returns pointer to the new header (type and len set), or NULL
+ */
+void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	h = ckpt_hdr_get(ctx, len);
+	if (!h)
+		return NULL;
+
+	h->type = type;
+	h->len = len;
+	return h;
+}
+
+
+/*
+ * Helpers to manage c/r contexts: a context is allocated for each
+ * checkpoint or restart operation, and persists until it completes.
+ */
+
+static void ckpt_ctx_free(struct ckpt_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+	kfree(ctx);
+}
+
+static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
+				       unsigned long kflags)
+{
+	struct ckpt_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->uflags = uflags;
+	ctx->kflags = kflags;
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+
+	return ctx;
+ err:
+	ckpt_ctx_free(ctx);
+	return ERR_PTR(err);
+}
 
 /**
  * sys_checkpoint - checkpoint a container
@@ -23,7 +206,26 @@
  */
 SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
 {
-	return -ENOSYS;
+	struct ckpt_ctx *ctx;
+	long ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	if (pid == 0)
+		pid = task_pid_vnr(current);
+	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_CHECKPOINT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx, pid);
+
+	if (!ret)
+		ret = ctx->crid;
+
+	ckpt_ctx_free(ctx);
+	return ret;
 }
 
 /**
@@ -37,5 +239,46 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
  */
 SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 {
-	return -ENOSYS;
+	struct ckpt_ctx *ctx = NULL;
+	long ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_restart(ctx, pid);
+
+	/* restart(2) isn't idempotent: can't restart syscall */
+	if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+	    ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
+		ret = -EINTR;
+
+	ckpt_ctx_free(ctx);
+	return ret;
+}
+
+
+/* 'ckpt_debug_level' controls the verbosity level of c/r code */
+#ifdef CONFIG_CHECKPOINT_DEBUG
+
+/* FIX: allow to change during runtime */
+unsigned long __read_mostly ckpt_debug_level = CKPT_DDEFAULT;
+
+static __init int ckpt_debug_setup(char *s)
+{
+	long val, ret;
+
+	ret = strict_strtoul(s, 10, &val);
+	if (ret < 0)
+		return ret;
+	ckpt_debug_level = val;
+	return 0;
 }
+
+__setup("ckpt_debug=", ckpt_debug_setup);
+
+#endif /* CONFIG_CHECKPOINT_DEBUG */
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 334a359..3e8bd18 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -44,6 +44,9 @@ header-y += bpqether.h
 header-y += bsg.h
 header-y += can.h
 header-y += cdk.h
+header-y += checkpoint.h
+header-y += checkpoint_hdr.h
+header-y += checkpoint_types.h
 header-y += chio.h
 header-y += coda_psdev.h
 header-y += coff.h
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..be0ba4b
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,109 @@
+#ifndef _LINUX_CHECKPOINT_H_
+#define _LINUX_CHECKPOINT_H_
+/*
+ *  Generic checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define CHECKPOINT_VERSION  2
+
+#ifdef __KERNEL__
+#ifdef CONFIG_CHECKPOINT
+
+#include <linux/checkpoint_types.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/err.h>
+
+/* ckpt_ctx: kflags */
+#define CKPT_CTX_CHECKPOINT_BIT		0
+#define CKPT_CTX_RESTART_BIT		1
+
+#define CKPT_CTX_CHECKPOINT	(1 << CKPT_CTX_CHECKPOINT_BIT)
+#define CKPT_CTX_RESTART	(1 << CKPT_CTX_RESTART_BIT)
+
+
+extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
+extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
+
+extern void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int n);
+extern void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr);
+extern void *ckpt_hdr_get(struct ckpt_ctx *ctx, int n);
+extern void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int n, int type);
+
+extern int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h);
+extern int ckpt_write_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, int len, int type);
+extern int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len);
+extern void __ckpt_write_err(struct ckpt_ctx *ctx, char *ptr, char *fmt, ...);
+extern int ckpt_write_err(struct ckpt_ctx *ctx, char *ptr, char *fmt, ...);
+
+extern int _ckpt_read_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, int len, int type);
+extern int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len);
+extern void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type);
+extern void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type);
+extern int ckpt_read_payload(struct ckpt_ctx *ctx,
+			     void **ptr, int max, int type);
+extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
+extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
+
+extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
+extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
+
+/* task */
+extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_task(struct ckpt_ctx *ctx);
+
+static inline int ckpt_validate_errno(int errno)
+{
+	return (errno >= 0) && (errno < MAX_ERRNO);
+}
+
+/* debugging flags */
+#define CKPT_DBASE	0x1		/* anything */
+#define CKPT_DSYS	0x2		/* generic (system) */
+#define CKPT_DRW	0x4		/* image read/write */
+
+#define CKPT_DDEFAULT	0xffff		/* default debug level */
+
+#ifndef CKPT_DFLAG
+#define CKPT_DFLAG	0xffff		/* everything */
+#endif
+
+#ifdef CONFIG_CHECKPOINT_DEBUG
+extern unsigned long ckpt_debug_level;
+
+/* use this to select a specific debug level */
+#define _ckpt_debug(level, fmt, args...)				\
+	do {								\
+		if (ckpt_debug_level & (level))				\
+			printk(KERN_DEBUG "[%d:%d:c/r:%s:%d] " fmt,	\
+				current->pid, task_pid_vnr(current),	\
+				__func__, __LINE__, ## args);		\
+	} while (0)
+
+/*
+ * CKPT_DBASE is the base flag, doesn't change
+ * CKPT_DFLAG is to be redefined in each source file
+ */
+#define ckpt_debug(fmt, args...)  \
+	_ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)
+
+#else
+
+#define _ckpt_debug(level, fmt, args...)	do { } while (0)
+#define ckpt_debug(fmt, args...)		do { } while (0)
+
+#endif /* CONFIG_CHECKPOINT_DEBUG */
+
+#endif /* CONFIG_CHECKPOINT */
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_CHECKPOINT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..22dadbd
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,111 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/*
+ * header format: 'struct ckpt_hdr' must prefix all other headers. Therefore
+ * when a header is passed around, the information about it (type, size)
+ * is readily available. Structs that include a struct ckpt_hdr are named
+ * struct ckpt_hdr_* by convention (usually the struct ckpt_hdr is the first
+ * member).
+ */
+struct ckpt_hdr {
+	__u32 type;
+	__u32 len;
+} __attribute__((aligned(8)));
+
+/* header types */
+enum {
+	CKPT_HDR_HEADER = 1,
+	CKPT_HDR_BUFFER,
+	CKPT_HDR_STRING,
+
+	CKPT_HDR_TASK = 101,
+
+	CKPT_HDR_TAIL = 9001,
+
+	CKPT_HDR_ERROR = 9999,
+};
+
+/* kernel constants */
+struct ckpt_const {
+	/* task */
+	__u16 task_comm_len;
+	/* uts */
+	__u16 uts_release_len;
+	__u16 uts_version_len;
+	__u16 uts_machine_len;
+} __attribute__((aligned(8)));
+
+/* checkpoint image header */
+struct ckpt_hdr_header {
+	struct ckpt_hdr h;
+	__u64 magic;
+
+	__u16 _padding;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	struct ckpt_const constants;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 uflags;	/* uflags from checkpoint */
+
+	/*
+	 * the header is followed by three strings:
+	 *   char release[const.uts_release_len];
+	 *   char version[const.uts_version_len];
+	 *   char machine[const.uts_machine_len];
+	 */
+} __attribute__((aligned(8)));
+
+
+/* checkpoint image trailer */
+struct ckpt_hdr_tail {
+	struct ckpt_hdr h;
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+
+/* task data */
+struct ckpt_hdr_task {
+	struct ckpt_hdr h;
+	__u32 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+
+	__u64 set_child_tid;
+	__u64 clear_child_tid;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
new file mode 100644
index 0000000..585cb7b
--- /dev/null
+++ b/include/linux/checkpoint_types.h
@@ -0,0 +1,35 @@
+#ifndef _LINUX_CHECKPOINT_TYPES_H_
+#define _LINUX_CHECKPOINT_TYPES_H_
+/*
+ *  Generic checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+
+struct ckpt_ctx {
+	int crid;		/* unique checkpoint id */
+
+	pid_t root_pid;		/* container identifier */
+
+	unsigned long kflags;	/* kernel flags */
+	unsigned long uflags;	/* user flags */
+	unsigned long oflags;	/* restart: uflags from checkpoint */
+
+	struct file *file;	/* input/output file */
+	int total;		/* total read/written */
+
+	struct task_struct *tsk;/* checkpoint: current target task */
+	char err_string[256];	/* checkpoint: error string */
+};
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_CHECKPOINT_TYPES_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 1923327..ff17a59 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -53,4 +53,8 @@
 #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
 
 #define STACK_END_MAGIC		0x57AC6E9D
+
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 12327b2..e1ae6e6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1006,6 +1006,19 @@ config DMA_API_DEBUG
 	  This option causes a performance degredation.  Use only if you want
 	  to debug device drivers. If unsure, say N.
 
+config CHECKPOINT_DEBUG
+	bool "Checkpoint/restart debugging (EXPERIMENTAL)"
+	depends on CHECKPOINT
+	default y
+	help
+	  This option turns on the debugging output of checkpoint/restart.
+	  The level of verbosity is controlled by 'ckpt_debug_level' and can
+	  be set at boot time with the "ckpt_debug=" option.
+
+	  Turning this option off will reduce the size of the c/r code. If
+	  turned on, it is unlikely to incur visible overhead if the debug
+	  level is set to zero.
+
 source "samples/Kconfig"
 
 source "lib/Kconfig.kgdb"
-- 
1.6.0.4
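
The read loop in _ckpt_kread() above keeps calling vfs_read() until the
requested count has arrived, retries on -EAGAIN, and turns a premature
end of file into -EPIPE. The same pattern can be sketched in userspace
with read(2). This is only an illustration, not code from the patch:
read_full() is a hypothetical name, and it reports errors through errno
as read(2) does rather than through a negative return from the read call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read exactly 'count' bytes from 'fd', in the spirit of _ckpt_kread():
 * keep going after short reads, retry on EAGAIN, and report an
 * unexpected end of file as -EPIPE.
 * Returns 0 on success or a negative errno value.
 */
static int read_full(int fd, void *addr, size_t count)
{
	char *p = addr;
	size_t nleft = count;

	assert(addr != NULL || count == 0);

	while (nleft) {
		ssize_t nread = read(fd, p, nleft);

		if (nread < 0) {
			if (errno == EAGAIN)
				continue;	/* nothing yet, try again */
			return -errno;
		}
		if (nread == 0)
			return -EPIPE;		/* unexpected EOF */
		p += nread;
		nleft -= (size_t)nread;
	}
	return 0;
}
```

As in the kernel helper, the caller gets either the whole buffer or an
error, so the object readers never need to handle a partial header.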



* [PATCH v18 20/80] c/r: basic infrastructure for checkpoint/restart
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Implement the checkpoint and restart system calls (previously stubs
returning -ENOSYS), as well as the helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
  c/r context (a per-checkpoint data structure for housekeeping)

checkpoint/checkpoint.c - output wrappers and basic checkpoint handling

checkpoint/restart.c - input wrappers and basic restart handling

checkpoint/process.c - c/r of task data
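
To illustrate the file format these wrappers manage: every object is a
'struct ckpt_hdr' (type, total length including the header) followed by
its payload. The sketch below is a userspace model of that framing, not
the kernel code. 'struct image', put_obj() and get_obj() are
hypothetical names, and an in-memory array stands in for the image file.
Unlike the kernel reader, it checks the type before consuming the object.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct ckpt_hdr: len covers header plus payload. */
struct hdr {
	uint32_t type;
	uint32_t len;
};

/* Values of CKPT_HDR_BUFFER and CKPT_HDR_STRING in the patch. */
enum { HDR_BUFFER = 2, HDR_STRING = 3 };

/* Hypothetical in-memory stand-in for the checkpoint image file. */
struct image {
	unsigned char data[256];
	size_t wpos;	/* next byte to write */
	size_t rpos;	/* next byte to read */
};

/* Append one object: header first, then payload (cf. ckpt_write_obj_type). */
static int put_obj(struct image *img, const void *payload,
		   uint32_t len, uint32_t type)
{
	struct hdr h = { .type = type, .len = len + (uint32_t)sizeof(h) };

	if (sizeof(img->data) - img->wpos < h.len)
		return -ENOSPC;
	memcpy(img->data + img->wpos, &h, sizeof(h));
	memcpy(img->data + img->wpos + sizeof(h), payload, len);
	img->wpos += h.len;
	return 0;
}

/*
 * Read one object of the expected type and exact payload length
 * (cf. _ckpt_read_obj_type). Returns the payload length or -errno.
 */
static int get_obj(struct image *img, void *buf, uint32_t len, uint32_t type)
{
	struct hdr h;

	assert(img->rpos <= img->wpos);
	if (img->wpos - img->rpos < sizeof(h))
		return -EPIPE;		/* unexpected end of image */
	memcpy(&h, img->data + img->rpos, sizeof(h));
	if (h.len < sizeof(h) || h.len != len + sizeof(h))
		return -EINVAL;		/* malformed or unexpected length */
	if (h.type != type)
		return -EINVAL;		/* not the object we expected */
	if (img->wpos - img->rpos < h.len)
		return -EPIPE;
	memcpy(buf, img->data + img->rpos + sizeof(h), len);
	img->rpos += h.len;
	return (int)len;
}
```

Writing the 7-byte string "screen" (with its '\0') as HDR_STRING adds 15
bytes to the image: 8 for the header and 7 for the payload. Reading it
back with any other type or length fails with -EINVAL.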

For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.

Per-architecture support and the actual memory checkpoint work follow
in subsequent patches.


Changelog[v18]:
  - Detect error-headers in input data on restart, and abort.
  - Standard format for checkpoint error strings (and documentation)
  - [Matt Helsley] Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - [Dan Smith] Add an errno validation function
  - Add ckpt_read_payload(): read a variable-length object (no header)
  - Add ckpt_read_string(): same for strings (ensures null-terminated)
  - Add ckpt_read_consume(): consumes next object without processing
Changelog[v17]:
  - Fix compilation for architectures that don't support checkpoint
  - Save/restore t->{set,clear}_child_tid
  - Restart(2) isn't idempotent: must return -EINTR if interrupted
  - ckpt_debug does not depend on DYNAMIC_DEBUG, on by default
  - Export generic checkpoint headers to userspace
  - Fix comment for prototype of sys_restart
  - Have ckpt_debug() print global-pid and __LINE__
  - Only save and test kernel constants once (in header)
Changelog[v16]:
  - Split ctx->flags to ->uflags (user flags) and ->kflags (kernel flags)
  - Introduce __ckpt_write_err() and ckpt_write_err() to report errors
  - Allow @ptr == NULL to write (or read) header only without payload
  - Introduce _ckpt_read_obj_type()
Changelog[v15]:
  - Replace header buffer in ckpt_ctx (hbuf,hpos) with kmalloc/kfree()
Changelog[v14]:
  - Cleanup interface to get/put hdr buffers
  - Merge checkpoint and restart code into a single file (per subsystem)
  - Take uts_sem around access to uts->{release,version,machine}
  - Embed ckpt_hdr in all ckpt_hdr_...., cleanup read/write helpers
  - Define sys_checkpoint(0,...) as asking for a self-checkpoint (Serge)
  - Revert use of 'pr_fmt' to avoid tainting whoever includes us (Nathan Lynch)
  - Explicitly indicate length of UTS fields in header
  - Discard field 'h->parent' from ckpt_hdr
Changelog[v12]:
  - ckpt_kwrite/ckpt_kread() again use vfs_read(), vfs_write() (safer)
  - Split ckpt_write/ckpt_read() to two parts: _ckpt_write/read() helper
  - Befriend sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' instead of using special ckpt_debug()
Changelog[v10]:
  - add ckpt_write_buffer(), ckpt_read_buffer() and ckpt_read_buf_type()
  - force end-of-string in ckpt_read_string() (fix possible DoS)
Changelog[v9]:
  - ckpt_kwrite/ckpt_kread() use file->f_op->write() directly
  - Drop ckpt_uwrite/ckpt_uread() since they aren't used anywhere
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (although it's not really needed)
Changelog[v5]:
  - Rename headers files s/ckpt/checkpoint/
Changelog[v2]:
  - Added utsname->{release,version,machine} to checkpoint header
  - Pad header structures to 64 bits to ensure compatibility
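
The effect of the 64-bit padding rule can be checked at compile time.
The sketch below is a userspace illustration assuming GCC or Clang and
C11. 'odd_plain', 'odd_aligned' and 'task_rec' are hypothetical names;
'task_rec' mirrors the members of struct ckpt_hdr_task from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same members and attribute as struct ckpt_hdr. */
struct hdr {
	uint32_t type;
	uint32_t len;
} __attribute__((aligned(8)));

/* An odd number of 32-bit members: 12 bytes of data. */
struct odd_plain {
	uint32_t a, b, c;
};

/* The same members, with the size rounded up to a multiple of 8. */
struct odd_aligned {
	uint32_t a, b, c;
} __attribute__((aligned(8)));

/* Mirror of struct ckpt_hdr_task: 64-bit members sit at 8-byte offsets. */
struct task_rec {
	struct hdr h;
	uint32_t state;
	uint32_t exit_state;
	uint32_t exit_code;
	uint32_t exit_signal;
	uint64_t set_child_tid;
	uint64_t clear_child_tid;
} __attribute__((aligned(8)));

static_assert(sizeof(struct hdr) == 8, "header is 8 bytes");
static_assert(sizeof(struct odd_plain) == 12, "no tail padding");
static_assert(sizeof(struct odd_aligned) == 16, "padded to 8-byte multiple");
static_assert(sizeof(struct task_rec) == 40, "same size on 32/64-bit");
static_assert(offsetof(struct task_rec, set_child_tid) == 24,
	      "64-bit member naturally aligned");
```

Because the 64-bit members already fall on 8-byte offsets and the whole
structure is 8-byte aligned, the layout does not depend on whether the
ABI aligns 64-bit integers to 4 or to 8 bytes.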

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 Makefile                         |    2 +-
 checkpoint/Makefile              |    6 +-
 checkpoint/checkpoint.c          |  369 +++++++++++++++++++++++++++++++++
 checkpoint/process.c             |  102 +++++++++
 checkpoint/restart.c             |  422 ++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c                 |  247 ++++++++++++++++++++++-
 include/linux/Kbuild             |    3 +
 include/linux/checkpoint.h       |  109 ++++++++++
 include/linux/checkpoint_hdr.h   |  111 ++++++++++
 include/linux/checkpoint_types.h |   35 +++
 include/linux/magic.h            |    4 +
 lib/Kconfig.debug                |   13 ++
 12 files changed, 1419 insertions(+), 4 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/process.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h
 create mode 100644 include/linux/checkpoint_types.h

diff --git a/Makefile b/Makefile
index fe45658..2e44d0f 100644
--- a/Makefile
+++ b/Makefile
@@ -639,7 +639,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 8a32c6f..99364cc 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,8 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT) += sys.o
+obj-$(CONFIG_CHECKPOINT) += \
+	sys.o \
+	checkpoint.o \
+	restart.o \
+	process.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..57eb7d8
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,369 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/version.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?) */
+static atomic_t ctx_count = ATOMIC_INIT(0);
+
+/**
+ * ckpt_write_obj - write an object
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ */
+int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
+{
+	_ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+	return ckpt_kwrite(ctx, h, h->len);
+}
+
+/**
+ * ckpt_write_obj_type - write an object (from a pointer)
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ * @type: desired type
+ *
+ * If @ptr is NULL, then write only the header (payload to follow)
+ */
+int ckpt_write_obj_type(struct ckpt_ctx *ctx, void *ptr, int len, int type)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get(ctx, sizeof(*h));
+	if (!h)
+		return -ENOMEM;
+
+	h->type = type;
+	h->len = len + sizeof(*h);
+
+	_ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+	ret = ckpt_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		goto out;
+	if (ptr)
+		ret = ckpt_kwrite(ctx, ptr, len);
+ out:
+	_ckpt_hdr_put(ctx, h, sizeof(*h));
+	return ret;
+}
+
+/**
+ * ckpt_write_buffer - write an object of type buffer
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ */
+int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	return ckpt_write_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+
+/**
+ * ckpt_write_string - write an object of type string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len)
+{
+	return ckpt_write_obj_type(ctx, str, len, CKPT_HDR_STRING);
+}
+
+/*
+ * __ckpt_generate_fmt - generate standard checkpoint error message
+ * @ctx: checkpoint context
+ * @prefmt: pre-format string
+ * @fmt: message format
+ *
+ * This generates a unified format of checkpoint error messages, to
+ * ease (after the failure) inspection by userspace tools. It converts
+ * the (printf) message @fmt into a new format: "[PREFMT]: fmt".
+ *
+ * PREFMT is constructed from @prefmt by substituting format snippets
+ * according to the contents of @prefmt.  The format characters in
+ * @prefmt can be E (error), O (objref), P (pointer), S (string) and
+ * V (variable/symbol). For example, E will generate an "err %d" in
+ * PREFMT (see prefmt_array below).
+ *
+ * If @prefmt begins with T, PREFMT will begin with "pid %d tsk %s"
+ * with the pid and the tsk->comm of the currently checkpointed task.
+ * The latter is taken from ctx->tsk, and it is the responsibility of
+ * the caller to have a valid pointer there (in particular, functions
+ * that iterate on the processes: collect_objects, checkpoint_task,
+ * and tree_count_tasks).
+ *
+ * The caller of ckpt_write_err() and _ckpt_write_err() must provide
+ * the additional variables, in order, to match the @prefmt (except for
+ * the T key), e.g.:
+ *
+ *   ckpt_write_err(ctx, "TEO", "FILE flags %d", err, objref, flags);
+ *
+ * Here, T is simply passed, E expects an integer (err), O expects an
+ * integer (objref), and the last argument matches the format string.
+ */
+static char *__ckpt_generate_fmt(struct ckpt_ctx *ctx, char *prefmt, char *fmt)
+{
+	static int warn_notask = 0;
+	static int warn_prefmt = 0;
+	char *format;
+	int i, j, len = 0;
+
+	static struct {
+		char key;
+		char *fmt;
+	} prefmt_array[] = {
+		{ 'E', "err %d" },
+		{ 'O', "obj %d" },
+		{ 'P', "ptr %p" },
+		{ 'V', "sym %pS" },
+		{ 'S', "str %s" },
+		{ 0, "??? %pS" },
+	};
+
+	/*
+	 * 17 for "pid %d" (plus space)
+	 * 21 for "tsk %s" (tsk->comm)
+	 * up to 8 per varfmt entry
+	 */
+	format = kzalloc(37 + 8 * strlen(prefmt) + strlen(fmt), GFP_KERNEL);
+	if (!format)
+		return NULL;
+
+	format[len++] = '[';
+
+	if (prefmt[0] == 'T') {
+		if (ctx->tsk)
+			len += sprintf(&format[len], "pid %d tsk %s ",
+				       task_pid_vnr(ctx->tsk), ctx->tsk->comm);
+		else if (warn_notask++ < 5)
+			printk(KERN_ERR "c/r: no target task set\n");
+		prefmt++;
+	}
+
+	for (i = 0; i < strlen(prefmt); i++) {
+		for (j = 0; prefmt_array[j].key; j++)
+			if (prefmt_array[j].key == prefmt[i])
+				break;
+		if (!prefmt_array[j].key && warn_prefmt++ < 5)
+			printk(KERN_ERR "c/r: unknown prefmt %c\n", prefmt[i]);
+		len += sprintf(&format[len], "%s ", prefmt_array[j].fmt);
+	}
+
+	if (len > 1)
+		sprintf(&format[len-1], "]: %s", fmt);  /* erase last space */
+	else
+		sprintf(format, "%s", fmt);
+
+	return format;
+}
+
+/* see __ckpt_generate_fmt for information on @prefmt */
+static void __ckpt_generate_err(struct ckpt_ctx *ctx, char *prefmt,
+				char *fmt, va_list ap)
+{
+	va_list aq;
+	char *format;
+	char *str;
+	int len;
+
+	format = __ckpt_generate_fmt(ctx, prefmt, fmt);
+	va_copy(aq, ap);
+
+	/*
+	 * prefix the error string with a '\0' to facilitate easy
+	 * backtrace to the beginning of the error message without
+	 * needing to parse the entire checkpoint image.
+	 */
+	ctx->err_string[0] = '\0';
+	str = &ctx->err_string[1];
+	len = vsnprintf(str, 255, format ? : fmt, ap) + 2;
+
+	if (len > 256) {
+		printk(KERN_NOTICE "c/r: error string truncated: ");
+		vprintk(format ? : fmt, aq);
+	}
+
+	va_end(aq);
+	kfree(format);
+
+	ckpt_debug("c/r: checkpoint error: %s\n", str);
+}
+
+/**
+ * __ckpt_write_err - save an error string on the ctx->err_string
+ * @ctx: checkpoint context
+ * @prefmt: error pre-format
+ * @fmt: message format
+ * @...: arguments
+ *
+ * See __ckpt_generate_fmt for information on @prefmt.
+ * Use this during checkpoint to report while holding a spinlock
+ */
+void __ckpt_write_err(struct ckpt_ctx *ctx, char *prefmt, char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	__ckpt_generate_err(ctx, prefmt, fmt, ap);
+	va_end(ap);
+}
+
+/**
+ * ckpt_write_err - write an object describing an error
+ * @ctx: checkpoint context
+ * @pre: string pre-format
+ * @fmt: error string format
+ * @...: error string arguments
+ *
+ * See __ckpt_generate_fmt for information on @pre.
+ * If @fmt is null, the string in the ctx->err_string will be used (and freed)
+ */
+int ckpt_write_err(struct ckpt_ctx *ctx, char *pre, char *fmt, ...)
+{
+	va_list ap;
+	char *str;
+	int len, ret = 0;
+
+	if (fmt) {
+		va_start(ap, fmt);
+		__ckpt_generate_err(ctx, pre, fmt, ap);
+		va_end(ap);
+	}
+
+	str = ctx->err_string;
+	len = strlen(str + 1) + 2;	/* leading and trailing '\0' */
+
+	if (len == 2)	/* empty error string */
+		return 0;
+
+	ret = ckpt_write_obj_type(ctx, NULL, 0, CKPT_HDR_ERROR);
+	if (!ret)
+		ret = ckpt_write_string(ctx, str, len);
+	if (ret < 0)
+		printk(KERN_NOTICE "c/r: error string unsaved (%d): %s\n",
+		       ret, str + 1);
+
+	str[1] = '\0';
+	return ret;
+}
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+static void fill_kernel_const(struct ckpt_const *h)
+{
+	struct task_struct *tsk;
+	struct new_utsname *uts;
+
+	/* task */
+	h->task_comm_len = sizeof(tsk->comm);
+	/* uts */
+	h->uts_release_len = sizeof(uts->release);
+	h->uts_version_len = sizeof(uts->version);
+	h->uts_machine_len = sizeof(uts->machine);
+}
+
+/* write the checkpoint header */
+static int checkpoint_write_header(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header *h;
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+	if (!h)
+		return -ENOMEM;
+
+	do_gettimeofday(&ktv);
+	uts = utsname();
+
+	h->magic = CHECKPOINT_MAGIC_HEAD;
+	h->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	h->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	h->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	h->rev = CHECKPOINT_VERSION;
+
+	h->uflags = ctx->uflags;
+	h->time = ktv.tv_sec;
+
+	fill_kernel_const(&h->constants);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	down_read(&uts_sem);
+	ret = ckpt_write_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret < 0)
+		goto up;
+	ret = ckpt_write_buffer(ctx, uts->version, sizeof(uts->version));
+	if (ret < 0)
+		goto up;
+	ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine));
+ up:
+	up_read(&uts_sem);
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int checkpoint_write_tail(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tail *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+	if (!h)
+		return -ENOMEM;
+
+	h->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
+{
+	long ret;
+
+	ret = checkpoint_write_header(ctx);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, return (unique) checkpoint identifier */
+	ctx->crid = atomic_inc_return(&ctx_count);
+	ret = ctx->crid;
+ out:
+	return ret;
+}
diff --git a/checkpoint/process.c b/checkpoint/process.c
new file mode 100644
index 0000000..d221c2a
--- /dev/null
+++ b/checkpoint/process.c
@@ -0,0 +1,102 @@
+/*
+ *  Checkpoint task structure
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/sched.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+/* dump the task_struct of a given task */
+static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+	if (!h)
+		return -ENOMEM;
+
+	h->state = t->state;
+	h->exit_state = t->exit_state;
+	h->exit_code = t->exit_code;
+	h->exit_signal = t->exit_signal;
+
+	h->set_child_tid = (unsigned long) t->set_child_tid;
+	h->clear_child_tid = (unsigned long) t->clear_child_tid;
+
+	/* FIXME: save remaining relevant task_struct fields */
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	int ret;
+
+	ctx->tsk = t;
+
+	ret = checkpoint_task_struct(ctx, t);
+	ckpt_debug("task %d\n", ret);
+
+	ctx->tsk = NULL;
+	return ret;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+/* read the task_struct into the current task */
+static int restore_task_struct(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task *h;
+	struct task_struct *t = current;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	memset(t->comm, 0, TASK_COMM_LEN);
+	ret = _ckpt_read_string(ctx, t->comm, TASK_COMM_LEN);
+	if (ret < 0)
+		goto out;
+
+	t->set_child_tid = (int __user *) (unsigned long) h->set_child_tid;
+	t->clear_child_tid = (int __user *) (unsigned long) h->clear_child_tid;
+
+	/* FIXME: restore remaining relevant task_struct fields */
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* read the entire state of the current task */
+int restore_task(struct ckpt_ctx *ctx)
+{
+	int ret;
+
+	ret = restore_task_struct(ctx);
+	ckpt_debug("task %d\n", ret);
+
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..9f08f4d
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,422 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/utsname.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+static int _ckpt_read_err(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
+{
+	char *ptr;
+	int len, ret;
+
+	len = h->len - sizeof(*h);
+	ptr = kzalloc(len + 1, GFP_KERNEL);
+	if (!ptr) {
+		ckpt_debug("insufficient memory to report image error\n");
+		return -ENOMEM;
+	}
+
+	ret = ckpt_kread(ctx, ptr, len);
+	if (ret >= 0) {
+		ckpt_debug("%s\n", &ptr[1]);
+		ret = -EIO;
+	}
+
+	kfree(ptr);
+	return ret;
+}
+
+/**
+ * _ckpt_read_obj - read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: desired ckpt_hdr
+ * @ptr: desired buffer
+ * @len: desired payload length (if 0, flexible)
+ * @max: maximum payload length (if 0, flexible)
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
+			  void *ptr, int len, int max)
+{
+	int ret;
+
+ again:
+	ret = ckpt_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	_ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
+		    h->type, h->len, len, max);
+	if (h->len < sizeof(*h))
+		return -EINVAL;
+
+	if (h->type == CKPT_HDR_ERROR) {
+		ret = _ckpt_read_err(ctx, h);
+		if (ret < 0)
+			return ret;
+		goto again;
+	}
+
+	/* if len specified, enforce, else if maximum specified, enforce */
+	if ((len && h->len != len) || (!len && max && h->len > max))
+		return -EINVAL;
+
+	if (ptr)
+		ret = ckpt_kread(ctx, ptr, h->len - sizeof(struct ckpt_hdr));
+	return ret;
+}
+
+/**
+ * _ckpt_read_obj_type - read an object of some type
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ * @type: buffer type
+ *
+ * If @ptr is NULL, then read only the header (payload to follow).
+ * @len specifies the expected buffer length (ignored if set to 0).
+ * Returns: actual _payload_ length
+ */
+int _ckpt_read_obj_type(struct ckpt_ctx *ctx, void *ptr, int len, int type)
+{
+	struct ckpt_hdr h;
+	int ret;
+
+	if (len)
+		len += sizeof(struct ckpt_hdr);
+	ret = _ckpt_read_obj(ctx, &h, ptr, len, len);
+	if (ret < 0)
+		return ret;
+	if (h.type != type)
+		return -EINVAL;
+	return h.len - sizeof(h);
+}
+
+/**
+ * _ckpt_read_buffer - read an object of type buffer (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ *
+ * If @ptr is NULL, then read only the header (payload to follow).
+ * @len specifies the expected buffer length (ignored if set to 0).
+ * Returns: _payload_ length.
+ */
+int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	BUG_ON(!len);
+	return _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+
+/**
+ * _ckpt_read_string - read an object of type string (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: string length (including '\0')
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	int ret;
+
+	BUG_ON(!len);
+	ret = _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_STRING);
+	if (ret < 0)
+		return ret;
+	if (ptr)
+		((char *) ptr)[len - 1] = '\0';	/* always play it safe */
+	return 0;
+}
+
+/**
+ * ckpt_read_obj - allocate and read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ * @len: desired payload length (if 0, flexible)
+ * @max: maximum payload length
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
+{
+	struct ckpt_hdr hh;
+	struct ckpt_hdr *h;
+	int ret;
+
+	ret = ckpt_kread(ctx, &hh, sizeof(hh));
+	if (ret < 0)
+		return ERR_PTR(ret);
+	_ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
+		    hh.type, hh.len, len, max);
+	if (hh.len < sizeof(*h))
+		return ERR_PTR(-EINVAL);
+	/* if len specified, enforce, else if maximum specified, enforce */
+	if ((len && hh.len != len) || (!len && max && hh.len > max))
+		return ERR_PTR(-EINVAL);
+
+	h = ckpt_hdr_get(ctx, hh.len);
+	if (!h)
+		return ERR_PTR(-ENOMEM);
+
+	*h = hh;	/* yay ! */
+
+	ret = ckpt_kread(ctx, (h + 1), hh.len - sizeof(struct ckpt_hdr));
+	if (ret < 0) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(ret);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_obj_type - allocate and read an object of some type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	BUG_ON(!len);
+
+	h = ckpt_read_obj(ctx, len, len);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->type != type) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_buf_type - allocate and read an object of some type (flexible)
+ * @ctx: checkpoint context
+ * @max: maximum object length
+ * @type: desired object type
+ *
+ * This differs from ckpt_read_obj_type() in that the length of the
+ * incoming object is flexible (up to the maximum specified by @max;
+ * unlimited if @max is 0), as determined by the ckpt_hdr data.
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int max, int type)
+{
+	struct ckpt_hdr *h;
+
+	h = ckpt_read_obj(ctx, 0, max);
+	if (IS_ERR(h))
+		return h;
+
+	if (h->type != type) {
+		ckpt_hdr_put(ctx, h);
+		h = ERR_PTR(-EINVAL);
+	}
+
+	return h;
+}
+
+/**
+ * ckpt_read_payload - allocate and read the payload of an object
+ * @ctx: checkpoint context
+ * @max: maximum payload length
+ * @ptr: pointer to buffer to be allocated (caller must free)
+ * @type: desired object type
+ *
+ * This can be used to read a variable-length _payload_ from the checkpoint
+ * stream. @max limits the size of the resulting buffer.
+ *
+ * Return: actual _payload_ length
+ */
+int ckpt_read_payload(struct ckpt_ctx *ctx, void **ptr, int max, int type)
+{
+	int len, ret;
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, type);
+	if (len < 0)
+		return len;
+	else if (len > max)
+		return -EINVAL;
+
+	*ptr = kmalloc(len, GFP_KERNEL);
+	if (!*ptr)
+		return -ENOMEM;
+
+	ret = ckpt_kread(ctx, *ptr, len);
+	if (ret < 0) {
+		kfree(*ptr);
+		return ret;
+	}
+
+	return len;
+}
+
+/**
+ * ckpt_read_string - allocate and read a string (variable length)
+ * @ctx: checkpoint context
+ * @max: maximum acceptable length
+ *
+ * Return: allocated string or error pointer
+ */
+char *ckpt_read_string(struct ckpt_ctx *ctx, int max)
+{
+	char *str;
+	int len;
+
+	len = ckpt_read_payload(ctx, (void **)&str, max, CKPT_HDR_STRING);
+	if (len < 0)
+		return ERR_PTR(len);
+	str[len - 1] = '\0';  	/* always play it safe */
+	return str;
+}
+
+/**
+ * ckpt_read_consume - consume the next object of expected type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * This can be used to skip an object in the input stream when the
+ * data is unnecessary for the restart. @len indicates the length of
+ * the object; if @len is zero the length is unconstrained.
+ */
+int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+	int ret = 0;
+
+	h = ckpt_read_obj(ctx, len, 0);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	if (h->type != type)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+static int check_kernel_const(struct ckpt_const *h)
+{
+	struct task_struct *tsk;
+	struct new_utsname *uts;
+
+	/* task */
+	if (h->task_comm_len != sizeof(tsk->comm))
+		return -EINVAL;
+	/* uts */
+	if (h->uts_release_len != sizeof(uts->release))
+		return -EINVAL;
+	if (h->uts_version_len != sizeof(uts->version))
+		return -EINVAL;
+	if (h->uts_machine_len != sizeof(uts->machine))
+		return -EINVAL;
+
+	return 0;
+}
+
+/* read the checkpoint header */
+static int restore_read_header(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header *h;
+	struct new_utsname *uts = NULL;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->magic != CHECKPOINT_MAGIC_HEAD ||
+	    h->rev != CHECKPOINT_VERSION ||
+	    h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    h->patch != ((LINUX_VERSION_CODE) & 0xff))
+		goto out;
+	if (h->uflags)
+		goto out;
+
+	ret = check_kernel_const(&h->constants);
+	if (ret < 0)
+		goto out;
+
+	ret = -ENOMEM;
+	uts = kmalloc(sizeof(*uts), GFP_KERNEL);
+	if (!uts)
+		goto out;
+
+	ctx->oflags = h->uflags;
+
+	/* FIX: verify compatibility of release, version and machine */
+	ret = _ckpt_read_buffer(ctx, uts->release, sizeof(uts->release));
+	if (ret < 0)
+		goto out;
+	ret = _ckpt_read_buffer(ctx, uts->version, sizeof(uts->version));
+	if (ret < 0)
+		goto out;
+	ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine));
+ out:
+	kfree(uts);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* read the checkpoint trailer */
+static int restore_read_tail(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tail *h;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	if (h->magic != CHECKPOINT_MAGIC_TAIL)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+long do_restart(struct ckpt_ctx *ctx, pid_t pid)
+{
+	long ret;
+
+	ret = restore_read_header(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_task(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_read_tail(ctx);
+
+	/* on success, adjust the return value if needed [TODO] */
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 79936cc..7f6f71e 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -8,9 +8,192 @@
  *  distribution for more details.
  */
 
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
 #include <linux/sched.h>
 #include <linux/kernel.h>
 #include <linux/syscalls.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   ckpt_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		uaddr += nwrite;
+	}
+	return 0;
+}
+
+int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kwrite(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+static inline int _ckpt_kread(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE;		/* unexpected EOF */
+			return nread;
+		}
+		uaddr += nread;
+	}
+	return 0;
+}
+
+int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kread(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+/**
+ * ckpt_hdr_get - get a hdr of certain size
+ * @ctx: checkpoint context
+ * @len: desired length
+ *
+ * Returns pointer to header
+ */
+void *ckpt_hdr_get(struct ckpt_ctx *ctx, int len)
+{
+	return kzalloc(len, GFP_KERNEL);
+}
+
+/**
+ * _ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ * @len: header length
+ *
+ * (requiring 'ptr' makes it easily interchangeable with kmalloc/kfree)
+ */
+void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+	kfree(ptr);
+}
+
+/**
+ * ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ *
+ * It is assumed that @ptr begins with a 'struct ckpt_hdr'.
+ */
+void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct ckpt_hdr *h = (struct ckpt_hdr *) ptr;
+	_ckpt_hdr_put(ctx, ptr, h->len);
+}
+
+/**
+ * ckpt_hdr_get_type - get a hdr of certain size and type
+ * @ctx: checkpoint context
+ * @len: number of bytes to reserve
+ *
+ * Returns pointer to the new header with @type and @len filled in, or NULL
+ */
+void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
+{
+	struct ckpt_hdr *h;
+
+	h = ckpt_hdr_get(ctx, len);
+	if (!h)
+		return NULL;
+
+	h->type = type;
+	h->len = len;
+	return h;
+}
+
+
+/*
+ * Helpers to manage c/r contexts: allocated for each checkpoint and/or
+ * restart operation, and persist until the operation is completed.
+ */
+
+static void ckpt_ctx_free(struct ckpt_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+	kfree(ctx);
+}
+
+static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
+				       unsigned long kflags)
+{
+	struct ckpt_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->uflags = uflags;
+	ctx->kflags = kflags;
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+
+	return ctx;
+ err:
+	ckpt_ctx_free(ctx);
+	return ERR_PTR(err);
+}
 
 /**
  * sys_checkpoint - checkpoint a container
@@ -23,7 +206,26 @@
  */
 SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
 {
-	return -ENOSYS;
+	struct ckpt_ctx *ctx;
+	long ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	if (pid == 0)
+		pid = task_pid_vnr(current);
+	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_CHECKPOINT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx, pid);
+
+	if (!ret)
+		ret = ctx->crid;
+
+	ckpt_ctx_free(ctx);
+	return ret;
 }
 
 /**
@@ -37,5 +239,46 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
  */
 SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 {
-	return -ENOSYS;
+	struct ckpt_ctx *ctx = NULL;
+	long ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_restart(ctx, pid);
+
+	/* restart(2) isn't idempotent: can't restart syscall */
+	if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+	    ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
+		ret = -EINTR;
+
+	ckpt_ctx_free(ctx);
+	return ret;
+}
+
+
+/* 'ckpt_debug_level' controls the verbosity level of c/r code */
+#ifdef CONFIG_CHECKPOINT_DEBUG
+
+/* FIX: allow to change during runtime */
+unsigned long __read_mostly ckpt_debug_level = CKPT_DDEFAULT;
+
+static __init int ckpt_debug_setup(char *s)
+{
+	long val, ret;
+
+	ret = strict_strtoul(s, 10, &val);
+	if (ret < 0)
+		return ret;
+	ckpt_debug_level = val;
+	return 0;
 }
+
+__setup("ckpt_debug=", ckpt_debug_setup);
+
+#endif /* CONFIG_CHECKPOINT_DEBUG */
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 334a359..3e8bd18 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -44,6 +44,9 @@ header-y += bpqether.h
 header-y += bsg.h
 header-y += can.h
 header-y += cdk.h
+header-y += checkpoint.h
+header-y += checkpoint_hdr.h
+header-y += checkpoint_types.h
 header-y += chio.h
 header-y += coda_psdev.h
 header-y += coff.h
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..be0ba4b
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,109 @@
+#ifndef _LINUX_CHECKPOINT_H_
+#define _LINUX_CHECKPOINT_H_
+/*
+ *  Generic checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define CHECKPOINT_VERSION  2
+
+#ifdef __KERNEL__
+#ifdef CONFIG_CHECKPOINT
+
+#include <linux/checkpoint_types.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/err.h>
+
+/* ckpt_ctx: kflags */
+#define CKPT_CTX_CHECKPOINT_BIT		0
+#define CKPT_CTX_RESTART_BIT		1
+
+#define CKPT_CTX_CHECKPOINT	(1 << CKPT_CTX_CHECKPOINT_BIT)
+#define CKPT_CTX_RESTART	(1 << CKPT_CTX_RESTART_BIT)
+
+
+extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
+extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
+
+extern void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int n);
+extern void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr);
+extern void *ckpt_hdr_get(struct ckpt_ctx *ctx, int n);
+extern void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int n, int type);
+
+extern int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h);
+extern int ckpt_write_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, int len, int type);
+extern int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len);
+extern void __ckpt_write_err(struct ckpt_ctx *ctx, char *ptr, char *fmt, ...);
+extern int ckpt_write_err(struct ckpt_ctx *ctx, char *ptr, char *fmt, ...);
+
+extern int _ckpt_read_obj_type(struct ckpt_ctx *ctx,
+			       void *ptr, int len, int type);
+extern int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len);
+extern void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type);
+extern void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type);
+extern int ckpt_read_payload(struct ckpt_ctx *ctx,
+			     void **ptr, int max, int type);
+extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
+extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
+
+extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
+extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
+
+/* task */
+extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_task(struct ckpt_ctx *ctx);
+
+static inline int ckpt_validate_errno(int errno)
+{
+	return (errno >= 0) && (errno < MAX_ERRNO);
+}
+
+/* debugging flags */
+#define CKPT_DBASE	0x1		/* anything */
+#define CKPT_DSYS	0x2		/* generic (system) */
+#define CKPT_DRW	0x4		/* image read/write */
+
+#define CKPT_DDEFAULT	0xffff		/* default debug level */
+
+#ifndef CKPT_DFLAG
+#define CKPT_DFLAG	0xffff		/* everything */
+#endif
+
+#ifdef CONFIG_CHECKPOINT_DEBUG
+extern unsigned long ckpt_debug_level;
+
+/* use this to select a specific debug level */
+#define _ckpt_debug(level, fmt, args...)				\
+	do {								\
+		if (ckpt_debug_level & (level))				\
+			printk(KERN_DEBUG "[%d:%d:c/r:%s:%d] " fmt,	\
+				current->pid, task_pid_vnr(current),	\
+				__func__, __LINE__, ## args);		\
+	} while (0)
+
+/*
+ * CKPT_DBASE is the base flag; it doesn't change
+ * CKPT_DFLAG is to be redefined in each source file
+ */
+#define ckpt_debug(fmt, args...)  \
+	_ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)
+
+#else
+
+#define _ckpt_debug(level, fmt, args...)	do { } while (0)
+#define ckpt_debug(fmt, args...)		do { } while (0)
+
+#endif /* CONFIG_CHECKPOINT_DEBUG */
+
+#endif /* CONFIG_CHECKPOINT */
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_CHECKPOINT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..22dadbd
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,111 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/*
+ * header format: 'struct ckpt_hdr' must prefix all other headers. Therefore
+ * when a header is passed around, the information about it (type, size)
+ * is readily available. Structs that include a struct ckpt_hdr are named
+ * struct ckpt_hdr_* by convention (usually the struct ckpt_hdr is the first
+ * member).
+ */
+struct ckpt_hdr {
+	__u32 type;
+	__u32 len;
+} __attribute__((aligned(8)));
+
+/* header types */
+enum {
+	CKPT_HDR_HEADER = 1,
+	CKPT_HDR_BUFFER,
+	CKPT_HDR_STRING,
+
+	CKPT_HDR_TASK = 101,
+
+	CKPT_HDR_TAIL = 9001,
+
+	CKPT_HDR_ERROR = 9999,
+};
+
+/* kernel constants */
+struct ckpt_const {
+	/* task */
+	__u16 task_comm_len;
+	/* uts */
+	__u16 uts_release_len;
+	__u16 uts_version_len;
+	__u16 uts_machine_len;
+} __attribute__((aligned(8)));
+
+/* checkpoint image header */
+struct ckpt_hdr_header {
+	struct ckpt_hdr h;
+	__u64 magic;
+
+	__u16 _padding;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	struct ckpt_const constants;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 uflags;	/* uflags from checkpoint */
+
+	/*
+	 * the header is followed by three strings:
+	 *   char release[const.uts_release_len];
+	 *   char version[const.uts_version_len];
+	 *   char machine[const.uts_machine_len];
+	 */
+} __attribute__((aligned(8)));
+
+
+/* checkpoint image trailer */
+struct ckpt_hdr_tail {
+	struct ckpt_hdr h;
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+
+/* task data */
+struct ckpt_hdr_task {
+	struct ckpt_hdr h;
+	__u32 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+
+	__u64 set_child_tid;
+	__u64 clear_child_tid;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
new file mode 100644
index 0000000..585cb7b
--- /dev/null
+++ b/include/linux/checkpoint_types.h
@@ -0,0 +1,35 @@
+#ifndef _LINUX_CHECKPOINT_TYPES_H_
+#define _LINUX_CHECKPOINT_TYPES_H_
+/*
+ *  Generic checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+
+struct ckpt_ctx {
+	int crid;		/* unique checkpoint id */
+
+	pid_t root_pid;		/* container identifier */
+
+	unsigned long kflags;	/* kernel flags */
+	unsigned long uflags;	/* user flags */
+	unsigned long oflags;	/* restart: uflags from checkpoint */
+
+	struct file *file;	/* input/output file */
+	int total;		/* total read/written */
+
+	struct task_struct *tsk;/* checkpoint: current target task */
+	char err_string[256];	/* checkpoint: error string */
+};
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_CHECKPOINT_TYPES_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 1923327..ff17a59 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -53,4 +53,8 @@
 #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
 
 #define STACK_END_MAGIC		0x57AC6E9D
+
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 12327b2..e1ae6e6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1006,6 +1006,19 @@ config DMA_API_DEBUG
 	  This option causes a performance degredation.  Use only if you want
 	  to debug device drivers. If unsure, say N.
 
+config CHECKPOINT_DEBUG
+	bool "Checkpoint/restart debugging (EXPERIMENTAL)"
+	depends on CHECKPOINT
+	default y
+	help
+	  This option turns on the debugging output of checkpoint/restart.
+	  The level of verbosity is controlled by 'ckpt_debug_level' and can
+	  be set at boot time with the "ckpt_debug=" option.
+
+	  Turning this option off will reduce the size of the c/r code. If
+	  turned on, it is unlikely to incur visible overhead if the debug
+	  level is set to zero.
+
 source "samples/Kconfig"
 
 source "lib/Kconfig.kgdb"
-- 
1.6.0.4
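
A note on the alignment rule quoted from Arnd Bergmann in
checkpoint_hdr.h: the following user-space sketch restates two of the
image structures with <stdint.h> types standing in for __u32/__u64 and
checks their sizes at compile time. The odd_* and outer_* structures
are hypothetical and exist only to illustrate the problem; they are
not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirrors of the image structures (uintN_t replaces __uN). */
struct ckpt_hdr {
	uint32_t type;
	uint32_t len;
} __attribute__((aligned(8)));

struct ckpt_hdr_task {
	struct ckpt_hdr h;
	uint32_t state;
	uint32_t exit_state;
	uint32_t exit_code;
	uint32_t exit_signal;
	uint64_t set_child_tid;
	uint64_t clear_child_tid;
} __attribute__((aligned(8)));

/* Hypothetical: an odd number of 32-bit members, without the attribute. */
struct odd_unaligned {
	uint32_t a, b, c;
};

/* Hypothetical: the same members, with the attribute applied. */
struct odd_aligned {
	uint32_t a, b, c;
} __attribute__((aligned(8)));

/*
 * Embedding the aligned variant next to a 64-bit member gives the same
 * layout on 32-bit and 64-bit builds. With odd_unaligned in its place,
 * the offset of 'v' would depend on the ABI's alignment for uint64_t.
 */
struct outer_aligned {
	struct odd_aligned o;
	uint64_t v;
};

static_assert(sizeof(struct ckpt_hdr) == 8, "record header is 8 bytes");
static_assert(sizeof(struct ckpt_hdr_task) == 40, "task record is 40 bytes");
static_assert(offsetof(struct ckpt_hdr_task, set_child_tid) == 24,
	      "64-bit members start on an 8-byte boundary");
static_assert(sizeof(struct odd_aligned) == 16, "size rounded up to 8");
static_assert(offsetof(struct outer_aligned, v) == 16, "stable offset");
```

The static assertions fail the build, rather than corrupting an image,
if a structure's layout ever drifts.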

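On the debug levels in checkpoint.h: ckpt_debug() always ORs CKPT_DBASE
into the per-file CKPT_DFLAG before testing it against
ckpt_debug_level. The sketch below restates that test as a plain
function to show the consequence for filtering. The function name
ckpt_debug_enabled() is made up for illustration and does not appear
in the patch.

```c
#include <assert.h>

#define CKPT_DBASE	0x1	/* anything */
#define CKPT_DSYS	0x2	/* generic (system) */
#define CKPT_DRW	0x4	/* image read/write */

static_assert((CKPT_DBASE & (CKPT_DSYS | CKPT_DRW)) == 0,
	      "class bits must not overlap the base bit");

/*
 * Hypothetical helper: returns non-zero if a message tagged with
 * 'file_flag' (the CKPT_DFLAG of its source file) would be printed
 * when ckpt_debug_level is 'level'. This mirrors the test in
 * _ckpt_debug(CKPT_DBASE | CKPT_DFLAG, ...).
 */
int ckpt_debug_enabled(unsigned long level, unsigned long file_flag)
{
	return (level & (CKPT_DBASE | file_flag)) != 0;
}
```

Since CKPT_DBASE is part of every message's mask, a level with bit 0
set prints all ckpt_debug() output; to see only one class (say
CKPT_DRW), bit 0 has to be cleared in the level.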

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 21/80] c/r: x86_32 support for checkpoint/restart
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (19 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 20/80] c/r: basic infrastructure for checkpoint/restart Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 22/80] c/r: external checkpoint of a task other than ourself Oren Laadan
                     ` (60 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

In addition, architecture capabilities are saved in an
architecture-specific extension of the header (ckpt_hdr_header_arch);
currently this includes only FPU capabilities.

Currently only x86-32 is supported.

Changelog[v17]:
  - Fix compilation for architectures that don't support checkpoint
  - Validate cpu registers and TLS descriptors on restart
  - Validate debug registers on restart
  - Export asm/checkpoint_hdr.h to userspace
Changelog[v16]:
  - All objects are preceded by ckpt_hdr (TLS and xstate_buf)
  - Add architecture identifier to main header
Changelog[v14]:
  - Use new interface ckpt_hdr_get/put()
  - Embed struct ckpt_hdr in struct ckpt_hdr...
  - Remove preempt_disable/enable() around init_fpu() and fix leak
  - Revert change to pr_debug(), back to ckpt_debug()
  - Move code related to task_struct to checkpoint/process.c
Changelog[v12]:
  - A couple of missed calls to ckpt_hbuf_put()
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
  - Add arch-specific header that details architecture capabilities;
    split FPU restore to send capabilities only once.
  - Test for zero TLS entries in ckpt_write_thread()
  - Fix asm/checkpoint_hdr.h so it can be included from user-space
Changelog[v7]:
  - Fix save/restore state of FPU
Changelog[v5]:
  - Remove preempt_disable() when restoring debug registers
Changelog[v4]:
  - Fix header structure alignment
Changelog[v2]:
  - Pad header structures to 64 bits to ensure compatibility
  - Follow Dave Hansen's refactoring of the original post

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/x86/include/asm/Kbuild           |    1 +
 arch/x86/include/asm/checkpoint_hdr.h |  121 ++++++++
 arch/x86/include/asm/ptrace.h         |    5 +
 arch/x86/kernel/ptrace.c              |    8 +-
 arch/x86/mm/Makefile                  |    2 +
 arch/x86/mm/checkpoint.c              |  531 +++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c               |    7 +-
 checkpoint/process.c                  |   19 ++-
 checkpoint/restart.c                  |    6 +
 include/linux/checkpoint.h            |    9 +
 include/linux/checkpoint_hdr.h        |   16 +-
 11 files changed, 718 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
 create mode 100644 arch/x86/mm/checkpoint.c

diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4a8e80c..f76cb6e 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -2,6 +2,7 @@ include include/asm-generic/Kbuild.asm
 
 header-y += boot.h
 header-y += bootparam.h
+header-y += checkpoint_hdr.h
 header-y += debugreg.h
 header-y += ldt.h
 header-y += msr-index.h
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..f4d1e14
--- /dev/null
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -0,0 +1,121 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#error asm/checkpoint_hdr.h included directly
+#endif
+
+#include <linux/types.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/* i387 structure seen from kernel/userspace */
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+#ifdef CONFIG_X86_32
+#define CKPT_ARCH_ID	CKPT_ARCH_X86_32
+#endif
+
+/* arch dependent header types */
+enum {
+	CKPT_HDR_CPU_FPU = 201,
+};
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+	/* FIXME: add HAVE_HWFP */
+	__u16 has_fxsr;
+	__u16 has_xsave;
+	__u16 xstate_size;
+	__u16 _pading;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_thread {
+	struct ckpt_hdr h;
+	__u32 thread_info_flags;
+	__u16 gdt_entry_tls_entries;
+	__u16 sizeof_tls_array;
+} __attribute__((aligned(8)));
+
+/* designed to work for both x86_32 and x86_64 */
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	/* see struct pt_regs (x86_64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 sp;
+
+	__u64 flags;
+
+	/* segment registers */
+	__u64 fs;
+	__u64 gs;
+
+	__u16 fsindex;
+	__u16 gsindex;
+	__u16 cs;
+	__u16 ss;
+	__u16 ds;
+	__u16 es;
+
+	__u32 used_math;
+
+	/* debug registers */
+	__u64 debugreg0;
+	__u64 debugreg1;
+	__u64 debugreg2;
+	__u64 debugreg3;
+	__u64 debugreg6;
+	__u64 debugreg7;
+
+	/* thread_xstate contents follow (if used_math) */
+} __attribute__((aligned(8)));
+
+#define CKPT_X86_SEG_NULL	0
+#define CKPT_X86_SEG_USER32_CS	1
+#define CKPT_X86_SEG_USER32_DS	2
+#define CKPT_X86_SEG_TLS	0x4000	/* 0100 0000 0000 00xx */
+#define CKPT_X86_SEG_LDT	0x8000	/* 100x xxxx xxxx xxxx */
+
+#endif /* __ASM_X86_CKPT_HDR_H */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 0f0d908..66b507b 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -242,6 +242,11 @@ extern void ptrace_bts_untrace(struct task_struct *tsk);
 #define arch_ptrace_untrace(tsk)	ptrace_bts_untrace(tsk)
 #endif /* CONFIG_X86_PTRACE_BTS */
 
+extern int ptrace_check_debugreg(int _32bit,
+				 unsigned long dr0, unsigned long dr1,
+				 unsigned long dr2, unsigned long dr3,
+				 unsigned long dr6, unsigned long dr7);
+
 #endif /* __KERNEL__ */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9b4cacf..3b434bd 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -466,10 +466,10 @@ static unsigned long ptrace_get_debugreg(struct task_struct *child, int n)
 	return 0;
 }
 
-static int ptrace_check_debugreg(int _32bit,
-				 unsigned long dr0, unsigned long dr1,
-				 unsigned long dr2, unsigned long dr3,
-				 unsigned long dr6, unsigned long dr7)
+int ptrace_check_debugreg(int _32bit,
+			  unsigned long dr0, unsigned long dr1,
+			  unsigned long dr2, unsigned long dr3,
+			  unsigned long dr6, unsigned long dr7)
 {
 	/* Breakpoint type: 00: --x, 01: -w-, 10: undefined, 11: rw- */
 	unsigned int rw[4];
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index eefdeee..ddd5abb 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -21,3 +21,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..023039f
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,531 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * helpers to encode/decode/validate registers/segments/eflags
+ */
+
+static int check_eflags(__u32 eflags)
+{
+#define X86_EFLAGS_CKPT_MASK  \
+	(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF | \
+	 X86_EFLAGS_SF | X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_OF | \
+	 X86_EFLAGS_NT | X86_EFLAGS_AC | X86_EFLAGS_ID)
+
+	if ((eflags & ~X86_EFLAGS_CKPT_MASK) != (X86_EFLAGS_IF | 0x2))
+		return 0;
+	return 1;
+}
+
+static int check_tls(struct desc_struct *desc)
+{
+	if (!desc->a && !desc->b)
+		return 1;
+	if (desc->l != 0 || desc->s != 1 || desc->dpl != 3)
+		return 0;
+	return 1;
+}
+
+static int check_segment(__u16 seg)
+{
+	int ret = 0;
+
+	switch (seg) {
+	case CKPT_X86_SEG_NULL:
+	case CKPT_X86_SEG_USER32_CS:
+	case CKPT_X86_SEG_USER32_DS:
+		return 1;
+	}
+	if (seg & CKPT_X86_SEG_TLS) {
+		seg &= ~CKPT_X86_SEG_TLS;
+		if (seg <= GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN)
+			ret = 1;
+	} else if (seg & CKPT_X86_SEG_LDT) {
+		seg &= ~CKPT_X86_SEG_LDT;
+		if (seg <= 0x1fff)
+			ret = 1;
+	}
+	return ret;
+}
+
+static __u16 encode_segment(unsigned short seg)
+{
+	if (seg == 0)
+		return CKPT_X86_SEG_NULL;
+	BUG_ON((seg & 3) != 3);
+
+	if (seg == __USER_CS)
+		return CKPT_X86_SEG_USER32_CS;
+	if (seg == __USER_DS)
+		return CKPT_X86_SEG_USER32_DS;
+
+	if (seg & 4)
+		return CKPT_X86_SEG_LDT | (seg >> 3);
+
+	seg >>= 3;
+	if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX)
+		return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN);
+
+	printk(KERN_ERR "c/r: (encode) bad segment %#hx\n", seg);
+	BUG();
+}
+
+static unsigned short decode_segment(__u16 seg)
+{
+	if (seg == CKPT_X86_SEG_NULL)
+		return 0;
+	if (seg == CKPT_X86_SEG_USER32_CS)
+		return __USER_CS;
+	if (seg == CKPT_X86_SEG_USER32_DS)
+		return __USER_DS;
+
+	if (seg & CKPT_X86_SEG_TLS) {
+		seg &= ~CKPT_X86_SEG_TLS;
+		return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3;
+	}
+	if (seg & CKPT_X86_SEG_LDT) {
+		seg &= ~CKPT_X86_SEG_LDT;
+		return (seg << 3) | 7;
+	}
+	BUG();
+}
+
+#define CKPT_X86_TIF_UNSUPPORTED   (_TIF_SECCOMP | _TIF_IO_BITMAP)
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static int may_checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	if (t->thread.vm86_info) {
+		ckpt_write_err(ctx, "TE", "task in VM86 mode", -EBUSY);
+		return -EBUSY;
+	}
+	if (task_thread_info(t)->flags & CKPT_X86_TIF_UNSUPPORTED) {
+		ckpt_write_err(ctx, "TE", "bad thread info flags %#lx", -EBUSY);
+		return -EBUSY;
+	}
+	return 0;
+}
+
+/* dump the thread_struct of a given task */
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_thread *h;
+	int tls_size;
+	int ret;
+
+	ret = may_checkpoint_thread(ctx, t);
+	if (ret < 0)
+		return ret;
+
+	tls_size = sizeof(t->thread.tls_array);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD);
+	if (!h)
+		return -ENOMEM;
+
+	h->thread_info_flags =
+		task_thread_info(t)->flags & ~CKPT_X86_TIF_UNSUPPORTED;
+	h->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	h->sizeof_tls_array = tls_size;
+
+	/* For simplicity dump the entire array */
+	memcpy(h + 1, t->thread.tls_array, tls_size);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#ifdef CONFIG_X86_32
+
+static void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+	unsigned long _gs;
+
+	h->bp = regs->bp;
+	h->bx = regs->bx;
+	h->ax = regs->ax;
+	h->cx = regs->cx;
+	h->dx = regs->dx;
+	h->si = regs->si;
+	h->di = regs->di;
+	h->orig_ax = regs->orig_ax;
+	h->ip = regs->ip;
+
+	h->flags = regs->flags;
+	h->sp = regs->sp;
+
+	h->cs = encode_segment(regs->cs);
+	h->ss = encode_segment(regs->ss);
+	h->ds = encode_segment(regs->ds);
+	h->es = encode_segment(regs->es);
+
+	/*
+	 * for checkpoint in process context (from within a container)
+	 * the GS segment register should be saved from the hardware;
+	 * otherwise it is already saved on the thread structure
+	 */
+	if (t == current)
+		_gs = get_user_gs(regs);
+	else
+		_gs = thread->gs;
+
+	h->fsindex = encode_segment(regs->fs);
+	h->gsindex = encode_segment(_gs);
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into the saved ax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON((long) regs->orig_ax < 0);
+		h->ax = 0;
+	}
+}
+
+static void save_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* debug regs */
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * get the actual registers; otherwise get the saved values.
+	 */
+
+	if (t == current) {
+		get_debugreg(h->debugreg0, 0);
+		get_debugreg(h->debugreg1, 1);
+		get_debugreg(h->debugreg2, 2);
+		get_debugreg(h->debugreg3, 3);
+		get_debugreg(h->debugreg6, 6);
+		get_debugreg(h->debugreg7, 7);
+	} else {
+		h->debugreg0 = thread->debugreg0;
+		h->debugreg1 = thread->debugreg1;
+		h->debugreg2 = thread->debugreg2;
+		h->debugreg3 = thread->debugreg3;
+		h->debugreg6 = thread->debugreg6;
+		h->debugreg7 = thread->debugreg7;
+	}
+}
+
+static void save_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	h->used_math = tsk_used_math(t) ? 1 : 0;
+}
+
+static int checkpoint_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, xstate_size + sizeof(*h),
+			      CKPT_HDR_CPU_FPU);
+	if (!h)
+		return -ENOMEM;
+
+	/* i387 + MMX + SSE logic */
+	preempt_disable();	/* needed if (t == current) */
+
+	/*
+	 * normally, no need to unlazy_fpu(), since TS_USEDFPU flag
+	 * was cleared when task was context-switched out...
+	 * except if we are in process context, in which case we do
+	 */
+	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
+		unlazy_fpu(current);
+
+	/*
+	 * For simplicity dump the entire structure.
+	 * FIX: need to be deliberate about what registers we are
+	 * dumping for traceability and compatibility.
+	 */
+	memcpy(h + 1, t->thread.xstate, xstate_size);
+	preempt_enable();	/* needed if (t == current) */
+
+	ret = ckpt_write_obj(ctx, h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+#endif	/* CONFIG_X86_32 */
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (!h)
+		return -ENOMEM;
+
+	save_cpu_regs(h, t);
+	save_cpu_debug(h, t);
+	save_cpu_fpu(h, t);
+
+	ckpt_debug("math %d debug %d\n", h->used_math, !!h->debugreg7);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	if (h->used_math)
+		ret = checkpoint_cpu_fpu(ctx, t);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (!h)
+		return -ENOMEM;
+
+	/* FPU capabilities */
+	h->has_fxsr = cpu_has_fxsr;
+	h->has_xsave = cpu_has_xsave;
+	h->xstate_size = xstate_size;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/* read the thread_struct into the current task */
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_thread *h;
+	struct thread_struct *thread = &current->thread;
+	struct desc_struct *desc;
+	int tls_size;
+	int i, cpu, ret;
+
+	tls_size = sizeof(thread->tls_array);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->thread_info_flags & CKPT_X86_TIF_UNSUPPORTED)
+		goto out;
+	if (h->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES)
+		goto out;
+	if (h->sizeof_tls_array != tls_size)
+		goto out;
+
+	/*
+	 * restore TLS by hand: why convert to struct user_desc if
+	 * sys_set_thread_area() will convert it back ?
+	 */
+	desc = (struct desc_struct *) (h + 1);
+
+	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
+		if (!check_tls(&desc[i]))
+			goto out;
+	}
+
+	cpu = get_cpu();
+	memcpy(thread->tls_array, desc, tls_size);
+	load_TLS(thread, cpu);
+	put_cpu();
+
+	/* TODO: restore TIF flags as necessary (e.g. TIF_NOTSC) */
+
+	ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#ifdef CONFIG_X86_32
+
+static int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	if (!check_eflags(h->flags))
+		return -EINVAL;
+	if (h->cs == CKPT_X86_SEG_NULL)
+		return -EINVAL;
+	if (!check_segment(h->cs) || !check_segment(h->ds) ||
+	    !check_segment(h->es) || !check_segment(h->ss) ||
+	    !check_segment(h->fsindex) || !check_segment(h->gsindex))
+		return -EINVAL;
+
+	regs->bp = h->bp;
+	regs->bx = h->bx;
+	regs->ax = h->ax;
+	regs->cx = h->cx;
+	regs->dx = h->dx;
+	regs->si = h->si;
+	regs->di = h->di;
+	regs->orig_ax = h->orig_ax;
+	regs->ip = h->ip;
+
+	regs->flags = h->flags;
+	regs->sp = h->sp;
+
+	regs->ds = decode_segment(h->ds);
+	regs->es = decode_segment(h->es);
+	regs->cs = decode_segment(h->cs);
+	regs->ss = decode_segment(h->ss);
+
+	regs->fs = decode_segment(h->fsindex);
+	regs->gs = decode_segment(h->gsindex);
+
+	thread->gs = regs->gs;
+	lazy_load_gs(regs->gs);
+
+	return 0;
+}
+
+static int load_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	int ret;
+
+	ret = ptrace_check_debugreg(1, h->debugreg0, h->debugreg1, h->debugreg2,
+				    h->debugreg3, h->debugreg6, h->debugreg7);
+	if (ret < 0)
+		return ret;
+
+	set_debugreg(h->debugreg0, 0);
+	set_debugreg(h->debugreg1, 1);
+	/* ignore 4, 5 */
+	set_debugreg(h->debugreg2, 2);
+	set_debugreg(h->debugreg3, 3);
+	set_debugreg(h->debugreg6, 6);
+	set_debugreg(h->debugreg7, 7);
+
+	if (h->debugreg7)
+		set_tsk_thread_flag(t, TIF_DEBUG);
+	else
+		clear_tsk_thread_flag(t, TIF_DEBUG);
+
+	return 0;
+}
+
+static int load_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	preempt_disable();
+
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!h->used_math)
+		clear_used_math();
+
+	preempt_enable();
+	return 0;
+}
+
+static int restore_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	/* init_fpu() eventually also calls set_used_math() */
+	ret = init_fpu(current);
+	if (ret < 0)
+		return ret;
+
+	h = ckpt_read_obj_type(ctx, xstate_size + sizeof(*h),
+			       CKPT_HDR_CPU_FPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	memcpy(t->thread.xstate, h + 1, xstate_size);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#endif	/* CONFIG_X86_32 */
+
+/* read the cpu state and registers for the current task */
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *h;
+	struct task_struct *t = current;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("math %d debug %d\n", h->used_math, !!h->debugreg7);
+
+	ret = load_cpu_regs(h, t);
+	if (ret < 0)
+		goto out;
+	ret = load_cpu_debug(h, t);
+	if (ret < 0)
+		goto out;
+	ret = load_cpu_fpu(h, t);
+	if (ret < 0)
+		goto out;
+
+	if (h->used_math)
+		ret = restore_cpu_fpu(ctx, t);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* FIX: verify compatibility of architecture features */
+
+	/* verify FPU capabilities */
+	if (h->has_fxsr != cpu_has_fxsr ||
+	    h->has_xsave != cpu_has_xsave ||
+	    h->xstate_size != xstate_size)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 57eb7d8..e228151 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -300,6 +300,8 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx)
 	do_gettimeofday(&ktv);
 	uts = utsname();
 
+	h->arch_id = cpu_to_le16(CKPT_ARCH_ID);  /* see asm/checkpoint_hdr.h */
+
 	h->magic = CHECKPOINT_MAGIC_HEAD;
 	h->major = (LINUX_VERSION_CODE >> 16) & 0xff;
 	h->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
@@ -327,7 +329,10 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx)
 	ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine));
  up:
 	up_read(&uts_sem);
-	return ret;
+	if (ret < 0)
+		return ret;
+
+	return checkpoint_write_header_arch(ctx);
 }
 
 /* write the checkpoint trailer */
diff --git a/checkpoint/process.c b/checkpoint/process.c
index d221c2a..1d1170c 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -57,6 +57,15 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = checkpoint_task_struct(ctx, t);
 	ckpt_debug("task %d\n", ret);
 
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_thread(ctx, t);
+	ckpt_debug("thread %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_cpu(ctx, t);
+	ckpt_debug("cpu %d\n", ret);
+ out:
 	ctx->tsk = NULL;
 	return ret;
 }
@@ -97,6 +106,14 @@ int restore_task(struct ckpt_ctx *ctx)
 
 	ret = restore_task_struct(ctx);
 	ckpt_debug("task %d\n", ret);
-
+	if (ret < 0)
+		goto out;
+	ret = restore_thread(ctx);
+	ckpt_debug("thread %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = restore_cpu(ctx);
+	ckpt_debug("cpu %d\n", ret);
+ out:
 	return ret;
 }
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 9f08f4d..afe51c2 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -354,6 +354,8 @@ static int restore_read_header(struct ckpt_ctx *ctx)
 		return PTR_ERR(h);
 
 	ret = -EINVAL;
+	if (le16_to_cpu(h->arch_id) != CKPT_ARCH_ID)
+		goto out;
 	if (h->magic != CHECKPOINT_MAGIC_HEAD ||
 	    h->rev != CHECKPOINT_VERSION ||
 	    h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
@@ -382,6 +384,10 @@ static int restore_read_header(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 	ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine));
+	if (ret < 0)
+		goto out;
+
+	ret = restore_read_header_arch(ctx);
  out:
 	kfree(uts);
 	ckpt_hdr_put(ctx, h);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index be0ba4b..aa8ce11 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -61,6 +61,15 @@ extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
 extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task(struct ckpt_ctx *ctx);
 
+/* arch hooks */
+extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx);
+extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+
+extern int restore_read_header_arch(struct ckpt_ctx *ctx);
+extern int restore_thread(struct ckpt_ctx *ctx);
+extern int restore_cpu(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 22dadbd..92d082e 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -40,19 +40,33 @@ struct ckpt_hdr {
 	__u32 len;
 } __attribute__((aligned(8)));
 
+
+#include <asm/checkpoint_hdr.h>
+
+
 /* header types */
 enum {
 	CKPT_HDR_HEADER = 1,
+	CKPT_HDR_HEADER_ARCH,
 	CKPT_HDR_BUFFER,
 	CKPT_HDR_STRING,
 
 	CKPT_HDR_TASK = 101,
+	CKPT_HDR_THREAD,
+	CKPT_HDR_CPU,
+
+	/* 201-299: reserved for arch-dependent */
 
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
 };
 
+/* architecture */
+enum {
+	CKPT_ARCH_X86_32 = 1,
+};
+
 /* kernel constants */
 struct ckpt_const {
 	/* task */
@@ -68,7 +82,7 @@ struct ckpt_hdr_header {
 	struct ckpt_hdr h;
 	__u64 magic;
 
-	__u16 _padding;
+	__u16 arch_id;
 
 	__u16 major;
 	__u16 minor;
-- 
1.6.0.4
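
For readers following encode_segment()/decode_segment() in
arch/x86/mm/checkpoint.c: the image stores a tagged index rather than
the raw selector, so a restart never loads an arbitrary GDT entry. The
user-space sketch below restates the mapping. The selector constants
are assumptions taken from the x86-32 asm/segment.h of that era
(GDT_ENTRY_TLS_MIN 6, __USER_CS 0x73, __USER_DS 0x7b), and the
functions return -1 where the kernel code would BUG() or reject the
image.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-32 values, restated here for user space. */
#define GDT_ENTRY_TLS_MIN	6
#define GDT_ENTRY_TLS_MAX	8
#define USER_CS			0x73	/* (14 << 3) | 3 */
#define USER_DS			0x7b	/* (15 << 3) | 3 */

#define CKPT_X86_SEG_NULL	0
#define CKPT_X86_SEG_USER32_CS	1
#define CKPT_X86_SEG_USER32_DS	2
#define CKPT_X86_SEG_TLS	0x4000
#define CKPT_X86_SEG_LDT	0x8000

static_assert(GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN < CKPT_X86_SEG_TLS,
	      "TLS index must fit below the tag bits");

/* Selector -> image encoding; -1 if the selector is not a user one. */
int encode_segment(uint16_t seg)
{
	uint16_t index = seg >> 3;

	if (seg == 0)
		return CKPT_X86_SEG_NULL;
	if ((seg & 3) != 3)		/* RPL must be 3 (user) */
		return -1;
	if (seg == USER_CS)
		return CKPT_X86_SEG_USER32_CS;
	if (seg == USER_DS)
		return CKPT_X86_SEG_USER32_DS;
	if (seg & 4)			/* TI bit set: LDT selector */
		return CKPT_X86_SEG_LDT | index;
	if (index >= GDT_ENTRY_TLS_MIN && index <= GDT_ENTRY_TLS_MAX)
		return CKPT_X86_SEG_TLS | (index - GDT_ENTRY_TLS_MIN);
	return -1;
}

/* Image encoding -> selector; validates like check_segment() first. */
int decode_segment(uint16_t enc)
{
	uint16_t index;

	if (enc == CKPT_X86_SEG_NULL)
		return 0;
	if (enc == CKPT_X86_SEG_USER32_CS)
		return USER_CS;
	if (enc == CKPT_X86_SEG_USER32_DS)
		return USER_DS;
	if (enc & CKPT_X86_SEG_TLS) {
		index = enc & ~CKPT_X86_SEG_TLS;
		if (index > GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN)
			return -1;
		return ((GDT_ENTRY_TLS_MIN + index) << 3) | 3;
	}
	if (enc & CKPT_X86_SEG_LDT) {
		index = enc & ~CKPT_X86_SEG_LDT;
		if (index > 0x1fff)
			return -1;
		return (index << 3) | 7;	/* TI = 1, RPL = 3 */
	}
	return -1;
}
```

For example, the first TLS selector 0x33 encodes to 0x4000 and decodes
back to 0x33, while a kernel selector such as 0x10 (RPL 0) is refused.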

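Similarly for check_eflags(): the restart path accepts an image's
flags only if every bit outside the user-modifiable set matches what a
normal user task would have, namely IF set plus the always-one bit 1.
A user-space sketch follows, with the flag values restated from the
x86 EFLAGS layout. X86_EFLAGS_FIXED is a name chosen here for the
literal 0x2 in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define X86_EFLAGS_CF	0x00000001u	/* carry */
#define X86_EFLAGS_PF	0x00000004u	/* parity */
#define X86_EFLAGS_AF	0x00000010u	/* auxiliary carry */
#define X86_EFLAGS_ZF	0x00000040u	/* zero */
#define X86_EFLAGS_SF	0x00000080u	/* sign */
#define X86_EFLAGS_TF	0x00000100u	/* trap */
#define X86_EFLAGS_IF	0x00000200u	/* interrupt enable */
#define X86_EFLAGS_DF	0x00000400u	/* direction */
#define X86_EFLAGS_OF	0x00000800u	/* overflow */
#define X86_EFLAGS_NT	0x00004000u	/* nested task */
#define X86_EFLAGS_AC	0x00040000u	/* alignment check */
#define X86_EFLAGS_ID	0x00200000u	/* CPUID detection */
#define X86_EFLAGS_FIXED 0x00000002u	/* bit 1, always set (name assumed) */

/* Bits a user task may legitimately have in any state. */
#define X86_EFLAGS_CKPT_MASK \
	(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF | \
	 X86_EFLAGS_SF | X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_OF | \
	 X86_EFLAGS_NT | X86_EFLAGS_AC | X86_EFLAGS_ID)

static_assert((X86_EFLAGS_CKPT_MASK & X86_EFLAGS_IF) == 0,
	      "IF is required, not optional");

/* Returns 1 if 'eflags' is acceptable for a restarted user task. */
int check_eflags(uint32_t eflags)
{
	return (eflags & ~X86_EFLAGS_CKPT_MASK) ==
	       (X86_EFLAGS_IF | X86_EFLAGS_FIXED);
}
```

With this test, an image that tries to smuggle in IOPL=3 (0x3000) or
VM86 mode (0x20000), or that has interrupts disabled, is rejected with
-EINVAL by load_cpu_regs().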
^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 21/80] c/r: x86_32 support for checkpoint/restart
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

In addition, architecture capabilities are saved in an
architecture-specific extension of the header (ckpt_hdr_header_arch);
currently this includes only FPU capabilities.

Currently only x86-32 is supported.

Changelog[v17]:
  - Fix compilation for architectures that don't support checkpoint
  - Validate cpu registers and TLS descriptors on restart
  - Validate debug registers on restart
  - Export asm/checkpoint_hdr.h to userspace
Changelog[v16]:
  - All objects are preceded by ckpt_hdr (TLS and xstate_buf)
  - Add architecture identifier to main header
Changelog[v14]:
  - Use new interface ckpt_hdr_get/put()
  - Embed struct ckpt_hdr in struct ckpt_hdr...
  - Remove preempt_disable/enable() around init_fpu() and fix leak
  - Revert change to pr_debug(), back to ckpt_debug()
  - Move code related to task_struct to checkpoint/process.c
Changelog[v12]:
  - A couple of missed calls to ckpt_hbuf_put()
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
  - Add arch-specific header that details architecture capabilities;
    split FPU restore to send capabilities only once.
  - Test for zero TLS entries in ckpt_write_thread()
  - Fix asm/checkpoint_hdr.h so it can be included from user-space
Changelog[v7]:
  - Fix save/restore state of FPU
Changelog[v5]:
  - Remove preempt_disable() when restoring debug registers
Changelog[v4]:
  - Fix header structure alignment
Changelog[v2]:
  - Pad header structures to 64 bits to ensure compatibility
  - Follow Dave Hansen's refactoring of the original post

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/include/asm/Kbuild           |    1 +
 arch/x86/include/asm/checkpoint_hdr.h |  121 ++++++++
 arch/x86/include/asm/ptrace.h         |    5 +
 arch/x86/kernel/ptrace.c              |    8 +-
 arch/x86/mm/Makefile                  |    2 +
 arch/x86/mm/checkpoint.c              |  531 +++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c               |    7 +-
 checkpoint/process.c                  |   19 ++-
 checkpoint/restart.c                  |    6 +
 include/linux/checkpoint.h            |    9 +
 include/linux/checkpoint_hdr.h        |   16 +-
 11 files changed, 718 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
 create mode 100644 arch/x86/mm/checkpoint.c

diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4a8e80c..f76cb6e 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -2,6 +2,7 @@ include include/asm-generic/Kbuild.asm
 
 header-y += boot.h
 header-y += bootparam.h
+header-y += checkpoint_hdr.h
 header-y += debugreg.h
 header-y += ldt.h
 header-y += msr-index.h
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..f4d1e14
--- /dev/null
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -0,0 +1,121 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#error asm/checkpoint_hdr.h included directly
+#endif
+
+#include <linux/types.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/* i387 structure seen from kernel/userspace */
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+#ifdef CONFIG_X86_32
+#define CKPT_ARCH_ID	CKPT_ARCH_X86_32
+#endif
+
+/* arch dependent header types */
+enum {
+	CKPT_HDR_CPU_FPU = 201,
+};
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+	/* FIXME: add HAVE_HWFP */
+	__u16 has_fxsr;
+	__u16 has_xsave;
+	__u16 xstate_size;
+	__u16 _padding;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_thread {
+	struct ckpt_hdr h;
+	__u32 thread_info_flags;
+	__u16 gdt_entry_tls_entries;
+	__u16 sizeof_tls_array;
+} __attribute__((aligned(8)));
+
+/* designed to work for both x86_32 and x86_64 */
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	/* see struct pt_regs (x86_64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 sp;
+
+	__u64 flags;
+
+	/* segment registers */
+	__u64 fs;
+	__u64 gs;
+
+	__u16 fsindex;
+	__u16 gsindex;
+	__u16 cs;
+	__u16 ss;
+	__u16 ds;
+	__u16 es;
+
+	__u32 used_math;
+
+	/* debug registers */
+	__u64 debugreg0;
+	__u64 debugreg1;
+	__u64 debugreg2;
+	__u64 debugreg3;
+	__u64 debugreg6;
+	__u64 debugreg7;
+
+	/* thread_xstate contents follow (if used_math) */
+} __attribute__((aligned(8)));
+
+#define CKPT_X86_SEG_NULL	0
+#define CKPT_X86_SEG_USER32_CS	1
+#define CKPT_X86_SEG_USER32_DS	2
+#define CKPT_X86_SEG_TLS	0x4000	/* 0100 0000 0000 00xx */
+#define CKPT_X86_SEG_LDT	0x8000	/* 100x xxxx xxxx xxxx */
+
+#endif /* __ASM_X86_CKPT_HDR_H */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 0f0d908..66b507b 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -242,6 +242,11 @@ extern void ptrace_bts_untrace(struct task_struct *tsk);
 #define arch_ptrace_untrace(tsk)	ptrace_bts_untrace(tsk)
 #endif /* CONFIG_X86_PTRACE_BTS */
 
+extern int ptrace_check_debugreg(int _32bit,
+				 unsigned long dr0, unsigned long dr1,
+				 unsigned long dr2, unsigned long dr3,
+				 unsigned long dr6, unsigned long dr7);
+
 #endif /* __KERNEL__ */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9b4cacf..3b434bd 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -466,10 +466,10 @@ static unsigned long ptrace_get_debugreg(struct task_struct *child, int n)
 	return 0;
 }
 
-static int ptrace_check_debugreg(int _32bit,
-				 unsigned long dr0, unsigned long dr1,
-				 unsigned long dr2, unsigned long dr3,
-				 unsigned long dr6, unsigned long dr7)
+int ptrace_check_debugreg(int _32bit,
+			  unsigned long dr0, unsigned long dr1,
+			  unsigned long dr2, unsigned long dr3,
+			  unsigned long dr6, unsigned long dr7)
 {
 	/* Breakpoint type: 00: --x, 01: -w-, 10: undefined, 11: rw- */
 	unsigned int rw[4];
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index eefdeee..ddd5abb 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -21,3 +21,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..023039f
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,531 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * helpers to encode/decode/validate registers/segments/eflags
+ */
+
+static int check_eflags(__u32 eflags)
+{
+#define X86_EFLAGS_CKPT_MASK  \
+	(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF | \
+	 X86_EFLAGS_SF | X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_OF | \
+	 X86_EFLAGS_NT | X86_EFLAGS_AC | X86_EFLAGS_ID)
+
+	if ((eflags & ~X86_EFLAGS_CKPT_MASK) != (X86_EFLAGS_IF | 0x2))
+		return 0;
+	return 1;
+}
+
+static int check_tls(struct desc_struct *desc)
+{
+	if (!desc->a && !desc->b)
+		return 1;
+	if (desc->l != 0 || desc->s != 1 || desc->dpl != 3)
+		return 0;
+	return 1;
+}
+
+static int check_segment(__u16 seg)
+{
+	int ret = 0;
+
+	switch (seg) {
+	case CKPT_X86_SEG_NULL:
+	case CKPT_X86_SEG_USER32_CS:
+	case CKPT_X86_SEG_USER32_DS:
+		return 1;
+	}
+	if (seg & CKPT_X86_SEG_TLS) {
+		seg &= ~CKPT_X86_SEG_TLS;
+		if (seg <= GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN)
+			ret = 1;
+	} else if (seg & CKPT_X86_SEG_LDT) {
+		seg &= ~CKPT_X86_SEG_LDT;
+		if (seg <= 0x1fff)
+			ret = 1;
+	}
+	return ret;
+}
+
+static __u16 encode_segment(unsigned short seg)
+{
+	if (seg == 0)
+		return CKPT_X86_SEG_NULL;
+	BUG_ON((seg & 3) != 3);
+
+	if (seg == __USER_CS)
+		return CKPT_X86_SEG_USER32_CS;
+	if (seg == __USER_DS)
+		return CKPT_X86_SEG_USER32_DS;
+
+	if (seg & 4)
+		return CKPT_X86_SEG_LDT | (seg >> 3);
+
+	seg >>= 3;
+	if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX)
+		return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN);
+
+	printk(KERN_ERR "c/r: (encode) bad segment %#hx\n", seg);
+	BUG();
+}
+
+static unsigned short decode_segment(__u16 seg)
+{
+	if (seg == CKPT_X86_SEG_NULL)
+		return 0;
+	if (seg == CKPT_X86_SEG_USER32_CS)
+		return __USER_CS;
+	if (seg == CKPT_X86_SEG_USER32_DS)
+		return __USER_DS;
+
+	if (seg & CKPT_X86_SEG_TLS) {
+		seg &= ~CKPT_X86_SEG_TLS;
+		return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3;
+	}
+	if (seg & CKPT_X86_SEG_LDT) {
+		seg &= ~CKPT_X86_SEG_LDT;
+		return (seg << 3) | 7;
+	}
+	BUG();
+}
+
+#define CKPT_X86_TIF_UNSUPPORTED   (_TIF_SECCOMP | _TIF_IO_BITMAP)
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static int may_checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	if (t->thread.vm86_info) {
+		ckpt_write_err(ctx, "TE", "task in VM86 mode", -EBUSY);
+		return -EBUSY;
+	}
+	if (task_thread_info(t)->flags & CKPT_X86_TIF_UNSUPPORTED) {
+		ckpt_write_err(ctx, "TE", "bad thread info flags %#lx", -EBUSY);
+		return -EBUSY;
+	}
+	return 0;
+}
+
+/* dump the thread_struct of a given task */
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_thread *h;
+	int tls_size;
+	int ret;
+
+	ret = may_checkpoint_thread(ctx, t);
+	if (ret < 0)
+		return ret;
+
+	tls_size = sizeof(t->thread.tls_array);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD);
+	if (!h)
+		return -ENOMEM;
+
+	h->thread_info_flags =
+		task_thread_info(t)->flags & ~CKPT_X86_TIF_UNSUPPORTED;
+	h->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	h->sizeof_tls_array = tls_size;
+
+	/* For simplicity dump the entire array */
+	memcpy(h + 1, t->thread.tls_array, tls_size);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#ifdef CONFIG_X86_32
+
+static void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+	unsigned long _gs;
+
+	h->bp = regs->bp;
+	h->bx = regs->bx;
+	h->ax = regs->ax;
+	h->cx = regs->cx;
+	h->dx = regs->dx;
+	h->si = regs->si;
+	h->di = regs->di;
+	h->orig_ax = regs->orig_ax;
+	h->ip = regs->ip;
+
+	h->flags = regs->flags;
+	h->sp = regs->sp;
+
+	h->cs = encode_segment(regs->cs);
+	h->ss = encode_segment(regs->ss);
+	h->ds = encode_segment(regs->ds);
+	h->es = encode_segment(regs->es);
+
+	/*
+	 * for checkpoint in process context (from within a container)
+	 * the GS segment register should be saved from the hardware;
+	 * otherwise it is already saved on the thread structure
+	 */
+	if (t == current)
+		_gs = get_user_gs(regs);
+	else
+		_gs = thread->gs;
+
+	h->fsindex = encode_segment(regs->fs);
+	h->gsindex = encode_segment(_gs);
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into the orig_eax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON(h->orig_ax < 0);
+		h->ax = 0;
+	}
+}
+
+static void save_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* debug regs */
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * get the actual registers; otherwise get the saved values.
+	 */
+
+	if (t == current) {
+		get_debugreg(h->debugreg0, 0);
+		get_debugreg(h->debugreg1, 1);
+		get_debugreg(h->debugreg2, 2);
+		get_debugreg(h->debugreg3, 3);
+		get_debugreg(h->debugreg6, 6);
+		get_debugreg(h->debugreg7, 7);
+	} else {
+		h->debugreg0 = thread->debugreg0;
+		h->debugreg1 = thread->debugreg1;
+		h->debugreg2 = thread->debugreg2;
+		h->debugreg3 = thread->debugreg3;
+		h->debugreg6 = thread->debugreg6;
+		h->debugreg7 = thread->debugreg7;
+	}
+}
+
+static void save_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	h->used_math = tsk_used_math(t) ? 1 : 0;
+}
+
+static int checkpoint_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, xstate_size + sizeof(*h),
+			      CKPT_HDR_CPU_FPU);
+	if (!h)
+		return -ENOMEM;
+
+	/* i387 + MMX + SSE logic */
+	preempt_disable();	/* needed if (t == current) */
+
+	/*
+	 * normally, no need to unlazy_fpu(), since TS_USEDFPU flag
+	 * was cleared when task was context-switched out...
+	 * except if we are in process context, in which case we do
+	 */
+	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
+		unlazy_fpu(current);
+
+	/*
+	 * For simplicity dump the entire structure.
+	 * FIX: need to be deliberate about what registers we are
+	 * dumping for traceability and compatibility.
+	 */
+	memcpy(h + 1, t->thread.xstate, xstate_size);
+	preempt_enable();	/* needed if (t == current) */
+
+	ret = ckpt_write_obj(ctx, h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+#endif	/* CONFIG_X86_32 */
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (!h)
+		return -ENOMEM;
+
+	save_cpu_regs(h, t);
+	save_cpu_debug(h, t);
+	save_cpu_fpu(h, t);
+
+	ckpt_debug("math %d debug %d\n", h->used_math, !!h->debugreg7);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	if (h->used_math)
+		ret = checkpoint_cpu_fpu(ctx, t);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (!h)
+		return -ENOMEM;
+
+	/* FPU capabilities */
+	h->has_fxsr = cpu_has_fxsr;
+	h->has_xsave = cpu_has_xsave;
+	h->xstate_size = xstate_size;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/* read the thread_struct into the current task */
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_thread *h;
+	struct thread_struct *thread = &current->thread;
+	struct desc_struct *desc;
+	int tls_size;
+	int i, cpu, ret;
+
+	tls_size = sizeof(thread->tls_array);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->thread_info_flags & CKPT_X86_TIF_UNSUPPORTED)
+		goto out;
+	if (h->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES)
+		goto out;
+	if (h->sizeof_tls_array != tls_size)
+		goto out;
+
+	/*
+	 * restore TLS by hand: why convert to struct user_desc if
+	 * sys_set_thread_area() will convert it back ?
+	 */
+	desc = (struct desc_struct *) (h + 1);
+
+	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
+		if (!check_tls(&desc[i]))
+			goto out;
+	}
+
+	cpu = get_cpu();
+	memcpy(thread->tls_array, desc, tls_size);
+	load_TLS(thread, cpu);
+	put_cpu();
+
+	/* TODO: restore TIF flags as necessary (e.g. TIF_NOTSC) */
+
+	ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#ifdef CONFIG_X86_32
+
+static int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	if (!check_eflags(h->flags))
+		return -EINVAL;
+	if (h->cs == CKPT_X86_SEG_NULL)
+		return -EINVAL;
+	if (!check_segment(h->cs) || !check_segment(h->ds) ||
+	    !check_segment(h->es) || !check_segment(h->ss) ||
+	    !check_segment(h->fsindex) || !check_segment(h->gsindex))
+		return -EINVAL;
+
+	regs->bp = h->bp;
+	regs->bx = h->bx;
+	regs->ax = h->ax;
+	regs->cx = h->cx;
+	regs->dx = h->dx;
+	regs->si = h->si;
+	regs->di = h->di;
+	regs->orig_ax = h->orig_ax;
+	regs->ip = h->ip;
+
+	regs->flags = h->flags;
+	regs->sp = h->sp;
+
+	regs->ds = decode_segment(h->ds);
+	regs->es = decode_segment(h->es);
+	regs->cs = decode_segment(h->cs);
+	regs->ss = decode_segment(h->ss);
+
+	regs->fs = decode_segment(h->fsindex);
+	regs->gs = decode_segment(h->gsindex);
+
+	thread->gs = regs->gs;
+	lazy_load_gs(regs->gs);
+
+	return 0;
+}
+
+static int load_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	int ret;
+
+	ret = ptrace_check_debugreg(1, h->debugreg0, h->debugreg1, h->debugreg2,
+				    h->debugreg3, h->debugreg6, h->debugreg7);
+	if (ret < 0)
+		return ret;
+
+	set_debugreg(h->debugreg0, 0);
+	set_debugreg(h->debugreg1, 1);
+	/* ignore 4, 5 */
+	set_debugreg(h->debugreg2, 2);
+	set_debugreg(h->debugreg3, 3);
+	set_debugreg(h->debugreg6, 6);
+	set_debugreg(h->debugreg7, 7);
+
+	if (h->debugreg7)
+		set_tsk_thread_flag(t, TIF_DEBUG);
+	else
+		clear_tsk_thread_flag(t, TIF_DEBUG);
+
+	return 0;
+}
+
+static int load_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	preempt_disable();
+
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!h->used_math)
+		clear_used_math();
+
+	preempt_enable();
+	return 0;
+}
+
+static int restore_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	/* init_fpu() eventually also calls set_used_math() */
+	ret = init_fpu(current);
+	if (ret < 0)
+		return ret;
+
+	h = ckpt_read_obj_type(ctx, xstate_size + sizeof(*h),
+			       CKPT_HDR_CPU_FPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	memcpy(t->thread.xstate, h + 1, xstate_size);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#endif	/* CONFIG_X86_32 */
+
+/* read the cpu state and registers for the current task */
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *h;
+	struct task_struct *t = current;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("math %d debug %d\n", h->used_math, !!h->debugreg7);
+
+	ret = load_cpu_regs(h, t);
+	if (ret < 0)
+		goto out;
+	ret = load_cpu_debug(h, t);
+	if (ret < 0)
+		goto out;
+	ret = load_cpu_fpu(h, t);
+	if (ret < 0)
+		goto out;
+
+	if (h->used_math)
+		ret = restore_cpu_fpu(ctx, t);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* FIX: verify compatibility of architecture features */
+
+	/* verify FPU capabilities */
+	if (h->has_fxsr != cpu_has_fxsr ||
+	    h->has_xsave != cpu_has_xsave ||
+	    h->xstate_size != xstate_size)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 57eb7d8..e228151 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -300,6 +300,8 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx)
 	do_gettimeofday(&ktv);
 	uts = utsname();
 
+	h->arch_id = cpu_to_le16(CKPT_ARCH_ID);  /* see asm/checkpoint_hdr.h */
+
 	h->magic = CHECKPOINT_MAGIC_HEAD;
 	h->major = (LINUX_VERSION_CODE >> 16) & 0xff;
 	h->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
@@ -327,7 +329,10 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx)
 	ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine));
  up:
 	up_read(&uts_sem);
-	return ret;
+	if (ret < 0)
+		return ret;
+
+	return checkpoint_write_header_arch(ctx);
 }
 
 /* write the checkpoint trailer */
diff --git a/checkpoint/process.c b/checkpoint/process.c
index d221c2a..1d1170c 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -57,6 +57,15 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = checkpoint_task_struct(ctx, t);
 	ckpt_debug("task %d\n", ret);
 
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_thread(ctx, t);
+	ckpt_debug("thread %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_cpu(ctx, t);
+	ckpt_debug("cpu %d\n", ret);
+ out:
 	ctx->tsk = NULL;
 	return ret;
 }
@@ -97,6 +106,14 @@ int restore_task(struct ckpt_ctx *ctx)
 
 	ret = restore_task_struct(ctx);
 	ckpt_debug("task %d\n", ret);
-
+	if (ret < 0)
+		goto out;
+	ret = restore_thread(ctx);
+	ckpt_debug("thread %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = restore_cpu(ctx);
+	ckpt_debug("cpu %d\n", ret);
+ out:
 	return ret;
 }
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 9f08f4d..afe51c2 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -354,6 +354,8 @@ static int restore_read_header(struct ckpt_ctx *ctx)
 		return PTR_ERR(h);
 
 	ret = -EINVAL;
+	if (le16_to_cpu(h->arch_id) != CKPT_ARCH_ID)
+		goto out;
 	if (h->magic != CHECKPOINT_MAGIC_HEAD ||
 	    h->rev != CHECKPOINT_VERSION ||
 	    h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
@@ -382,6 +384,10 @@ static int restore_read_header(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 	ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine));
+	if (ret < 0)
+		goto out;
+
+	ret = restore_read_header_arch(ctx);
  out:
 	kfree(uts);
 	ckpt_hdr_put(ctx, h);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index be0ba4b..aa8ce11 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -61,6 +61,15 @@ extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
 extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task(struct ckpt_ctx *ctx);
 
+/* arch hooks */
+extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx);
+extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+
+extern int restore_read_header_arch(struct ckpt_ctx *ctx);
+extern int restore_thread(struct ckpt_ctx *ctx);
+extern int restore_cpu(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 22dadbd..92d082e 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -40,19 +40,33 @@ struct ckpt_hdr {
 	__u32 len;
 } __attribute__((aligned(8)));
 
+
+#include <asm/checkpoint_hdr.h>
+
+
 /* header types */
 enum {
 	CKPT_HDR_HEADER = 1,
+	CKPT_HDR_HEADER_ARCH,
 	CKPT_HDR_BUFFER,
 	CKPT_HDR_STRING,
 
 	CKPT_HDR_TASK = 101,
+	CKPT_HDR_THREAD,
+	CKPT_HDR_CPU,
+
+	/* 201-299: reserved for arch-dependent */
 
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
 };
 
+/* architecture */
+enum {
+	CKPT_ARCH_X86_32 = 1,
+};
+
 /* kernel constants */
 struct ckpt_const {
 	/* task */
@@ -68,7 +82,7 @@ struct ckpt_hdr_header {
 	struct ckpt_hdr h;
 	__u64 magic;
 
-	__u16 _padding;
+	__u16 arch_id;
 
 	__u16 major;
 	__u16 minor;
-- 
1.6.0.4



* [PATCH v18 21/80] c/r: x86_32 support for checkpoint/restart
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

In addition, architecture capabilities are saved in an
architecture-specific extension of the header (ckpt_hdr_header_arch);
currently this includes only FPU capabilities.

Currently only x86-32 is supported.

Changelog[v17]:
  - Fix compilation for architectures that don't support checkpoint
  - Validate cpu registers and TLS descriptors on restart
  - Validate debug registers on restart
  - Export asm/checkpoint_hdr.h to userspace
Changelog[v16]:
  - All objects are preceded by ckpt_hdr (TLS and xstate_buf)
  - Add architecture identifier to main header
Changelog[v14]:
  - Use new interface ckpt_hdr_get/put()
  - Embed struct ckpt_hdr in struct ckpt_hdr...
  - Remove preempt_disable/enable() around init_fpu() and fix leak
  - Revert change to pr_debug(), back to ckpt_debug()
  - Move code related to task_struct to checkpoint/process.c
Changelog[v12]:
  - A couple of missed calls to ckpt_hbuf_put()
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
  - Add arch-specific header that details architecture capabilities;
    split FPU restore to send capabilities only once.
  - Test for zero TLS entries in ckpt_write_thread()
  - Fix asm/checkpoint_hdr.h so it can be included from user-space
Changelog[v7]:
  - Fix save/restore state of FPU
Changelog[v5]:
  - Remove preempt_disable() when restoring debug registers
Changelog[v4]:
  - Fix header structure alignment
Changelog[v2]:
  - Pad header structures to 64 bits to ensure compatibility
  - Follow Dave Hansen's refactoring of the original post

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/include/asm/Kbuild           |    1 +
 arch/x86/include/asm/checkpoint_hdr.h |  121 ++++++++
 arch/x86/include/asm/ptrace.h         |    5 +
 arch/x86/kernel/ptrace.c              |    8 +-
 arch/x86/mm/Makefile                  |    2 +
 arch/x86/mm/checkpoint.c              |  531 +++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c               |    7 +-
 checkpoint/process.c                  |   19 ++-
 checkpoint/restart.c                  |    6 +
 include/linux/checkpoint.h            |    9 +
 include/linux/checkpoint_hdr.h        |   16 +-
 11 files changed, 718 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
 create mode 100644 arch/x86/mm/checkpoint.c

diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4a8e80c..f76cb6e 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -2,6 +2,7 @@ include include/asm-generic/Kbuild.asm
 
 header-y += boot.h
 header-y += bootparam.h
+header-y += checkpoint_hdr.h
 header-y += debugreg.h
 header-y += ldt.h
 header-y += msr-index.h
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..f4d1e14
--- /dev/null
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -0,0 +1,121 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#error asm/checkpoint_hdr.h included directly
+#endif
+
+#include <linux/types.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/* i387 structure seen from kernel/userspace */
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+#ifdef CONFIG_X86_32
+#define CKPT_ARCH_ID	CKPT_ARCH_X86_32
+#endif
+
+/* arch dependent header types */
+enum {
+	CKPT_HDR_CPU_FPU = 201,
+};
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+	/* FIXME: add HAVE_HWFP */
+	__u16 has_fxsr;
+	__u16 has_xsave;
+	__u16 xstate_size;
+	__u16 _padding;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_thread {
+	struct ckpt_hdr h;
+	__u32 thread_info_flags;
+	__u16 gdt_entry_tls_entries;
+	__u16 sizeof_tls_array;
+} __attribute__((aligned(8)));
+
+/* designed to work for both x86_32 and x86_64 */
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	/* see struct pt_regs (x86_64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 sp;
+
+	__u64 flags;
+
+	/* segment registers */
+	__u64 fs;
+	__u64 gs;
+
+	__u16 fsindex;
+	__u16 gsindex;
+	__u16 cs;
+	__u16 ss;
+	__u16 ds;
+	__u16 es;
+
+	__u32 used_math;
+
+	/* debug registers */
+	__u64 debugreg0;
+	__u64 debugreg1;
+	__u64 debugreg2;
+	__u64 debugreg3;
+	__u64 debugreg6;
+	__u64 debugreg7;
+
+	/* thread_xstate contents follow (if used_math) */
+} __attribute__((aligned(8)));
+
+#define CKPT_X86_SEG_NULL	0
+#define CKPT_X86_SEG_USER32_CS	1
+#define CKPT_X86_SEG_USER32_DS	2
+#define CKPT_X86_SEG_TLS	0x4000	/* 0100 0000 0000 00xx */
+#define CKPT_X86_SEG_LDT	0x8000	/* 100x xxxx xxxx xxxx */
+
+#endif /* __ASM_X86_CKPT_HDR_H */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 0f0d908..66b507b 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -242,6 +242,11 @@ extern void ptrace_bts_untrace(struct task_struct *tsk);
 #define arch_ptrace_untrace(tsk)	ptrace_bts_untrace(tsk)
 #endif /* CONFIG_X86_PTRACE_BTS */
 
+extern int ptrace_check_debugreg(int _32bit,
+				 unsigned long dr0, unsigned long dr1,
+				 unsigned long dr2, unsigned long dr3,
+				 unsigned long dr6, unsigned long dr7);
+
 #endif /* __KERNEL__ */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9b4cacf..3b434bd 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -466,10 +466,10 @@ static unsigned long ptrace_get_debugreg(struct task_struct *child, int n)
 	return 0;
 }
 
-static int ptrace_check_debugreg(int _32bit,
-				 unsigned long dr0, unsigned long dr1,
-				 unsigned long dr2, unsigned long dr3,
-				 unsigned long dr6, unsigned long dr7)
+int ptrace_check_debugreg(int _32bit,
+			  unsigned long dr0, unsigned long dr1,
+			  unsigned long dr2, unsigned long dr3,
+			  unsigned long dr6, unsigned long dr7)
 {
 	/* Breakpoint type: 00: --x, 01: -w-, 10: undefined, 11: rw- */
 	unsigned int rw[4];
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index eefdeee..ddd5abb 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -21,3 +21,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..023039f
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,531 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * helpers to encode/decode/validate registers/segments/eflags
+ */
+
+static int check_eflags(__u32 eflags)
+{
+#define X86_EFLAGS_CKPT_MASK  \
+	(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF | \
+	 X86_EFLAGS_SF | X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_OF | \
+	 X86_EFLAGS_NT | X86_EFLAGS_AC | X86_EFLAGS_ID)
+
+	if ((eflags & ~X86_EFLAGS_CKPT_MASK) != (X86_EFLAGS_IF | 0x2))
+		return 0;
+	return 1;
+}
+
+static int check_tls(struct desc_struct *desc)
+{
+	if (!desc->a && !desc->b)
+		return 1;
+	if (desc->l != 0 || desc->s != 1 || desc->dpl != 3)
+		return 0;
+	return 1;
+}
+
+static int check_segment(__u16 seg)
+{
+	int ret = 0;
+
+	switch (seg) {
+	case CKPT_X86_SEG_NULL:
+	case CKPT_X86_SEG_USER32_CS:
+	case CKPT_X86_SEG_USER32_DS:
+		return 1;
+	}
+	if (seg & CKPT_X86_SEG_TLS) {
+		seg &= ~CKPT_X86_SEG_TLS;
+		if (seg <= GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN)
+			ret = 1;
+	} else if (seg & CKPT_X86_SEG_LDT) {
+		seg &= ~CKPT_X86_SEG_LDT;
+		if (seg <= 0x1fff)
+			ret = 1;
+	}
+	return ret;
+}
+
+static __u16 encode_segment(unsigned short seg)
+{
+	if (seg == 0)
+		return CKPT_X86_SEG_NULL;
+	BUG_ON((seg & 3) != 3);
+
+	if (seg == __USER_CS)
+		return CKPT_X86_SEG_USER32_CS;
+	if (seg == __USER_DS)
+		return CKPT_X86_SEG_USER32_DS;
+
+	if (seg & 4)
+		return CKPT_X86_SEG_LDT | (seg >> 3);
+
+	seg >>= 3;
+	if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX)
+		return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN);
+
+	printk(KERN_ERR "c/r: (encode) bad segment %#hx\n", seg);
+	BUG();
+}
+
+static unsigned short decode_segment(__u16 seg)
+{
+	if (seg == CKPT_X86_SEG_NULL)
+		return 0;
+	if (seg == CKPT_X86_SEG_USER32_CS)
+		return __USER_CS;
+	if (seg == CKPT_X86_SEG_USER32_DS)
+		return __USER_DS;
+
+	if (seg & CKPT_X86_SEG_TLS) {
+		seg &= ~CKPT_X86_SEG_TLS;
+		return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3;
+	}
+	if (seg & CKPT_X86_SEG_LDT) {
+		seg &= ~CKPT_X86_SEG_LDT;
+		return (seg << 3) | 7;
+	}
+	BUG();
+}
+
+#define CKPT_X86_TIF_UNSUPPORTED   (_TIF_SECCOMP | _TIF_IO_BITMAP)
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static int may_checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	if (t->thread.vm86_info) {
+		ckpt_write_err(ctx, "TE", "task in VM86 mode", -EBUSY);
+		return -EBUSY;
+	}
+	if (task_thread_info(t)->flags & CKPT_X86_TIF_UNSUPPORTED) {
+		ckpt_write_err(ctx, "TE", "bad thread info flags", -EBUSY);
+		return -EBUSY;
+	}
+	return 0;
+}
+
+/* dump the thread_struct of a given task */
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_thread *h;
+	int tls_size;
+	int ret;
+
+	ret = may_checkpoint_thread(ctx, t);
+	if (ret < 0)
+		return ret;
+
+	tls_size = sizeof(t->thread.tls_array);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD);
+	if (!h)
+		return -ENOMEM;
+
+	h->thread_info_flags =
+		task_thread_info(t)->flags & ~CKPT_X86_TIF_UNSUPPORTED;
+	h->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	h->sizeof_tls_array = tls_size;
+
+	/* For simplicity dump the entire array */
+	memcpy(h + 1, t->thread.tls_array, tls_size);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#ifdef CONFIG_X86_32
+
+static void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+	unsigned long _gs;
+
+	h->bp = regs->bp;
+	h->bx = regs->bx;
+	h->ax = regs->ax;
+	h->cx = regs->cx;
+	h->dx = regs->dx;
+	h->si = regs->si;
+	h->di = regs->di;
+	h->orig_ax = regs->orig_ax;
+	h->ip = regs->ip;
+
+	h->flags = regs->flags;
+	h->sp = regs->sp;
+
+	h->cs = encode_segment(regs->cs);
+	h->ss = encode_segment(regs->ss);
+	h->ds = encode_segment(regs->ds);
+	h->es = encode_segment(regs->es);
+
+	/*
+	 * for checkpoint in process context (from within a container)
+	 * the GS segment register should be saved from the hardware;
+	 * otherwise it is already saved on the thread structure
+	 */
+	if (t == current)
+		_gs = get_user_gs(regs);
+	else
+		_gs = thread->gs;
+
+	h->fsindex = encode_segment(regs->fs);
+	h->gsindex = encode_segment(_gs);
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into ax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON(h->orig_ax < 0);
+		h->ax = 0;
+	}
+}
+
+static void save_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* debug regs */
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * get the actual registers; otherwise get the saved values.
+	 */
+
+	if (t == current) {
+		get_debugreg(h->debugreg0, 0);
+		get_debugreg(h->debugreg1, 1);
+		get_debugreg(h->debugreg2, 2);
+		get_debugreg(h->debugreg3, 3);
+		get_debugreg(h->debugreg6, 6);
+		get_debugreg(h->debugreg7, 7);
+	} else {
+		h->debugreg0 = thread->debugreg0;
+		h->debugreg1 = thread->debugreg1;
+		h->debugreg2 = thread->debugreg2;
+		h->debugreg3 = thread->debugreg3;
+		h->debugreg6 = thread->debugreg6;
+		h->debugreg7 = thread->debugreg7;
+	}
+}
+
+static void save_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	h->used_math = tsk_used_math(t) ? 1 : 0;
+}
+
+static int checkpoint_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, xstate_size + sizeof(*h),
+			      CKPT_HDR_CPU_FPU);
+	if (!h)
+		return -ENOMEM;
+
+	/* i387 + MMX + SSE logic */
+	preempt_disable();	/* needed if (t == current) */
+
+	/*
+	 * normally, no need to unlazy_fpu(), since TS_USEDFPU flag
+	 * was cleared when task was context-switched out...
+	 * except if we are in process context, in which case we do
+	 */
+	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
+		unlazy_fpu(current);
+
+	/*
+	 * For simplicity dump the entire structure.
+	 * FIX: need to be deliberate about what registers we are
+	 * dumping for traceability and compatibility.
+	 */
+	memcpy(h + 1, t->thread.xstate, xstate_size);
+	preempt_enable();	/* needed if (t == current) */
+
+	ret = ckpt_write_obj(ctx, h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+#endif	/* CONFIG_X86_32 */
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (!h)
+		return -ENOMEM;
+
+	save_cpu_regs(h, t);
+	save_cpu_debug(h, t);
+	save_cpu_fpu(h, t);
+
+	ckpt_debug("math %d debug %d\n", h->used_math, !!h->debugreg7);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	if (h->used_math)
+		ret = checkpoint_cpu_fpu(ctx, t);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (!h)
+		return -ENOMEM;
+
+	/* FPU capabilities */
+	h->has_fxsr = cpu_has_fxsr;
+	h->has_xsave = cpu_has_xsave;
+	h->xstate_size = xstate_size;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/* read the thread_struct into the current task */
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_thread *h;
+	struct thread_struct *thread = &current->thread;
+	struct desc_struct *desc;
+	int tls_size;
+	int i, cpu, ret;
+
+	tls_size = sizeof(thread->tls_array);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->thread_info_flags & CKPT_X86_TIF_UNSUPPORTED)
+		goto out;
+	if (h->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES)
+		goto out;
+	if (h->sizeof_tls_array != tls_size)
+		goto out;
+
+	/*
+	 * restore TLS by hand: why convert to struct user_desc if
+	 * sys_set_thread_area() will convert it back ?
+	 */
+	desc = (struct desc_struct *) (h + 1);
+
+	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
+		if (!check_tls(&desc[i]))
+			goto out;
+	}
+
+	cpu = get_cpu();
+	memcpy(thread->tls_array, desc, tls_size);
+	load_TLS(thread, cpu);
+	put_cpu();
+
+	/* TODO: restore TIF flags as necessary (e.g. TIF_NOTSC) */
+
+	ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#ifdef CONFIG_X86_32
+
+static int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	if (!check_eflags(h->flags))
+		return -EINVAL;
+	if (h->cs == CKPT_X86_SEG_NULL)
+		return -EINVAL;
+	if (!check_segment(h->cs) || !check_segment(h->ds) ||
+	    !check_segment(h->es) || !check_segment(h->ss) ||
+	    !check_segment(h->fsindex) || !check_segment(h->gsindex))
+		return -EINVAL;
+
+	regs->bp = h->bp;
+	regs->bx = h->bx;
+	regs->ax = h->ax;
+	regs->cx = h->cx;
+	regs->dx = h->dx;
+	regs->si = h->si;
+	regs->di = h->di;
+	regs->orig_ax = h->orig_ax;
+	regs->ip = h->ip;
+
+	regs->flags = h->flags;
+	regs->sp = h->sp;
+
+	regs->ds = decode_segment(h->ds);
+	regs->es = decode_segment(h->es);
+	regs->cs = decode_segment(h->cs);
+	regs->ss = decode_segment(h->ss);
+
+	regs->fs = decode_segment(h->fsindex);
+	regs->gs = decode_segment(h->gsindex);
+
+	thread->gs = regs->gs;
+	lazy_load_gs(regs->gs);
+
+	return 0;
+}
+
+static int load_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	int ret;
+
+	ret = ptrace_check_debugreg(1, h->debugreg0, h->debugreg1, h->debugreg2,
+				    h->debugreg3, h->debugreg6, h->debugreg7);
+	if (ret < 0)
+		return ret;
+
+	set_debugreg(h->debugreg0, 0);
+	set_debugreg(h->debugreg1, 1);
+	/* ignore 4, 5 */
+	set_debugreg(h->debugreg2, 2);
+	set_debugreg(h->debugreg3, 3);
+	set_debugreg(h->debugreg6, 6);
+	set_debugreg(h->debugreg7, 7);
+
+	if (h->debugreg7)
+		set_tsk_thread_flag(t, TIF_DEBUG);
+	else
+		clear_tsk_thread_flag(t, TIF_DEBUG);
+
+	return 0;
+}
+
+static int load_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+	preempt_disable();
+
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!h->used_math)
+		clear_used_math();
+
+	preempt_enable();
+	return 0;
+}
+
+static int restore_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	/* init_fpu() eventually also calls set_used_math() */
+	ret = init_fpu(current);
+	if (ret < 0)
+		return ret;
+
+	h = ckpt_read_obj_type(ctx, xstate_size + sizeof(*h),
+			       CKPT_HDR_CPU_FPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	memcpy(t->thread.xstate, h + 1, xstate_size);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#endif	/* CONFIG_X86_32 */
+
+/* read the cpu state and registers for the current task */
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *h;
+	struct task_struct *t = current;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("math %d debug %d\n", h->used_math, !!h->debugreg7);
+
+	ret = load_cpu_regs(h, t);
+	if (ret < 0)
+		goto out;
+	ret = load_cpu_debug(h, t);
+	if (ret < 0)
+		goto out;
+	ret = load_cpu_fpu(h, t);
+	if (ret < 0)
+		goto out;
+
+	if (h->used_math)
+		ret = restore_cpu_fpu(ctx, t);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* FIX: verify compatibility of architecture features */
+
+	/* verify FPU capabilities */
+	if (h->has_fxsr != cpu_has_fxsr ||
+	    h->has_xsave != cpu_has_xsave ||
+	    h->xstate_size != xstate_size)
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 57eb7d8..e228151 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -300,6 +300,8 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx)
 	do_gettimeofday(&ktv);
 	uts = utsname();
 
+	h->arch_id = cpu_to_le16(CKPT_ARCH_ID);  /* see asm/checkpoint.h */
+
 	h->magic = CHECKPOINT_MAGIC_HEAD;
 	h->major = (LINUX_VERSION_CODE >> 16) & 0xff;
 	h->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
@@ -327,7 +329,10 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx)
 	ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine));
  up:
 	up_read(&uts_sem);
-	return ret;
+	if (ret < 0)
+		return ret;
+
+	return checkpoint_write_header_arch(ctx);
 }
 
 /* write the checkpoint trailer */
diff --git a/checkpoint/process.c b/checkpoint/process.c
index d221c2a..1d1170c 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -57,6 +57,15 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = checkpoint_task_struct(ctx, t);
 	ckpt_debug("task %d\n", ret);
 
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_thread(ctx, t);
+	ckpt_debug("thread %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_cpu(ctx, t);
+	ckpt_debug("cpu %d\n", ret);
+ out:
 	ctx->tsk = NULL;
 	return ret;
 }
@@ -97,6 +106,14 @@ int restore_task(struct ckpt_ctx *ctx)
 
 	ret = restore_task_struct(ctx);
 	ckpt_debug("task %d\n", ret);
-
+	if (ret < 0)
+		goto out;
+	ret = restore_thread(ctx);
+	ckpt_debug("thread %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = restore_cpu(ctx);
+	ckpt_debug("cpu %d\n", ret);
+ out:
 	return ret;
 }
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 9f08f4d..afe51c2 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -354,6 +354,8 @@ static int restore_read_header(struct ckpt_ctx *ctx)
 		return PTR_ERR(h);
 
 	ret = -EINVAL;
+	if (le16_to_cpu(h->arch_id) != CKPT_ARCH_ID)
+		goto out;
 	if (h->magic != CHECKPOINT_MAGIC_HEAD ||
 	    h->rev != CHECKPOINT_VERSION ||
 	    h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
@@ -382,6 +384,10 @@ static int restore_read_header(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 	ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine));
+	if (ret < 0)
+		goto out;
+
+	ret = restore_read_header_arch(ctx);
  out:
 	kfree(uts);
 	ckpt_hdr_put(ctx, h);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index be0ba4b..aa8ce11 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -61,6 +61,15 @@ extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
 extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task(struct ckpt_ctx *ctx);
 
+/* arch hooks */
+extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx);
+extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+
+extern int restore_read_header_arch(struct ckpt_ctx *ctx);
+extern int restore_thread(struct ckpt_ctx *ctx);
+extern int restore_cpu(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 22dadbd..92d082e 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -40,19 +40,33 @@ struct ckpt_hdr {
 	__u32 len;
 } __attribute__((aligned(8)));
 
+
+#include <asm/checkpoint_hdr.h>
+
+
 /* header types */
 enum {
 	CKPT_HDR_HEADER = 1,
+	CKPT_HDR_HEADER_ARCH,
 	CKPT_HDR_BUFFER,
 	CKPT_HDR_STRING,
 
 	CKPT_HDR_TASK = 101,
+	CKPT_HDR_THREAD,
+	CKPT_HDR_CPU,
+
+	/* 201-299: reserved for arch-dependent */
 
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
 };
 
+/* architecture */
+enum {
+	CKPT_ARCH_X86_32 = 1,
+};
+
 /* kernel constants */
 struct ckpt_const {
 	/* task */
@@ -68,7 +82,7 @@ struct ckpt_hdr_header {
 	struct ckpt_hdr h;
 	__u64 magic;
 
-	__u16 _padding;
+	__u16 arch_id;
 
 	__u16 major;
 	__u16 minor;
-- 
1.6.0.4
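
For readers following the segment handling in arch/x86/mm/checkpoint.c
above, here is a minimal userspace sketch of the encode/decode round
trip. The SEG_* values and the GDT constants are assumptions chosen for
illustration (the real definitions live in headers that are not part of
this hunk), seg_encode()/seg_decode() are hypothetical names, and the
helpers return -1 where the kernel code would BUG() or reject the image.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encodings, for illustration only. */
#define SEG_NULL	0x0000
#define SEG_USER32_CS	0x0001
#define SEG_USER32_DS	0x0002
#define SEG_TLS		0x4000	/* flag bit: low bits hold the TLS slot */
#define SEG_LDT		0x8000	/* flag bit: low bits hold the LDT index */

/* Assumed x86_32 GDT layout (TLS entries 6..8, user CS/DS selectors). */
#define TLS_MIN		6
#define TLS_MAX		8
#define USER_CS		0x73
#define USER_DS		0x7b

/* Map a hardware selector to its portable encoding; -1 if unsupported. */
static int seg_encode(uint16_t sel, uint16_t *out)
{
	uint16_t idx;

	if (sel == 0) {
		*out = SEG_NULL;
		return 0;
	}
	if ((sel & 3) != 3)		/* only user (RPL 3) selectors */
		return -1;
	if (sel == USER_CS) {
		*out = SEG_USER32_CS;
		return 0;
	}
	if (sel == USER_DS) {
		*out = SEG_USER32_DS;
		return 0;
	}
	if (sel & 4) {			/* table-indicator bit set: LDT */
		*out = (uint16_t)(SEG_LDT | (sel >> 3));
		return 0;
	}
	idx = sel >> 3;
	if (idx >= TLS_MIN && idx <= TLS_MAX) {
		*out = (uint16_t)(SEG_TLS | (idx - TLS_MIN));
		return 0;
	}
	return -1;
}

/* Validate an encoded value and map it back to a hardware selector. */
static int seg_decode(uint16_t enc, uint16_t *out)
{
	uint16_t idx;

	if (enc == SEG_NULL) {
		*out = 0;
		return 0;
	}
	if (enc == SEG_USER32_CS) {
		*out = USER_CS;
		return 0;
	}
	if (enc == SEG_USER32_DS) {
		*out = USER_DS;
		return 0;
	}
	if (enc & SEG_TLS) {
		idx = (uint16_t)(enc & ~SEG_TLS);
		if (idx > TLS_MAX - TLS_MIN)
			return -1;
		*out = (uint16_t)(((TLS_MIN + idx) << 3) | 3);
		return 0;
	}
	if (enc & SEG_LDT) {
		idx = (uint16_t)(enc & ~SEG_LDT);
		if (idx > 0x1fff)
			return -1;
		*out = (uint16_t)((idx << 3) | 7);
		return 0;
	}
	return -1;
}
```

Under these assumed constants, a TLS selector such as 0x3b (GDT entry 7,
RPL 3) encodes to SEG_TLS | 1 and decodes back to 0x3b, so the image
carries the TLS slot number rather than a raw GDT index.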

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 22/80] c/r: external checkpoint of a task other than ourself
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Now we can do "external" checkpoint, i.e. act on another task.

sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints the corresponding task. That task should be the root of
a container, unless the CHECKPOINT_SUBTREE flag is given.

Set the state of the freezer cgroup of the checkpointed task hierarchy
to "CHECKPOINTING" during a checkpoint, to ensure that the task(s)
cannot be thawed while the checkpoint is in progress.

Ensure that all tasks belong to the root task's freezer cgroup (the
root task is also tested, to detect if it changes its freezer cgroup
before it moves to "CHECKPOINTING").

sys_restart() remains nearly the same, as the restart is always done
in the context of the restarting task. However, the original task may
have been frozen from user space, or interrupted from a syscall for
the checkpoint. This is accounted for by restoring a suitable retval
for the restarting task, according to how it was checkpointed.
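
As a rough illustration of the retval handling described above, the
decision can be modeled in plain C. This is only a sketch under stated
assumptions: struct saved_regs and pick_retval() are hypothetical names,
and the model assumes the saved syscall number reads as negative when
the task was not inside a syscall at checkpoint time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the registers saved in the checkpoint image. */
struct saved_regs {
	long syscall_nr;	/* >= 0 inside a syscall, negative otherwise */
	long ax;		/* return-value register as it was saved */
};

struct retval_decision {
	long retval;		/* what the restart syscall should return */
	bool fake_signal;	/* whether to force the signal path to run */
};

/*
 * The saved register value is always returned as-is. Only a failed
 * syscall (possibly one of the -ERESTART... codes) needs the signal
 * handling code to kick in so that the error gets interpreted.
 */
static struct retval_decision pick_retval(const struct saved_regs *r)
{
	struct retval_decision d;

	d.retval = r->ax;
	d.fake_signal = (r->syscall_nr >= 0 && r->ax < 0);
	return d;
}
```

A task frozen inside a syscall that failed gets its error back with the
signal path forced, a task frozen in userspace simply gets its old
register value back, and a self-checkpointed task gets the 0 that was
stored at checkpoint time.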

Changelog[v17]:
  - Move restore_retval() to this patch
  - Tighten ptrace check for checkpoint to PTRACE_MODE_ATTACH
  - Use CHECKPOINTING state for hierarchy's freezer for checkpoint
Changelog[v16]:
  - Use CHECKPOINT_SUBTREE to allow subtree (partial container)
Changelog[v14]:
  - Refuse non-self checkpoint if target task isn't frozen
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them
Changelog[v10]:
  - Grab vfs root of container init, rather than current process

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Kconfig               |    1 +
 checkpoint/checkpoint.c          |   98 +++++++++++++++++++++++++++++++++++++-
 checkpoint/restart.c             |   61 +++++++++++++++++++++++-
 checkpoint/sys.c                 |   10 ++++
 include/linux/checkpoint_types.h |    7 ++-
 5 files changed, 174 insertions(+), 3 deletions(-)

diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
index ef7d406..21fc86b 100644
--- a/checkpoint/Kconfig
+++ b/checkpoint/Kconfig
@@ -5,6 +5,7 @@
 config CHECKPOINT
 	bool "Checkpoint/restart (EXPERIMENTAL)"
 	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+	depends on CGROUP_FREEZER
 	help
 	  Application checkpoint/restart is the ability to save the
 	  state of a running application so that it can later resume
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e228151..ad89f50 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -12,6 +12,9 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/freezer.h>
+#include <linux/ptrace.h>
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -352,14 +355,105 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	if (t->state == TASK_DEAD) {
+		__ckpt_write_err(ctx, "TE", "task state TASK_DEAD", -EBUSY);
+		return -EBUSY;
+	}
+
+	if (!ptrace_may_access(t, PTRACE_MODE_ATTACH)) {
+		__ckpt_write_err(ctx, "TE", "ptrace attach denied", -EPERM);
+		return -EPERM;
+	}
+
+	/* verify that all tasks belong to the same freezer cgroup */
+	if (t != current && !in_same_cgroup_freezer(t, ctx->root_freezer)) {
+		__ckpt_write_err(ctx, "TE", "unfrozen or wrong cgroup", -EBUSY);
+		return -EBUSY;
+	}
+
+	/* FIX: add support for ptraced tasks */
+	if (task_ptrace(t)) {
+		__ckpt_write_err(ctx, "TE", "task is ptraced", -EBUSY);
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+/* setup checkpoint-specific parts of ctx */
+static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task;
+	struct nsproxy *nsproxy;
+	int ret;
+
+	/*
+	 * No need for explicit cleanup here, because if an error
+	 * occurs then ckpt_ctx_free() is eventually called.
+	 */
+
+	ctx->root_pid = pid;
+
+	/* root task */
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+	if (!task)
+		return -ESRCH;
+	else
+		ctx->root_task = task;
+
+	/* root nsproxy */
+	rcu_read_lock();
+	nsproxy = task_nsproxy(task);
+	if (nsproxy)
+		get_nsproxy(nsproxy);
+	rcu_read_unlock();
+	if (!nsproxy)
+		return -ESRCH;
+	else
+		ctx->root_nsproxy = nsproxy;
+
+	/* root freezer */
+	ctx->root_freezer = task;
+	get_task_struct(task);
+
+	ret = may_checkpoint_task(ctx, task);
+	if (ret) {
+		ckpt_write_err(ctx, "", NULL);
+		put_task_struct(task);
+		put_task_struct(task);
+		put_nsproxy(nsproxy);
+		ctx->root_nsproxy = NULL;
+		ctx->root_task = ctx->root_freezer = NULL;
+		return ret;
+	}
+
+	return 0;
+}
+
 long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 {
 	long ret;
 
+	ret = init_checkpoint_ctx(ctx, pid);
+	if (ret < 0)
+		return ret;
+
+	if (ctx->root_freezer) {
+		ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
+		if (ret < 0)
+			return ret;
+	}
+
 	ret = checkpoint_write_header(ctx);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_task(ctx, current);
+	ret = checkpoint_task(ctx, ctx->root_task);
 	if (ret < 0)
 		goto out;
 	ret = checkpoint_write_tail(ctx);
@@ -370,5 +464,7 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	ctx->crid = atomic_inc_return(&ctx_count);
 	ret = ctx->crid;
  out:
+	if (ctx->root_freezer)
+		cgroup_freezer_end_checkpoint(ctx->root_freezer);
 	return ret;
 }
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index afe51c2..cb02ffb 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -411,10 +411,67 @@ static int restore_read_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static long restore_retval(void)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	long ret;
+
+	/*
+	 * For the restart, we entered the kernel via sys_restart(),
+	 * so our return path is via the syscall exit. In particular,
+	 * the code in entry.S will put the value that we will return
+	 * into a register (e.g. regs->eax in x86), thus passing it to
+	 * the caller task.
+	 *
+	 * What we do now depends on what happened to the checkpointed
+	 * task right before the checkpoint - there are three cases:
+	 *
+	 * 1) It was carrying out a syscall when it became frozen, or
+	 * 2) It was running in userspace, or
+	 * 3) It was doing a self-checkpoint
+	 *
+	 * In case #1, if the syscall succeeded, perhaps partially,
+	 * then the retval is non-negative. If it failed, the error
+	 * may be one of -ERESTART..., which is interpreted in the
+	 * signal handling code. If that is the case, we force the
+	 * signal handler to kick in by faking a signal to ourselves
+	 * (a la freeze/thaw) when ret < 0.
+	 *
+	 * In case #2, our return value will overwrite the original
+	 * value in the affected register. Work around this by simply
+	 * using the saved value of that register as our retval.
+	 *
+	 * In case #3, the state was recorded while the task was in
+	 * the checkpoint(2) syscall. The syscall is expected to return
+	 * 0 when returning from a restart. Fortunately, this already
+	 * has been arranged for at checkpoint time (the register that
+	 * holds the retval, e.g. regs->eax in x86, was set to
+	 * zero).
+	 */
+
+	/* needed for all 3 cases: get old value/error/retval */
+	ret = syscall_get_return_value(current, regs);
+
+	/* if from a syscall and returning error, kick in signal handling */
+	if (syscall_get_nr(current, regs) >= 0 && ret < 0)
+		set_tsk_thread_flag(current, TIF_SIGPENDING);
+
+	return ret;
+}
+
+/* setup restart-specific parts of ctx */
+static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+	return 0;
+}
+
 long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
 	long ret;
 
+	ret = init_restart_ctx(ctx, pid);
+	if (ret < 0)
+		return ret;
 	ret = restore_read_header(ctx);
 	if (ret < 0)
 		return ret;
@@ -422,7 +479,9 @@ long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		return ret;
 	ret = restore_read_tail(ctx);
+	if (ret < 0)
+		return ret;
 
 	/* on success, adjust the return value if needed [TODO] */
-	return ret;
+	return restore_retval();
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 7f6f71e..dda2c21 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -12,7 +12,9 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/kernel.h>
+#include <linux/cgroup.h>
 #include <linux/syscalls.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -168,6 +170,14 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
+
+	if (ctx->root_nsproxy)
+		put_nsproxy(ctx->root_nsproxy);
+	if (ctx->root_task)
+		put_task_struct(ctx->root_task);
+	if (ctx->root_freezer)
+		put_task_struct(ctx->root_freezer);
+
 	kfree(ctx);
 }
 
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 585cb7b..15dbe1b 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -12,12 +12,17 @@
 
 #ifdef __KERNEL__
 
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/fs.h>
 
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
 
-	pid_t root_pid;		/* container identifier */
+	pid_t root_pid;				/* [container] root pid */
+	struct task_struct *root_task;		/* [container] root task */
+	struct nsproxy *root_nsproxy;		/* [container] root nsproxy */
+	struct task_struct *root_freezer;	/* [container] root task */
 
 	unsigned long kflags;	/* kerenl flags */
 	unsigned long uflags;	/* user flags */
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 22/80] c/r: external checkpoint of a task other than ourself
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Now we can do "external" checkpoint, i.e. act on another task.

sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints the corresponding task. That task should be the root of
a container, unless the CHECKPOINT_SUBTREE flag is given.

Set the state of the freezer cgroup of the checkpointed task hierarchy
to "CHECKPOINTING" for the duration of a checkpoint, to ensure that
the task(s) cannot be thawed in the meantime.

Ensure that all tasks belong to the root task's freezer cgroup (the
root task is also tested, to detect if it changes its freezer cgroup
before it moves to "CHECKPOINTING").
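
The freezer rule above can be pictured with a small user-space model.
This is only a sketch under assumptions: the enum values and the
freezer_*() helpers below are hypothetical names for illustration and
are not the kernel's cgroup freezer API. It shows a group that may
enter "CHECKPOINTING" only from the frozen state, and that refuses to
thaw until the checkpoint ends.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical user-space model of a freezer cgroup's state. */
enum freezer_state { THAWED, FREEZING, FROZEN, CHECKPOINTING };

struct freezer {
	enum freezer_state state;
};

/* Enter CHECKPOINTING; only a fully frozen group may be checkpointed. */
static int freezer_begin_checkpoint(struct freezer *f)
{
	assert(f);
	if (f->state != FROZEN)
		return -EBUSY;
	f->state = CHECKPOINTING;
	return 0;
}

/* Leave CHECKPOINTING and fall back to plain FROZEN. */
static void freezer_end_checkpoint(struct freezer *f)
{
	assert(f);
	if (f->state == CHECKPOINTING)
		f->state = FROZEN;
}

/* A thaw request is refused while a checkpoint is in flight. */
static int freezer_thaw(struct freezer *f)
{
	assert(f);
	if (f->state == CHECKPOINTING)
		return -EBUSY;
	f->state = THAWED;
	return 0;
}
```

In this model the begin/end pair brackets the checkpoint the same way
do_checkpoint() brackets its work in the diff below.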

sys_restart() remains nearly the same, as the restart is always done
in the context of the restarting task. However, the original task may
have been frozen from user space, or interrupted from a syscall for
the checkpoint. This is accounted for by restoring a suitable retval
for the restarting task, according to how it was checkpointed.
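
The choice of retval described above can be modeled in user space as
follows. This is a sketch under assumptions: struct saved_regs,
struct restart_result and pick_retval() are hypothetical stand-ins for
the saved pt_regs and for restore_retval() in the diff below, not
kernel interfaces. It shows that the saved register value is reused in
every case, and that only an interrupted syscall that returned an
error takes the signal-handling path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the saved registers of the checkpointed task. */
struct saved_regs {
	long syscall_nr;	/* >= 0 if the task was inside a syscall, else -1 */
	long retval;		/* saved value of the return-value register */
};

struct restart_result {
	long retval;		/* value the restart syscall should return */
	bool fake_signal;	/* take the signal path to handle -ERESTART... */
};

static struct restart_result pick_retval(const struct saved_regs *regs)
{
	struct restart_result res;

	assert(regs);
	/* Every case reuses the saved register value as the return value. */
	res.retval = regs->retval;
	/* Only a syscall that returned an error needs the signal path. */
	res.fake_signal = regs->syscall_nr >= 0 && regs->retval < 0;
	return res;
}
```

A task frozen in userspace has no syscall number, so its register
value is passed through untouched; a self-checkpoint saved a retval of
zero, so it also skips the signal path.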

Changelog[v17]:
  - Move restore_retval() to this patch
  - Tighten ptrace check for checkpoint to PTRACE_MODE_ATTACH
  - Use CHECKPOINTING state for hierarchy's freezer for checkpoint
Changelog[v16]:
  - Use CHECKPOINT_SUBTREE to allow subtree (partial container)
Changelog[v14]:
  - Refuse non-self checkpoint if target task isn't frozen
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them
Changelog[v10]:
  - Grab vfs root of container init, rather than current process

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Kconfig               |    1 +
 checkpoint/checkpoint.c          |   98 +++++++++++++++++++++++++++++++++++++-
 checkpoint/restart.c             |   61 +++++++++++++++++++++++-
 checkpoint/sys.c                 |   10 ++++
 include/linux/checkpoint_types.h |    7 ++-
 5 files changed, 174 insertions(+), 3 deletions(-)

diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
index ef7d406..21fc86b 100644
--- a/checkpoint/Kconfig
+++ b/checkpoint/Kconfig
@@ -5,6 +5,7 @@
 config CHECKPOINT
 	bool "Checkpoint/restart (EXPERIMENTAL)"
 	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+	depends on CGROUP_FREEZER
 	help
 	  Application checkpoint/restart is the ability to save the
 	  state of a running application so that it can later resume
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e228151..ad89f50 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -12,6 +12,9 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/freezer.h>
+#include <linux/ptrace.h>
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -352,14 +355,105 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	if (t->state == TASK_DEAD) {
+		__ckpt_write_err(ctx, "TE", "task state EXIT_DEAD\n", -EBUSY);
+		return -EBUSY;
+	}
+
+	if (!ptrace_may_access(t, PTRACE_MODE_ATTACH)) {
+		__ckpt_write_err(ctx, "TE", "ptrace attach denied", -EPERM);
+		return -EPERM;
+	}
+
+	/* verify that all tasks belong to the same freezer cgroup */
+	if (t != current && !in_same_cgroup_freezer(t, ctx->root_freezer)) {
+		__ckpt_write_err(ctx, "TE", "unfrozen or wrong cgroup", -EBUSY);
+		return -EBUSY;
+	}
+
+	/* FIX: add support for ptraced tasks */
+	if (task_ptrace(t)) {
+		__ckpt_write_err(ctx, "TE", "task is ptraced", -EBUSY);
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+/* setup checkpoint-specific parts of ctx */
+static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task;
+	struct nsproxy *nsproxy;
+	int ret;
+
+	/*
+	 * No need for explicit cleanup here, because if an error
+	 * occurs then ckpt_ctx_free() is eventually called.
+	 */
+
+	ctx->root_pid = pid;
+
+	/* root task */
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+	if (!task)
+		return -ESRCH;
+	else
+		ctx->root_task = task;
+
+	/* root nsproxy */
+	rcu_read_lock();
+	nsproxy = task_nsproxy(task);
+	if (nsproxy)
+		get_nsproxy(nsproxy);
+	rcu_read_unlock();
+	if (!nsproxy)
+		return -ESRCH;
+	else
+		ctx->root_nsproxy = nsproxy;
+
+	/* root freezer */
+	ctx->root_freezer = task;
+	get_task_struct(task);
+
+	ret = may_checkpoint_task(ctx, task);
+	if (ret) {
+		ckpt_write_err(ctx, "", NULL);
+		put_task_struct(task);
+		put_task_struct(task);
+		put_nsproxy(nsproxy);
+		ctx->root_nsproxy = NULL;
+		ctx->root_task = NULL;
+		return ret;
+	}
+
+	return 0;
+}
+
 long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 {
 	long ret;
 
+	ret = init_checkpoint_ctx(ctx, pid);
+	if (ret < 0)
+		return ret;
+
+	if (ctx->root_freezer) {
+		ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
+		if (ret < 0)
+			return ret;
+	}
+
 	ret = checkpoint_write_header(ctx);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_task(ctx, current);
+	ret = checkpoint_task(ctx, ctx->root_task);
 	if (ret < 0)
 		goto out;
 	ret = checkpoint_write_tail(ctx);
@@ -370,5 +464,7 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	ctx->crid = atomic_inc_return(&ctx_count);
 	ret = ctx->crid;
  out:
+	if (ctx->root_freezer)
+		cgroup_freezer_end_checkpoint(ctx->root_freezer);
 	return ret;
 }
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index afe51c2..cb02ffb 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -411,10 +411,67 @@ static int restore_read_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static long restore_retval(void)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	long ret;
+
+	/*
+	 * For the restart, we entered the kernel via sys_restart(),
+	 * so our return path is via the syscall exit. In particular,
+	 * the code in entry.S will put the value that we will return
+	 * into a register (e.g. regs->eax in x86), thus passing it to
+	 * the caller task.
+	 *
+	 * What we do now depends on what happened to the checkpointed
+	 * task right before the checkpoint - there are three cases:
+	 *
+	 * 1) It was carrying out a syscall when it became frozen, or
+	 * 2) It was running in userspace, or
+	 * 3) It was doing a self-checkpoint
+	 *
+	 * In case #1, if the syscall succeeded, perhaps partially,
+	 * then the retval is non-negative. If it failed, the error
+	 * may be one of -ERESTART..., which is interpreted in the
+	 * signal handling code. If that is the case, we force the
+	 * signal handler to kick in by faking a signal to ourselves
+	 * (a la freeze/thaw) when ret < 0.
+	 *
+	 * In case #2, our return value will overwrite the original
+	 * value in the affected register. Work around this by simply
+	 * using the saved value of that register as our retval.
+	 *
+	 * In case #3, the state was recorded while the task was in
+	 * the checkpoint(2) syscall. The syscall is expected to return
+	 * 0 when returning from a restart. Fortunately, this already
+	 * has been arranged for at checkpoint time (the register that
+	 * holds the retval, e.g. regs->eax in x86, was set to
+	 * zero).
+	 */
+
+	/* needed for all 3 cases: get old value/error/retval */
+	ret = syscall_get_return_value(current, regs);
+
+	/* if from a syscall and returning error, kick in signal handling */
+	if (syscall_get_nr(current, regs) >= 0 && ret < 0)
+		set_tsk_thread_flag(current, TIF_SIGPENDING);
+
+	return ret;
+}
+
+/* setup restart-specific parts of ctx */
+static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+	return 0;
+}
+
 long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
 	long ret;
 
+	ret = init_restart_ctx(ctx, pid);
+	if (ret < 0)
+		return ret;
 	ret = restore_read_header(ctx);
 	if (ret < 0)
 		return ret;
@@ -422,7 +479,9 @@ long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		return ret;
 	ret = restore_read_tail(ctx);
+	if (ret < 0)
+		return ret;
 
 	/* on success, adjust the return value if needed [TODO] */
-	return ret;
+	return restore_retval();
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 7f6f71e..dda2c21 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -12,7 +12,9 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/kernel.h>
+#include <linux/cgroup.h>
 #include <linux/syscalls.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -168,6 +170,14 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
+
+	if (ctx->root_nsproxy)
+		put_nsproxy(ctx->root_nsproxy);
+	if (ctx->root_task)
+		put_task_struct(ctx->root_task);
+	if (ctx->root_freezer)
+		put_task_struct(ctx->root_freezer);
+
 	kfree(ctx);
 }
 
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 585cb7b..15dbe1b 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -12,12 +12,17 @@
 
 #ifdef __KERNEL__
 
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/fs.h>
 
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
 
-	pid_t root_pid;		/* container identifier */
+	pid_t root_pid;				/* [container] root pid */
+	struct task_struct *root_task;		/* [container] root task */
+	struct nsproxy *root_nsproxy;		/* [container] root nsproxy */
+	struct task_struct *root_freezer;	/* [container] root task */
 
 	unsigned long kflags;	/* kerenl flags */
 	unsigned long uflags;	/* user flags */
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 23/80] c/r: export functionality used in next patch for restart-blocks
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (21 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 22/80] c/r: external checkpoint of a task other than ourself Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 24/80] c/r: restart-blocks Oren Laadan
                     ` (58 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

To support c/r of restart-blocks (system calls that need to be
restarted because they were interrupted without any userspace-visible
side effect), export the restart-block callbacks for the poll(),
futex() and compat nanosleep syscalls, as well as invalid_clockid().

More details on c/r of restart-blocks, and how it works, are in the
following patch.
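
One plausible reason for exporting these callbacks, sketched here as
an assumption rather than as what the following patch actually does:
a function pointer is not meaningful across a reboot, so checkpoint
code would need to translate a pending callback into a stable
identifier, and restart code would need to translate it back. Both
directions require the callbacks to be visible outside their own
files. All names below (demo_restart_poll, demo_restart_futex, rb_id,
rb_table and the two helpers) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct restart_block;
typedef long (*restart_fn)(struct restart_block *);

struct restart_block {
	restart_fn fn;
};

/* Stand-ins for exported callbacks such as do_restart_poll(). */
long demo_restart_poll(struct restart_block *rb) { (void)rb; return 0; }
long demo_restart_futex(struct restart_block *rb) { (void)rb; return 0; }

/* Hypothetical stable identifiers written to the checkpoint image. */
enum rb_id { RB_NONE, RB_POLL, RB_FUTEX, RB_MAX };

static const restart_fn rb_table[RB_MAX] = {
	[RB_NONE]  = NULL,
	[RB_POLL]  = demo_restart_poll,
	[RB_FUTEX] = demo_restart_futex,
};

/* Checkpoint side: translate a live function pointer into an identifier. */
static int rb_fn_to_id(restart_fn fn)
{
	for (int id = RB_NONE + 1; id < RB_MAX; id++)
		if (rb_table[id] == fn)
			return id;
	return fn ? -1 : RB_NONE;	/* unknown callbacks cannot be saved */
}

/* Restart side: translate an identifier back into a function pointer. */
static restart_fn rb_id_to_fn(int id)
{
	assert(id >= 0 && id < RB_MAX);
	return rb_table[id];
}
```

With static callbacks, a table like rb_table could not be built
outside fs/select.c or kernel/futex.c, which is what the diff below
changes.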

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 fs/select.c                  |    2 +-
 include/linux/futex.h        |   11 +++++++++++
 include/linux/poll.h         |    3 +++
 include/linux/posix-timers.h |    6 ++++++
 kernel/compat.c              |    4 ++--
 kernel/futex.c               |   12 +-----------
 kernel/posix-timers.c        |    2 +-
 7 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8084834..e1bd524 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -866,7 +866,7 @@ out_fds:
 	return err;
 }
 
-static long do_restart_poll(struct restart_block *restart_block)
+long do_restart_poll(struct restart_block *restart_block)
 {
 	struct pollfd __user *ufds = restart_block->poll.ufds;
 	int nfds = restart_block->poll.nfds;
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 34956c8..4326f81 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -136,6 +136,17 @@ extern int
 handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi);
 
 /*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'flags' shared capability
+ */
+#define FLAGS_SHARED		0x01
+#define FLAGS_CLOCKRT		0x02
+#define FLAGS_HAS_TIMEOUT	0x04
+
+/* for c/r */
+extern long futex_wait_restart(struct restart_block *restart);
+
+/*
  * Futexes are matched on equal values of this key.
  * The key type depends on whether it's a shared or private mapping.
  * Don't rearrange members without looking at hash_futex().
diff --git a/include/linux/poll.h b/include/linux/poll.h
index fa287f2..0841c51 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -134,6 +134,9 @@ extern int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 
 extern int poll_select_set_timeout(struct timespec *to, long sec, long nsec);
 
+/* used by checkpoint/restart */
+extern long do_restart_poll(struct restart_block *restart_block);
+
 #endif /* KERNEL */
 
 #endif /* _LINUX_POLL_H */
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 4f71bf4..d0d6a66 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -101,6 +101,10 @@ int posix_cpu_timer_create(struct k_itimer *timer);
 int posix_cpu_nsleep(const clockid_t which_clock, int flags,
 		     struct timespec *rqtp, struct timespec __user *rmtp);
 long posix_cpu_nsleep_restart(struct restart_block *restart_block);
+#ifdef CONFIG_COMPAT
+long compat_nanosleep_restart(struct restart_block *restart);
+long compat_clock_nanosleep_restart(struct restart_block *restart);
+#endif
 int posix_cpu_timer_set(struct k_itimer *timer, int flags,
 			struct itimerspec *new, struct itimerspec *old);
 int posix_cpu_timer_del(struct k_itimer *timer);
@@ -119,4 +123,6 @@ long clock_nanosleep_restart(struct restart_block *restart_block);
 
 void update_rlimit_cpu(unsigned long rlim_new);
 
+int invalid_clockid(const clockid_t which_clock);
+
 #endif
diff --git a/kernel/compat.c b/kernel/compat.c
index f6c204f..20afdba 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -100,7 +100,7 @@ int put_compat_timespec(const struct timespec *ts, struct compat_timespec __user
 			__put_user(ts->tv_nsec, &cts->tv_nsec)) ? -EFAULT : 0;
 }
 
-static long compat_nanosleep_restart(struct restart_block *restart)
+long compat_nanosleep_restart(struct restart_block *restart)
 {
 	struct compat_timespec __user *rmtp;
 	struct timespec rmt;
@@ -647,7 +647,7 @@ long compat_sys_clock_getres(clockid_t which_clock,
 	return err;
 }
 
-static long compat_clock_nanosleep_restart(struct restart_block *restart)
+long compat_clock_nanosleep_restart(struct restart_block *restart)
 {
 	long err;
 	mm_segment_t oldfs;
diff --git a/kernel/futex.c b/kernel/futex.c
index e18cfbd..def86c8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1533,16 +1533,6 @@ handle_fault:
 	goto retry;
 }
 
-/*
- * In case we must use restart_block to restart a futex_wait,
- * we encode in the 'flags' shared capability
- */
-#define FLAGS_SHARED		0x01
-#define FLAGS_CLOCKRT		0x02
-#define FLAGS_HAS_TIMEOUT	0x04
-
-static long futex_wait_restart(struct restart_block *restart);
-
 /**
  * fixup_owner() - Post lock pi_state and corner case management
  * @uaddr:	user address of the futex
@@ -1812,7 +1802,7 @@ out:
 }
 
 
-static long futex_wait_restart(struct restart_block *restart)
+long futex_wait_restart(struct restart_block *restart)
 {
 	u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
 	int fshared = 0;
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index d089d05..7a4fc9d 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -211,7 +211,7 @@ static int no_nsleep(const clockid_t which_clock, int flags,
 /*
  * Return nonzero if we know a priori this clockid_t value is bogus.
  */
-static inline int invalid_clockid(const clockid_t which_clock)
+int invalid_clockid(const clockid_t which_clock)
 {
 	if (which_clock < 0)	/* CPU clock, posix_cpu_* will check it */
 		return 0;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 23/80] c/r: export functionality used in next patch for restart-blocks
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

To support c/r of restart-blocks (system call that need to be
restarted because they were interrupted but there was no userspace
visible side-effect), export restart-block callbacks for poll()
and futex() syscalls.

More details on c/r of restart-blocks and how it works in the
following patch.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 fs/select.c                  |    2 +-
 include/linux/futex.h        |   11 +++++++++++
 include/linux/poll.h         |    3 +++
 include/linux/posix-timers.h |    6 ++++++
 kernel/compat.c              |    4 ++--
 kernel/futex.c               |   12 +-----------
 kernel/posix-timers.c        |    2 +-
 7 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8084834..e1bd524 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -866,7 +866,7 @@ out_fds:
 	return err;
 }
 
-static long do_restart_poll(struct restart_block *restart_block)
+long do_restart_poll(struct restart_block *restart_block)
 {
 	struct pollfd __user *ufds = restart_block->poll.ufds;
 	int nfds = restart_block->poll.nfds;
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 34956c8..4326f81 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -136,6 +136,17 @@ extern int
 handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi);
 
 /*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'flags' shared capability
+ */
+#define FLAGS_SHARED		0x01
+#define FLAGS_CLOCKRT		0x02
+#define FLAGS_HAS_TIMEOUT	0x04
+
+/* for c/r */
+extern long futex_wait_restart(struct restart_block *restart);
+
+/*
  * Futexes are matched on equal values of this key.
  * The key type depends on whether it's a shared or private mapping.
  * Don't rearrange members without looking at hash_futex().
diff --git a/include/linux/poll.h b/include/linux/poll.h
index fa287f2..0841c51 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -134,6 +134,9 @@ extern int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 
 extern int poll_select_set_timeout(struct timespec *to, long sec, long nsec);
 
+/* used by checkpoint/restart */
+extern long do_restart_poll(struct restart_block *restart_block);
+
 #endif /* KERNEL */
 
 #endif /* _LINUX_POLL_H */
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 4f71bf4..d0d6a66 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -101,6 +101,10 @@ int posix_cpu_timer_create(struct k_itimer *timer);
 int posix_cpu_nsleep(const clockid_t which_clock, int flags,
 		     struct timespec *rqtp, struct timespec __user *rmtp);
 long posix_cpu_nsleep_restart(struct restart_block *restart_block);
+#ifdef CONFIG_COMPAT
+long compat_nanosleep_restart(struct restart_block *restart);
+long compat_clock_nanosleep_restart(struct restart_block *restart);
+#endif
 int posix_cpu_timer_set(struct k_itimer *timer, int flags,
 			struct itimerspec *new, struct itimerspec *old);
 int posix_cpu_timer_del(struct k_itimer *timer);
@@ -119,4 +123,6 @@ long clock_nanosleep_restart(struct restart_block *restart_block);
 
 void update_rlimit_cpu(unsigned long rlim_new);
 
+int invalid_clockid(const clockid_t which_clock);
+
 #endif
diff --git a/kernel/compat.c b/kernel/compat.c
index f6c204f..20afdba 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -100,7 +100,7 @@ int put_compat_timespec(const struct timespec *ts, struct compat_timespec __user
 			__put_user(ts->tv_nsec, &cts->tv_nsec)) ? -EFAULT : 0;
 }
 
-static long compat_nanosleep_restart(struct restart_block *restart)
+long compat_nanosleep_restart(struct restart_block *restart)
 {
 	struct compat_timespec __user *rmtp;
 	struct timespec rmt;
@@ -647,7 +647,7 @@ long compat_sys_clock_getres(clockid_t which_clock,
 	return err;
 }
 
-static long compat_clock_nanosleep_restart(struct restart_block *restart)
+long compat_clock_nanosleep_restart(struct restart_block *restart)
 {
 	long err;
 	mm_segment_t oldfs;
diff --git a/kernel/futex.c b/kernel/futex.c
index e18cfbd..def86c8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1533,16 +1533,6 @@ handle_fault:
 	goto retry;
 }
 
-/*
- * In case we must use restart_block to restart a futex_wait,
- * we encode in the 'flags' shared capability
- */
-#define FLAGS_SHARED		0x01
-#define FLAGS_CLOCKRT		0x02
-#define FLAGS_HAS_TIMEOUT	0x04
-
-static long futex_wait_restart(struct restart_block *restart);
-
 /**
  * fixup_owner() - Post lock pi_state and corner case management
  * @uaddr:	user address of the futex
@@ -1812,7 +1802,7 @@ out:
 }
 
 
-static long futex_wait_restart(struct restart_block *restart)
+long futex_wait_restart(struct restart_block *restart)
 {
 	u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
 	int fshared = 0;
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index d089d05..7a4fc9d 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -211,7 +211,7 @@ static int no_nsleep(const clockid_t which_clock, int flags,
 /*
  * Return nonzero if we know a priori this clockid_t value is bogus.
  */
-static inline int invalid_clockid(const clockid_t which_clock)
+int invalid_clockid(const clockid_t which_clock)
 {
 	if (which_clock < 0)	/* CPU clock, posix_cpu_* will check it */
 		return 0;
-- 
1.6.0.4



* [PATCH v18 24/80] c/r: restart-blocks
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (22 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 23/80] c/r: export functionality used in next patch for restart-blocks Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 25/80] c/r: checkpoint multiple processes Oren Laadan
                     ` (57 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

(Paraphrasing what's said in this message:
http://lists.openwall.net/linux-kernel/2007/12/05/64)

Restart blocks are callbacks used to cause a system call to be
restarted with the arguments specified in the system call restart
block. They are useful for system calls that are not idempotent, e.g.
when an argument is a relative timeout, where some adjustments are
required when restarting the system call. The mechanism relies on the
system call itself to set up its restart point and the argument save
area. Restarts are rare: an actual signal would turn the call into an
EINTR. The only case that should ever trigger a restart is some kernel
action that interrupts the system call but does not actually result in
any user-visible state changes, like freeze and thaw.
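
A minimal userspace sketch of the idea, not kernel code: the struct
below is a simplified stand-in for the kernel's restart_block, and
now_ns, sleep_begin() and sleep_restart() are hypothetical names used
only for illustration. It shows why a relative timeout cannot simply be
replayed: the call records an absolute deadline up front, and the
restart callback sleeps only for what is left.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct restart_block. */
struct restart_block {
	long (*fn)(struct restart_block *);
	int64_t expires_ns;	/* absolute deadline, fixed at first entry */
};

int64_t now_ns;		/* hypothetical clock, advanced by hand */

/* Restart callback: sleep only for the time still left, then report 0. */
long sleep_restart(struct restart_block *rb)
{
	if (rb->expires_ns > now_ns)
		now_ns = rb->expires_ns;	/* "sleep" until the deadline */
	return 0;
}

/* First entry: turn the relative timeout into a deadline, arm the block. */
void sleep_begin(struct restart_block *rb, int64_t rel_timeout_ns)
{
	assert(rel_timeout_ns >= 0);
	rb->expires_ns = now_ns + rel_timeout_ns;
	rb->fn = sleep_restart;
}
```

If the sleep is interrupted 40 ns into a 100 ns timeout, calling
rb->fn(rb) waits for the remaining 60 ns rather than a fresh 100 ns.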

So restart blocks are about the time remaining for the system call to
sleep/wait. Generally in c/r, there are two possible time models that
we can follow: absolute and relative. Here, I chose to save the
relative timeout, measured from the beginning of the checkpoint. The
time when the checkpoint (and restart) begins is also saved. This
information is sufficient to restart in either model (absolute or
relative).
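
As a rough illustration of the relative model, the arithmetic can be
sketched in plain C. The helper names are hypothetical and int64_t
stands in for the kernel's s64; the first helper mirrors how the patch
computes h->arg_4, the second how restore adds it to the restart's
begin time.

```c
#include <assert.h>
#include <stdint.h>

/* Checkpoint side: time left to wait, measured from checkpoint start. */
int64_t ckpt_remaining_ns(int64_t ckpt_begin_ns, int64_t expire_ns)
{
	/* An already expired timeout is saved as zero, never negative. */
	return ckpt_begin_ns < expire_ns ? expire_ns - ckpt_begin_ns : 0;
}

/* Restart side: re-base the saved remainder on the restart's start time. */
int64_t restart_expire_ns(int64_t restart_begin_ns, int64_t remaining_ns)
{
	assert(remaining_ns >= 0);
	return restart_begin_ns + remaining_ns;
}
```

For example, a sleep due to expire at 1500 ns in a checkpoint that
began at 1000 ns is saved as 500 ns; a restart that begins at 70 ns on
the new clock then expires at 570 ns.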

Which model to use should eventually be a per-application choice (and
possibly configurable via cradvise() or something of the sort). For
now, we adopt the relative model; namely, at restart the timeout is set
relative to the beginning of the restart.

To checkpoint, we check if a task has a valid restart block, and if so
we save the *remaining* time that it has to wait/sleep, and the type
of the restart block.
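
A function pointer is meaningless in another kernel image, which is
why the type is saved as an enumerated value instead. The sketch below
shows that mapping in userspace C. All names in it are hypothetical
stand-ins, and where the patch calls BUG() on an unrecognized callback,
the sketch returns RB_UNKNOWN instead.

```c
#include <assert.h>
#include <stddef.h>

struct restart_block;
typedef long (*restart_fn)(struct restart_block *);

/* Stand-ins for kernel restart callbacks (hypothetical names). */
long no_restart(struct restart_block *rb) { (void)rb; return -1; }
long nanosleep_restart(struct restart_block *rb) { (void)rb; return 0; }
long futex_restart(struct restart_block *rb) { (void)rb; return 1; }
long other_restart(struct restart_block *rb) { (void)rb; return 2; }

enum rb_type {
	RB_UNKNOWN = 0,		/* sketch only; the patch hits BUG() here */
	RB_NONE = 1,
	RB_NANOSLEEP,
	RB_FUTEX,
};

/* Translate a callback address into a value that is safe to store. */
enum rb_type rb_classify(restart_fn fn)
{
	assert(fn != NULL);	/* a task always has some callback set */
	if (fn == no_restart)
		return RB_NONE;
	if (fn == nanosleep_restart)
		return RB_NANOSLEEP;
	if (fn == futex_restart)
		return RB_FUTEX;
	return RB_UNKNOWN;
}
```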

To restart, we fill in the data required at the proper place in the
thread information. If the system call returns an error (possibly
-ERESTARTSYS, for example), we not only use that error as our own
return value, but also arrange for the task to execute the signal
handler (by faking a signal). The handler, in turn, already has the
code to handle these restart requests gracefully.
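
The first step above (filling in the restart block from the image) can
be sketched as follows. This is a simplified userspace illustration
with hypothetical names, and it does not cover the faked signal. Like
the patch, it builds the block in a temporary and installs it only when
the saved type is recognized, returning -EINVAL otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct restart_block. */
struct rb {
	long (*fn)(struct rb *);
	int64_t expires_ns;
};

/* Stand-ins for kernel restart callbacks (hypothetical names). */
long no_restart(struct rb *rb) { (void)rb; return -1; }
long nanosleep_restart(struct rb *rb) { (void)rb; return 0; }

enum { RB_NONE = 1, RB_NANOSLEEP };

/* Rebuild a restart block from its saved type and remaining time. */
int rb_restore(uint64_t type, int64_t begin_ns, uint64_t remaining_ns,
	       struct rb *out)
{
	struct rb tmp = { .fn = NULL, .expires_ns = 0 };

	assert(out != NULL);
	switch (type) {
	case RB_NONE:
		tmp.fn = no_restart;
		break;
	case RB_NANOSLEEP:
		tmp.fn = nanosleep_restart;
		tmp.expires_ns = begin_ns + (int64_t)remaining_ns;
		break;
	default:
		break;		/* unknown type: fn stays NULL */
	}

	if (!tmp.fn)
		return -EINVAL;	/* never install an unrecognized block */
	*out = tmp;
	return 0;
}
```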

Changelog[v1]:
  - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c          |    1 +
 checkpoint/process.c             |  226 ++++++++++++++++++++++++++++++++++++++
 checkpoint/restart.c             |    5 +-
 checkpoint/sys.c                 |    1 +
 include/linux/checkpoint.h       |    4 +
 include/linux/checkpoint_hdr.h   |   22 ++++
 include/linux/checkpoint_types.h |    3 +
 7 files changed, 260 insertions(+), 2 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index ad89f50..554400c 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -22,6 +22,7 @@
 #include <linux/mount.h>
 #include <linux/utsname.h>
 #include <linux/magic.h>
+#include <linux/hrtimer.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 1d1170c..330c8d4 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -12,6 +12,9 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/sched.h>
+#include <linux/posix-timers.h>
+#include <linux/futex.h>
+#include <linux/poll.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -47,6 +50,116 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+/* dump the restart block of a given task */
+int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_restart_block *h;
+	struct restart_block *restart_block;
+	long (*fn)(struct restart_block *);
+	s64 base, expire = 0;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_RESTART_BLOCK);
+	if (!h)
+		return -ENOMEM;
+
+	base = ktime_to_ns(ctx->ktime_begin);
+	restart_block = &task_thread_info(t)->restart_block;
+	fn = restart_block->fn;
+
+	/* FIX: enumerate clockid_t so we're immune to changes */
+
+	if (fn == do_no_restart_syscall) {
+
+		h->function_type = CKPT_RESTART_BLOCK_NONE;
+		ckpt_debug("restart_block: none\n");
+
+	} else if (fn == hrtimer_nanosleep_restart) {
+
+		h->function_type = CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP;
+		h->arg_0 = restart_block->nanosleep.index;
+		h->arg_1 = (unsigned long) restart_block->nanosleep.rmtp;
+		expire = restart_block->nanosleep.expires;
+		ckpt_debug("restart_block: hrtimer expire %lld now %lld\n",
+			 expire, base);
+
+	} else if (fn == posix_cpu_nsleep_restart) {
+		struct timespec ts;
+
+		h->function_type = CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP;
+		h->arg_0 = restart_block->arg0;
+		h->arg_1 = restart_block->arg1;
+		ts.tv_sec = restart_block->arg2;
+		ts.tv_nsec = restart_block->arg3;
+		expire = timespec_to_ns(&ts);
+		ckpt_debug("restart_block: posix_cpu expire %lld now %lld\n",
+			 expire, base);
+
+#ifdef CONFIG_COMPAT
+	} else if (fn == compat_nanosleep_restart) {
+
+		h->function_type = CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP;
+		h->arg_0 = restart_block->nanosleep.index;
+		h->arg_1 = (unsigned long)restart_block->nanosleep.rmtp;
+		h->arg_2 = (unsigned long)restart_block->nanosleep.compat_rmtp;
+		expire = restart_block->nanosleep.expires;
+		ckpt_debug("restart_block: compat expire %lld now %lld\n",
+			 expire, base);
+
+	} else if (fn == compat_clock_nanosleep_restart) {
+
+		h->function_type = CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP;
+		h->arg_0 = restart_block->nanosleep.index;
+		h->arg_1 = (unsigned long)restart_block->nanosleep.rmtp;
+		h->arg_2 = (unsigned long)restart_block->nanosleep.compat_rmtp;
+		expire = restart_block->nanosleep.expires;
+		ckpt_debug("restart_block: compat_clock expire %lld now %lld\n",
+			 expire, base);
+
+#endif
+	} else if (fn == futex_wait_restart) {
+
+		h->function_type = CKPT_RESTART_BLOCK_FUTEX;
+		h->arg_0 = (unsigned long) restart_block->futex.uaddr;
+		h->arg_1 = restart_block->futex.val;
+		h->arg_2 = restart_block->futex.flags;
+		h->arg_3 = restart_block->futex.bitset;
+		expire = restart_block->futex.time;
+		ckpt_debug("restart_block: futex expire %lld now %lld\n",
+			 expire, base);
+
+	} else if (fn == do_restart_poll) {
+		struct timespec ts;
+
+		h->function_type = CKPT_RESTART_BLOCK_POLL;
+		h->arg_0 = (unsigned long) restart_block->poll.ufds;
+		h->arg_1 = restart_block->poll.nfds;
+		h->arg_2 = restart_block->poll.has_timeout;
+		ts.tv_sec = restart_block->poll.tv_sec;
+		ts.tv_nsec = restart_block->poll.tv_nsec;
+		expire = timespec_to_ns(&ts);
+		ckpt_debug("restart_block: poll expire %lld now %lld\n",
+			 expire, base);
+
+	} else {
+
+		BUG();
+
+	}
+
+	/* common to all restart blocks: */
+	h->arg_4 = (base < expire ? expire - base : 0);
+
+	ckpt_debug("restart_block: args %#llx %#llx %#llx %#llx %#llx\n",
+		 h->arg_0, h->arg_1, h->arg_2, h->arg_3, h->arg_4);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	ckpt_debug("restart_block ret %d\n", ret);
+	return ret;
+}
+
 /* dump the entire state of a given task */
 int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -63,6 +176,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ckpt_debug("thread %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = checkpoint_restart_block(ctx, t);
+	ckpt_debug("restart-blocks %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
  out:
@@ -99,6 +216,111 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+int restore_restart_block(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_restart_block *h;
+	struct restart_block restart_block;
+	struct timespec ts;
+	clockid_t clockid;
+	s64 expire;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_RESTART_BLOCK);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	expire = ktime_to_ns(ctx->ktime_begin) + h->arg_4;
+	restart_block.fn = NULL;
+
+	ckpt_debug("restart_block: expire %lld begin %lld\n",
+		 expire, ktime_to_ns(ctx->ktime_begin));
+	ckpt_debug("restart_block: args %#llx %#llx %#llx %#llx %#llx\n",
+		 h->arg_0, h->arg_1, h->arg_2, h->arg_3, h->arg_4);
+
+	switch (h->function_type) {
+	case CKPT_RESTART_BLOCK_NONE:
+		restart_block.fn = do_no_restart_syscall;
+		break;
+	case CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = hrtimer_nanosleep_restart;
+		restart_block.nanosleep.index = clockid;
+		restart_block.nanosleep.rmtp =
+			(struct timespec __user *) (unsigned long) h->arg_1;
+		restart_block.nanosleep.expires = expire;
+		break;
+	case CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = posix_cpu_nsleep_restart;
+		restart_block.arg0 = clockid;
+		restart_block.arg1 = h->arg_1;
+		ts = ns_to_timespec(expire);
+		restart_block.arg2 = ts.tv_sec;
+		restart_block.arg3 = ts.tv_nsec;
+		break;
+#ifdef CONFIG_COMPAT
+	case CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = compat_nanosleep_restart;
+		restart_block.nanosleep.index = clockid;
+		restart_block.nanosleep.rmtp =
+			(struct timespec __user *) (unsigned long) h->arg_1;
+		restart_block.nanosleep.compat_rmtp =
+			(struct compat_timespec __user *)
+				(unsigned long) h->arg_2;
+		restart_block.nanosleep.expires = expire;
+		break;
+	case CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = compat_clock_nanosleep_restart;
+		restart_block.nanosleep.index = clockid;
+		restart_block.nanosleep.rmtp =
+			(struct timespec __user *) (unsigned long) h->arg_1;
+		restart_block.nanosleep.compat_rmtp =
+			(struct compat_timespec __user *)
+				(unsigned long) h->arg_2;
+		restart_block.nanosleep.expires = expire;
+		break;
+#endif
+	case CKPT_RESTART_BLOCK_FUTEX:
+		restart_block.fn = futex_wait_restart;
+		restart_block.futex.uaddr = (u32 *) (unsigned long) h->arg_0;
+		restart_block.futex.val = h->arg_1;
+		restart_block.futex.flags = h->arg_2;
+		restart_block.futex.bitset = h->arg_3;
+		restart_block.futex.time = expire;
+		break;
+	case CKPT_RESTART_BLOCK_POLL:
+		restart_block.fn = do_restart_poll;
+		restart_block.poll.ufds =
+			(struct pollfd __user *) (unsigned long) h->arg_0;
+		restart_block.poll.nfds = h->arg_1;
+		restart_block.poll.has_timeout = h->arg_2;
+		ts = ns_to_timespec(expire);
+		restart_block.poll.tv_sec = ts.tv_sec;
+		restart_block.poll.tv_nsec = ts.tv_nsec;
+		break;
+	default:
+		break;
+	}
+
+	if (restart_block.fn)
+		task_thread_info(current)->restart_block = restart_block;
+	else
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 /* read the entire state of the current task */
 int restore_task(struct ckpt_ctx *ctx)
 {
@@ -112,6 +334,10 @@ int restore_task(struct ckpt_ctx *ctx)
 	ckpt_debug("thread %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = restore_restart_block(ctx);
+	ckpt_debug("restart-blocks %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = restore_cpu(ctx);
 	ckpt_debug("cpu %d\n", ret);
  out:
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index cb02ffb..fdad264 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -16,6 +16,8 @@
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
+#include <asm/syscall.h>
+#include <linux/elf.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -482,6 +484,5 @@ long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		return ret;
 
-	/* on success, adjust the return value if needed [TODO] */
-	return restore_retval(ctx);
+	return restore_retval();
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index dda2c21..b37bc8c 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -193,6 +193,7 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 
 	ctx->uflags = uflags;
 	ctx->kflags = kflags;
+	ctx->ktime_begin = ktime_get();
 
 	err = -EBADF;
 	ctx->file = fget(fd);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index aa8ce11..14c0a7f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -70,6 +70,10 @@ extern int restore_read_header_arch(struct ckpt_ctx *ctx);
 extern int restore_thread(struct ckpt_ctx *ctx);
 extern int restore_cpu(struct ckpt_ctx *ctx);
 
+extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
+				    struct task_struct *t);
+extern int restore_restart_block(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 92d082e..b72c59c 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -52,6 +52,7 @@ enum {
 	CKPT_HDR_STRING,
 
 	CKPT_HDR_TASK = 101,
+	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
 
@@ -122,4 +123,25 @@ struct ckpt_hdr_task {
 	__u64 clear_child_tid;
 } __attribute__((aligned(8)));
 
+/* restart blocks */
+struct ckpt_hdr_restart_block {
+	struct ckpt_hdr h;
+	__u64 function_type;
+	__u64 arg_0;
+	__u64 arg_1;
+	__u64 arg_2;
+	__u64 arg_3;
+	__u64 arg_4;
+} __attribute__((aligned(8)));
+
+enum restart_block_type {
+	CKPT_RESTART_BLOCK_NONE = 1,
+	CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP,
+	CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP,
+	CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP,
+	CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP,
+	CKPT_RESTART_BLOCK_POLL,
+	CKPT_RESTART_BLOCK_FUTEX
+};
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 15dbe1b..046bdc4 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -15,10 +15,13 @@
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/fs.h>
+#include <linux/ktime.h>
 
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
 
+	ktime_t ktime_begin;	/* checkpoint start time */
+
 	pid_t root_pid;				/* [container] root pid */
 	struct task_struct *root_task;		/* [container] root task */
 	struct nsproxy *root_nsproxy;		/* [container] root nsproxy */
-- 
1.6.0.4


* [PATCH v18 24/80] c/r: restart-blocks
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

(Paraphrasing what's said in this message:
http://lists.openwall.net/linux-kernel/2007/12/05/64)

Restart blocks are callbacks used to cause a system call to be
restarted with the arguments specified in the system call restart
block. They are useful for system calls that are not idempotent, e.g.
when an argument is a relative timeout, where some adjustments are
required when restarting the system call. The mechanism relies on the
system call itself to set up its restart point and the argument save
area. Restarts are rare: an actual signal would turn the call into an
EINTR. The only case that should ever trigger a restart is some kernel
action that interrupts the system call but does not actually result in
any user-visible state changes, like freeze and thaw.

So restart blocks are about the time remaining for the system call to
sleep/wait. Generally in c/r, there are two possible time models that
we can follow: absolute and relative. Here, I chose to save the
relative timeout, measured from the beginning of the checkpoint. The
time when the checkpoint (and restart) begins is also saved. This
information is sufficient to restart in either model (absolute or
relative).

Which model to use should eventually be a per-application choice (and
possibly configurable via cradvise() or something of the sort). For
now, we adopt the relative model; namely, at restart the timeout is set
relative to the beginning of the restart.

To checkpoint, we check if a task has a valid restart block, and if so
we save the *remaining* time that it has to wait/sleep, and the type
of the restart block.

To restart, we fill in the data required at the proper place in the
thread information. If the system call returns an error (possibly
-ERESTARTSYS, for example), we not only use that error as our own
return value, but also arrange for the task to execute the signal
handler (by faking a signal). The handler, in turn, already has the
code to handle these restart requests gracefully.
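
The image records a type tag rather than a function pointer, so restart
has to map the tag back to a handler and reject anything it does not
recognize. A minimal user-space sketch of that idea follows; every demo_*
name is hypothetical and merely stands in for the CKPT_RESTART_BLOCK_*
tags and the kernel restart callbacks.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type tags, standing in for CKPT_RESTART_BLOCK_*. */
enum demo_rb_type { DEMO_RB_NONE = 1, DEMO_RB_NANOSLEEP, DEMO_RB_POLL };

typedef long (*demo_rb_fn)(void);

/* Hypothetical stand-ins for the restart callbacks. */
long demo_no_restart(void)        { return -EINTR; }
long demo_nanosleep_restart(void) { return 0; }
long demo_poll_restart(void)      { return 0; }

/* Map a saved tag to a handler; NULL means the image data is invalid. */
demo_rb_fn demo_rb_lookup(uint64_t type, int clockid_ok)
{
	switch (type) {
	case DEMO_RB_NONE:
		return demo_no_restart;
	case DEMO_RB_NANOSLEEP:
		/* a bad clock id invalidates the whole block */
		return clockid_ok ? demo_nanosleep_restart : NULL;
	case DEMO_RB_POLL:
		return demo_poll_restart;
	default:
		return NULL;
	}
}

/* Install the handler in *slot only if the saved data was valid. */
int demo_rb_restore(uint64_t type, int clockid_ok, demo_rb_fn *slot)
{
	demo_rb_fn fn = demo_rb_lookup(type, clockid_ok);

	if (!fn)
		return -EINVAL;
	*slot = fn;
	return 0;
}
```

The point of the design is that a malformed image leaves the task's
existing restart state untouched and fails with -EINVAL.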

Changelog[v1]:
  - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c          |    1 +
 checkpoint/process.c             |  226 ++++++++++++++++++++++++++++++++++++++
 checkpoint/restart.c             |    5 +-
 checkpoint/sys.c                 |    1 +
 include/linux/checkpoint.h       |    4 +
 include/linux/checkpoint_hdr.h   |   22 ++++
 include/linux/checkpoint_types.h |    3 +
 7 files changed, 260 insertions(+), 2 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index ad89f50..554400c 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -22,6 +22,7 @@
 #include <linux/mount.h>
 #include <linux/utsname.h>
 #include <linux/magic.h>
+#include <linux/hrtimer.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 1d1170c..330c8d4 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -12,6 +12,9 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/sched.h>
+#include <linux/posix-timers.h>
+#include <linux/futex.h>
+#include <linux/poll.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -47,6 +50,116 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+/* dump the restart block of a given task */
+int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_restart_block *h;
+	struct restart_block *restart_block;
+	long (*fn)(struct restart_block *);
+	s64 base, expire = 0;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_RESTART_BLOCK);
+	if (!h)
+		return -ENOMEM;
+
+	base = ktime_to_ns(ctx->ktime_begin);
+	restart_block = &task_thread_info(t)->restart_block;
+	fn = restart_block->fn;
+
+	/* FIX: enumerate clockid_t so we're immune to changes */
+
+	if (fn == do_no_restart_syscall) {
+
+		h->function_type = CKPT_RESTART_BLOCK_NONE;
+		ckpt_debug("restart_block: none\n");
+
+	} else if (fn == hrtimer_nanosleep_restart) {
+
+		h->function_type = CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP;
+		h->arg_0 = restart_block->nanosleep.index;
+		h->arg_1 = (unsigned long) restart_block->nanosleep.rmtp;
+		expire = restart_block->nanosleep.expires;
+		ckpt_debug("restart_block: hrtimer expire %lld now %lld\n",
+			 expire, base);
+
+	} else if (fn == posix_cpu_nsleep_restart) {
+		struct timespec ts;
+
+		h->function_type = CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP;
+		h->arg_0 = restart_block->arg0;
+		h->arg_1 = restart_block->arg1;
+		ts.tv_sec = restart_block->arg2;
+		ts.tv_nsec = restart_block->arg3;
+		expire = timespec_to_ns(&ts);
+		ckpt_debug("restart_block: posix_cpu expire %lld now %lld\n",
+			 expire, base);
+
+#ifdef CONFIG_COMPAT
+	} else if (fn == compat_nanosleep_restart) {
+
+		h->function_type = CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP;
+		h->arg_0 = restart_block->nanosleep.index;
+		h->arg_1 = (unsigned long)restart_block->nanosleep.rmtp;
+		h->arg_2 = (unsigned long)restart_block->nanosleep.compat_rmtp;
+		expire = restart_block->nanosleep.expires;
+		ckpt_debug("restart_block: compat expire %lld now %lld\n",
+			 expire, base);
+
+	} else if (fn == compat_clock_nanosleep_restart) {
+
+		h->function_type = CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP;
+		h->arg_0 = restart_block->nanosleep.index;
+		h->arg_1 = (unsigned long)restart_block->nanosleep.rmtp;
+		h->arg_2 = (unsigned long)restart_block->nanosleep.compat_rmtp;
+		expire = restart_block->nanosleep.expires;
+		ckpt_debug("restart_block: compat_clock expire %lld now %lld\n",
+			 expire, base);
+
+#endif
+	} else if (fn == futex_wait_restart) {
+
+		h->function_type = CKPT_RESTART_BLOCK_FUTEX;
+		h->arg_0 = (unsigned long) restart_block->futex.uaddr;
+		h->arg_1 = restart_block->futex.val;
+		h->arg_2 = restart_block->futex.flags;
+		h->arg_3 = restart_block->futex.bitset;
+		expire = restart_block->futex.time;
+		ckpt_debug("restart_block: futex expire %lld now %lld\n",
+			 expire, base);
+
+	} else if (fn == do_restart_poll) {
+		struct timespec ts;
+
+		h->function_type = CKPT_RESTART_BLOCK_POLL;
+		h->arg_0 = (unsigned long) restart_block->poll.ufds;
+		h->arg_1 = restart_block->poll.nfds;
+		h->arg_2 = restart_block->poll.has_timeout;
+		ts.tv_sec = restart_block->poll.tv_sec;
+		ts.tv_nsec = restart_block->poll.tv_nsec;
+		expire = timespec_to_ns(&ts);
+		ckpt_debug("restart_block: poll expire %lld now %lld\n",
+			 expire, base);
+
+	} else {
+
+		BUG();
+
+	}
+
+	/* common to all restart blocks: */
+	h->arg_4 = (base < expire ? expire - base : 0);
+
+	ckpt_debug("restart_block: args %#llx %#llx %#llx %#llx %#llx\n",
+		 h->arg_0, h->arg_1, h->arg_2, h->arg_3, h->arg_4);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	ckpt_debug("restart_block ret %d\n", ret);
+	return ret;
+}
+
 /* dump the entire state of a given task */
 int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -63,6 +176,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ckpt_debug("thread %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = checkpoint_restart_block(ctx, t);
+	ckpt_debug("restart-blocks %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
  out:
@@ -99,6 +216,111 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+int restore_restart_block(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_restart_block *h;
+	struct restart_block restart_block;
+	struct timespec ts;
+	clockid_t clockid;
+	s64 expire;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_RESTART_BLOCK);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	expire = ktime_to_ns(ctx->ktime_begin) + h->arg_4;
+	restart_block.fn = NULL;
+
+	ckpt_debug("restart_block: expire %lld begin %lld\n",
+		 expire, ktime_to_ns(ctx->ktime_begin));
+	ckpt_debug("restart_block: args %#llx %#llx %#llx %#llx %#llx\n",
+		 h->arg_0, h->arg_1, h->arg_2, h->arg_3, h->arg_4);
+
+	switch (h->function_type) {
+	case CKPT_RESTART_BLOCK_NONE:
+		restart_block.fn = do_no_restart_syscall;
+		break;
+	case CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = hrtimer_nanosleep_restart;
+		restart_block.nanosleep.index = clockid;
+		restart_block.nanosleep.rmtp =
+			(struct timespec __user *) (unsigned long) h->arg_1;
+		restart_block.nanosleep.expires = expire;
+		break;
+	case CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = posix_cpu_nsleep_restart;
+		restart_block.arg0 = clockid;
+		restart_block.arg1 = h->arg_1;
+		ts = ns_to_timespec(expire);
+		restart_block.arg2 = ts.tv_sec;
+		restart_block.arg3 = ts.tv_nsec;
+		break;
+#ifdef CONFIG_COMPAT
+	case CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = compat_nanosleep_restart;
+		restart_block.nanosleep.index = clockid;
+		restart_block.nanosleep.rmtp =
+			(struct timespec __user *) (unsigned long) h->arg_1;
+		restart_block.nanosleep.compat_rmtp =
+			(struct compat_timespec __user *)
+				(unsigned long) h->arg_2;
+		restart_block.nanosleep.expires = expire;
+		break;
+	case CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP:
+		clockid = h->arg_0;
+		if (clockid < 0 || invalid_clockid(clockid))
+			break;
+		restart_block.fn = compat_clock_nanosleep_restart;
+		restart_block.nanosleep.index = clockid;
+		restart_block.nanosleep.rmtp =
+			(struct timespec __user *) (unsigned long) h->arg_1;
+		restart_block.nanosleep.compat_rmtp =
+			(struct compat_timespec __user *)
+				(unsigned long) h->arg_2;
+		restart_block.nanosleep.expires = expire;
+		break;
+#endif
+	case CKPT_RESTART_BLOCK_FUTEX:
+		restart_block.fn = futex_wait_restart;
+		restart_block.futex.uaddr = (u32 *) (unsigned long) h->arg_0;
+		restart_block.futex.val = h->arg_1;
+		restart_block.futex.flags = h->arg_2;
+		restart_block.futex.bitset = h->arg_3;
+		restart_block.futex.time = expire;
+		break;
+	case CKPT_RESTART_BLOCK_POLL:
+		restart_block.fn = do_restart_poll;
+		restart_block.poll.ufds =
+			(struct pollfd __user *) (unsigned long) h->arg_0;
+		restart_block.poll.nfds = h->arg_1;
+		restart_block.poll.has_timeout = h->arg_2;
+		ts = ns_to_timespec(expire);
+		restart_block.poll.tv_sec = ts.tv_sec;
+		restart_block.poll.tv_nsec = ts.tv_nsec;
+		break;
+	default:
+		break;
+	}
+
+	if (restart_block.fn)
+		task_thread_info(current)->restart_block = restart_block;
+	else
+		ret = -EINVAL;
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 /* read the entire state of the current task */
 int restore_task(struct ckpt_ctx *ctx)
 {
@@ -112,6 +334,10 @@ int restore_task(struct ckpt_ctx *ctx)
 	ckpt_debug("thread %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = restore_restart_block(ctx);
+	ckpt_debug("restart-blocks %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = restore_cpu(ctx);
 	ckpt_debug("cpu %d\n", ret);
  out:
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index cb02ffb..fdad264 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -16,6 +16,8 @@
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
+#include <asm/syscall.h>
+#include <linux/elf.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -482,6 +484,5 @@ long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		return ret;
 
-	/* on success, adjust the return value if needed [TODO] */
-	return restore_retval(ctx);
+	return restore_retval();
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index dda2c21..b37bc8c 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -193,6 +193,7 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 
 	ctx->uflags = uflags;
 	ctx->kflags = kflags;
+	ctx->ktime_begin = ktime_get();
 
 	err = -EBADF;
 	ctx->file = fget(fd);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index aa8ce11..14c0a7f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -70,6 +70,10 @@ extern int restore_read_header_arch(struct ckpt_ctx *ctx);
 extern int restore_thread(struct ckpt_ctx *ctx);
 extern int restore_cpu(struct ckpt_ctx *ctx);
 
+extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
+				    struct task_struct *t);
+extern int restore_restart_block(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 92d082e..b72c59c 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -52,6 +52,7 @@ enum {
 	CKPT_HDR_STRING,
 
 	CKPT_HDR_TASK = 101,
+	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
 
@@ -122,4 +123,25 @@ struct ckpt_hdr_task {
 	__u64 clear_child_tid;
 } __attribute__((aligned(8)));
 
+/* restart blocks */
+struct ckpt_hdr_restart_block {
+	struct ckpt_hdr h;
+	__u64 function_type;
+	__u64 arg_0;
+	__u64 arg_1;
+	__u64 arg_2;
+	__u64 arg_3;
+	__u64 arg_4;
+} __attribute__((aligned(8)));
+
+enum restart_block_type {
+	CKPT_RESTART_BLOCK_NONE = 1,
+	CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP,
+	CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP,
+	CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP,
+	CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP,
+	CKPT_RESTART_BLOCK_POLL,
+	CKPT_RESTART_BLOCK_FUTEX
+};
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 15dbe1b..046bdc4 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -15,10 +15,13 @@
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/fs.h>
+#include <linux/ktime.h>
 
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
 
+	ktime_t ktime_begin;	/* checkpoint start time */
+
 	pid_t root_pid;				/* [container] root pid */
 	struct task_struct *root_task;		/* [container] root task */
 	struct nsproxy *root_nsproxy;		/* [container] root nsproxy */
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 25/80] c/r: checkpoint multiple processes
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (23 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 24/80] c/r: restart-blocks Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 26/80] c/r: restart " Oren Laadan
                     ` (56 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Checkpointing of multiple processes works by recording the tasks tree
structure below a given "root" task. The root task is expected to be a
container init, and then an entire container is checkpointed. However,
passing CHECKPOINT_SUBTREE to checkpoint(2) relaxes this requirement
and allows checkpointing a subtree of processes from the root task.
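
The rule above can be sketched as a small stand-alone check. This is
only an illustration: DEMO_SUBTREE, DEMO_USER_FLAGS and demo_check_root()
are hypothetical names mirroring CHECKPOINT_SUBTREE, CHECKPOINT_USER_FLAGS
and the test done in init_checkpoint_ctx().

```c
#include <errno.h>

/* Hypothetical mirrors of CHECKPOINT_SUBTREE / CHECKPOINT_USER_FLAGS. */
#define DEMO_SUBTREE	0x1UL
#define DEMO_USER_FLAGS	DEMO_SUBTREE

/*
 * Returns 0 if a checkpoint rooted at the given task may proceed,
 * -EINVAL otherwise. Unknown flag bits are always rejected; a root
 * that is not a container init is accepted only with DEMO_SUBTREE.
 */
int demo_check_root(unsigned long flags, int root_is_container_init)
{
	if (flags & ~DEMO_USER_FLAGS)
		return -EINVAL;
	if (!(flags & DEMO_SUBTREE) && !root_is_container_init)
		return -EINVAL;
	return 0;
}
```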

For a given root task, do a DFS scan of the task tree and collect the
tasks into an array (keeping a reference to each task). Using DFS
simplifies the recreation of tasks either in user space or kernel space.
For each task collected, test if it can be checkpointed, and save its
pid, tgid, pgid, sid, and ppid.

The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.
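
To illustrate the count-then-fill idea, here is a minimal user-space
sketch over a toy tree. It is not the patch code: 'struct node',
tree_walk() and tree_collect() are hypothetical, and the walk is
recursive for brevity, whereas the patch walks the task lists
iteratively under tasklist_lock.

```c
#include <stdlib.h>

/* Hypothetical toy node (first child / next sibling), standing in
 * for task_struct. */
struct node {
	int id;
	struct node *child;
	struct node *sibling;
};

/*
 * Pre-order DFS starting at 'n'. With arr == NULL it only counts;
 * otherwise it also stores each node, up to 'cap' entries.
 * Returns the running count, or -1 if the tree outgrew the array.
 */
int tree_walk(const struct node *n, const struct node **arr, int cap, int nr)
{
	const struct node *c;

	if (arr) {
		if (nr == cap)
			return -1;
		arr[nr] = n;
	}
	nr++;
	for (c = n->child; c; c = c->sibling) {
		nr = tree_walk(c, arr, cap, nr);
		if (nr < 0)
			return nr;
	}
	return nr;
}

/* Pass 1 counts, then allocate, then pass 2 fills. Caller frees. */
const struct node **tree_collect(const struct node *root, int *nr_out)
{
	const struct node **arr;
	int n, m;

	n = tree_walk(root, NULL, 0, 0);
	arr = calloc(n, sizeof(*arr));
	if (!arr)
		return NULL;

	m = tree_walk(root, arr, n, 0);
	if (m != n) {		/* tree changed between the passes */
		free(arr);
		return NULL;
	}
	*nr_out = n;
	return arr;
}
```

The second pass re-checks the count because the tree may change between
the passes; the patch reports that case as -EBUSY.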

Whether checkpoints and restarts require CAP_SYS_ADMIN is determined
by the sysctl 'ckpt_unpriv_allowed'. If 1 (the default), unprivileged
users may checkpoint and restart, and the regular permission checks are
relied upon to prevent privilege escalation. If 0, CAP_SYS_ADMIN is
required, which prevents unprivileged users from exploiting any
privilege escalation bugs.
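
The gate itself is a single test, sketched below as a stand-alone
function. ckpt_may_proceed() is a hypothetical name; in the patch the
same condition is checked inline using capable(CAP_SYS_ADMIN).

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the sysctl gate: with unpriv_allowed == 0 only
 * callers holding CAP_SYS_ADMIN may proceed; with 1 everyone may, and
 * the regular permission checks are left to do the job.
 */
int ckpt_may_proceed(int unpriv_allowed, bool has_cap_sys_admin)
{
	if (!unpriv_allowed && !has_cap_sys_admin)
		return -EPERM;
	return 0;
}
```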

The logic is suitable for creation of processes during restart either
in userspace or by the kernel.

Currently we ignore threads and zombies.

Changelog[v18]:
  - Replace some EAGAIN with EBUSY
  - Add a few more ckpt_write_err()s
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
Changelog[v16]:
  - CHECKPOINT_SUBTREE flag allows subtree (not whole container)
  - sysctl variable 'ckpt_unpriv_allowed' controls needed privileges
Changelog[v14]:
  - Refuse non-self checkpoint if target task isn't frozen
  - Refuse checkpoint (for now) if task is ptraced
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Check retval of ckpt_tree_count_tasks() in ckpt_build_tree()
  - Discard 'h.parent' field
  - Check whether calls to ckpt_hbuf_get() fail
  - Disallow threads or siblings to container init
Changelog[v13]:
  - Release tasklist_lock in error path in ckpt_tree_count_tasks()
  - Use separate index for 'tasks_arr' and 'hh' in ckpt_write_pids()
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c          |  295 ++++++++++++++++++++++++++++++++++++--
 checkpoint/restart.c             |    2 +-
 checkpoint/sys.c                 |   33 ++++-
 include/linux/checkpoint.h       |    6 +
 include/linux/checkpoint_hdr.h   |   16 ++-
 include/linux/checkpoint_types.h |    4 +
 kernel/sysctl.c                  |   17 +++
 7 files changed, 355 insertions(+), 18 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 554400c..fc02436 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -356,8 +356,27 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* dump all tasks in ctx->tasks_arr[] */
+static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
+{
+	int n, ret = 0;
+
+	for (n = 0; n < ctx->nr_tasks; n++) {
+		ckpt_debug("dumping task #%d\n", n);
+		ret = checkpoint_task(ctx, ctx->tasks_arr[n]);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+
 static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
+	struct task_struct *root = ctx->root_task;
+
+	ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
+
 	if (t->state == TASK_DEAD) {
 		__ckpt_write_err(ctx, "TE", "task state EXIT_DEAD\n", -EBUSY);
 		return -EBUSY;
@@ -380,15 +399,258 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -EBUSY;
 	}
 
+	/*
+	 * FIX: for now, disallow siblings of container init created
+	 * via CLONE_PARENT (unclear if they will remain possible)
+	 */
+	if (ctx->root_init && t != root && t->tgid != root->tgid &&
+	    t->real_parent == root->real_parent) {
+		__ckpt_write_err(ctx, "TE", "task is sibling of root", -EINVAL);
+		return -EINVAL;
+	}
+
+	/* FIX: change this when namespaces are added */
+	if (task_nsproxy(t) != ctx->root_nsproxy)
+		return -EPERM;
+
+	return 0;
+}
+
+#define CKPT_HDR_PIDS_CHUNK	256
+
+static int checkpoint_pids(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pids *h;
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	struct task_struct **tasks_arr;
+	int nr_tasks, n, pos = 0, ret = 0;
+
+	ns = ctx->root_nsproxy->pid_ns;
+	tasks_arr = ctx->tasks_arr;
+	nr_tasks = ctx->nr_tasks;
+	BUG_ON(nr_tasks <= 0);
+
+	ret = ckpt_write_obj_type(ctx, NULL,
+				  sizeof(*h) * nr_tasks,
+				  CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+
+	h = ckpt_hdr_get(ctx, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
+	if (!h)
+		return -ENOMEM;
+
+	do {
+		rcu_read_lock();
+		for (n = 0; n < min(nr_tasks, CKPT_HDR_PIDS_CHUNK); n++) {
+			task = tasks_arr[pos];
+
+			h[n].vpid = task_pid_nr_ns(task, ns);
+			h[n].vtgid = task_tgid_nr_ns(task, ns);
+			h[n].vpgid = task_pgrp_nr_ns(task, ns);
+			h[n].vsid = task_session_nr_ns(task, ns);
+			h[n].vppid = task_tgid_nr_ns(task->real_parent, ns);
+			ckpt_debug("task[%d]: vpid %d vtgid %d parent %d\n",
+				   pos, h[n].vpid, h[n].vtgid, h[n].vppid);
+			pos++;
+		}
+		rcu_read_unlock();
+
+		n = min(nr_tasks, CKPT_HDR_PIDS_CHUNK);
+		ret = ckpt_kwrite(ctx, h, n * sizeof(*h));
+		if (ret < 0)
+			break;
+
+		nr_tasks -= n;
+	} while (nr_tasks > 0);
+
+	_ckpt_hdr_put(ctx, h, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
+	return ret;
+}
+
+/* count number of tasks in tree (and optionally fill pid's in array) */
+static int tree_count_tasks(struct ckpt_ctx *ctx)
+{
+	struct task_struct *root;
+	struct task_struct *task;
+	struct task_struct *parent;
+	struct task_struct **tasks_arr = ctx->tasks_arr;
+	int nr_tasks = ctx->nr_tasks;
+	int nr = 0;
+	int ret;
+
+	read_lock(&tasklist_lock);
+
+	/* we hold the lock, so root_task->real_parent can't change */
+	task = ctx->root_task;
+	if (ctx->root_init) {
+		/* container-init: start from container parent */
+		parent = task->real_parent;
+		root = parent;
+	} else {
+		/* non-container-init: start from root task and down */
+		parent = NULL;
+		root = task;
+	}
+
+	/* count tasks via DFS scan of the tree */
+	while (1) {
+		ctx->tsk = task;  /* (for ckpt_write_err) */
+
+		/* is this task cool ? */
+		ret = may_checkpoint_task(ctx, task);
+		if (ret < 0) {
+			nr = ret;
+			break;
+		}
+		if (tasks_arr) {
+			/* unlikely... but if so then try again later */
+			if (nr == nr_tasks) {
+				nr = -EBUSY; /* cleanup in ckpt_ctx_free() */
+				break;
+			}
+			tasks_arr[nr] = task;
+			get_task_struct(task);
+		}
+		nr++;
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+		if (task == root)
+			break;
+	}
+	ctx->tsk = NULL;
+
+	read_unlock(&tasklist_lock);
+
+	if (nr < 0)
+		ckpt_write_err(ctx, "", NULL);
+	return nr;
+}
+
+/*
+ * build_tree - scan the tasks tree in DFS order and fill in array
+ * @ctx: checkpoint context
+ *
+ * Using DFS order simplifies the restart logic to re-create the tasks.
+ *
+ * On success, ctx->tasks_arr will be allocated and populated with all
+ * tasks (reference taken), and ctx->nr_tasks will hold the total count.
+ * The array is cleaned up by ckpt_ctx_free().
+ */
+static int build_tree(struct ckpt_ctx *ctx)
+{
+	int n, m;
+
+	/* count tasks (no side effects) */
+	n = tree_count_tasks(ctx);
+	if (n < 0)
+		return n;
+
+	ctx->nr_tasks = n;
+	ctx->tasks_arr = kzalloc(n * sizeof(*ctx->tasks_arr), GFP_KERNEL);
+	if (!ctx->tasks_arr)
+		return -ENOMEM;
+
+	/* count again (now will fill array) */
+	m = tree_count_tasks(ctx);
+
+	/* unlikely, but ... (cleanup in ckpt_ctx_free) */
+	if (m < 0)
+		return m;
+	else if (m != n)
+		return -EBUSY;
+
 	return 0;
 }
 
+/* dump the array that describes the tasks tree */
+static int checkpoint_tree(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tree *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TREE);
+	if (!h)
+		return -ENOMEM;
+
+	h->nr_tasks = ctx->nr_tasks;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	ret = checkpoint_pids(ctx);
+	return ret;
+}
+
+static struct task_struct *get_freezer_task(struct task_struct *root_task)
+{
+	struct task_struct *p;
+
+	/*
+	 * For the duration of checkpoint we deep-freeze all tasks.
+	 * Normally do it through the root task's freezer cgroup.
+	 * However, if the root task is also the current task (doing
+	 * self-checkpoint) we can't freeze ourselves. In this case,
+	 * choose the next available (non-dead) task instead. We'll
+	 * use its freezer cgroup to verify that all tasks belong to
+	 * the same cgroup.
+	 */
+
+	if (root_task != current) {
+		get_task_struct(root_task);
+		return root_task;
+	}
+
+	/* search among threads, then children */
+	read_lock(&tasklist_lock);
+
+	for (p = next_thread(root_task); p != root_task; p = next_thread(p)) {
+		if (p->state == TASK_DEAD)
+			continue;
+		if (!in_same_cgroup_freezer(p, root_task))
+			goto out;
+	}
+
+	list_for_each_entry(p, &root_task->children, sibling) {
+		if (p->state == TASK_DEAD)
+			continue;
+		if (!in_same_cgroup_freezer(p, root_task))
+			goto out;
+	}
+
+	p = NULL;
+ out:
+	read_unlock(&tasklist_lock);
+	if (p)
+		get_task_struct(p);
+	return p;
+}
+
 /* setup checkpoint-specific parts of ctx */
 static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
-	int ret;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -420,18 +682,14 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		ctx->root_nsproxy = nsproxy;
 
 	/* root freezer */
-	ctx->root_freezer = task;
-	get_task_struct(task);
+	ctx->root_freezer = get_freezer_task(task);
 
-	ret = may_checkpoint_task(ctx, task);
-	if (ret) {
-		ckpt_write_err(ctx, "", NULL);
-		put_task_struct(task);
-		put_task_struct(task);
-		put_nsproxy(nsproxy);
-		ctx->root_nsproxy = NULL;
-		ctx->root_task = NULL;
-		return ret;
+	/* container init ? */
+	ctx->root_init = is_container_init(task);
+
+	if (!(ctx->uflags & CHECKPOINT_SUBTREE) && !ctx->root_init) {
+		ckpt_write_err(ctx, "E", "not container init", -EINVAL);
+		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
 	return 0;
@@ -447,14 +705,23 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 
 	if (ctx->root_freezer) {
 		ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
-		if (ret < 0)
+		if (ret < 0) {
+			ckpt_write_err(ctx, "E", "freezer cgroup failed", ret);
 			return ret;
+		}
 	}
 
+	ret = build_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = checkpoint_write_header(ctx);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_task(ctx, ctx->root_task);
+	ret = checkpoint_tree(ctx);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_all_tasks(ctx);
 	if (ret < 0)
 		goto out;
 	ret = checkpoint_write_tail(ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index fdad264..3f22403 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -364,7 +364,7 @@ static int restore_read_header(struct ckpt_ctx *ctx)
 	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
 	    h->patch != ((LINUX_VERSION_CODE) & 0xff))
 		goto out;
-	if (h->uflags)
+	if (h->uflags & ~CHECKPOINT_USER_FLAGS)
 		goto out;
 
 	ret = check_kernel_const(&h->constants);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index b37bc8c..cc94775 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -23,6 +23,14 @@
 #include <linux/checkpoint.h>
 
 /*
+ * ckpt_unpriv_allowed - sysctl controlled. If 0, do not allow checkpoints
+ * or restarts unless the caller has CAP_SYS_ADMIN (prevents unprivileged
+ * users from exploiting any privilege escalation bugs). If it is 1,
+ * then regular permission checks are intended to do the job.
+ */
+int ckpt_unpriv_allowed = 1;	/* default: allow */
+
+/*
  * Helpers to write(read) from(to) kernel space to(from) the checkpoint
  * image file descriptor (similar to how a core-dump is performed).
  *
@@ -166,11 +174,27 @@ void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
  * restart operation, and persists until the operation is completed.
  */
 
+static void task_arr_free(struct ckpt_ctx *ctx)
+{
+	int n;
+
+	for (n = 0; n < ctx->nr_tasks; n++) {
+		if (ctx->tasks_arr[n]) {
+			put_task_struct(ctx->tasks_arr[n]);
+			ctx->tasks_arr[n] = NULL;
+		}
+	}
+	kfree(ctx->tasks_arr);
+}
+
 static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
 
+	if (ctx->tasks_arr)
+		task_arr_free(ctx);
+
 	if (ctx->root_nsproxy)
 		put_nsproxy(ctx->root_nsproxy);
 	if (ctx->root_task)
@@ -220,10 +244,12 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
 	struct ckpt_ctx *ctx;
 	long ret;
 
-	/* no flags for now */
-	if (flags)
+	if (flags & ~CHECKPOINT_USER_FLAGS)
 		return -EINVAL;
 
+	if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	if (pid == 0)
 		pid = task_pid_vnr(current);
 	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_CHECKPOINT);
@@ -257,6 +283,9 @@ SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 	if (flags)
 		return -EINVAL;
 
+	if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 14c0a7f..a4650bb 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -12,6 +12,9 @@
 
 #define CHECKPOINT_VERSION  2
 
+/* checkpoint user flags */
+#define CHECKPOINT_SUBTREE	0x1
+
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
 
@@ -26,6 +29,9 @@
 #define CKPT_CTX_CHECKPOINT	(1 << CKPT_CTX_CHECKPOINT_BIT)
 #define CKPT_CTX_RESTART	(1 << CKPT_CTX_RESTART_BIT)
 
+/* ckpt_ctx: uflags */
+#define CHECKPOINT_USER_FLAGS		CHECKPOINT_SUBTREE
+
 
 extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
 extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b72c59c..26e10fb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -51,7 +51,8 @@ enum {
 	CKPT_HDR_BUFFER,
 	CKPT_HDR_STRING,
 
-	CKPT_HDR_TASK = 101,
+	CKPT_HDR_TREE = 101,
+	CKPT_HDR_TASK,
 	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
@@ -110,6 +111,19 @@ struct ckpt_hdr_tail {
 	__u64 magic;
 } __attribute__((aligned(8)));
 
+/* task tree */
+struct ckpt_hdr_tree {
+	struct ckpt_hdr h;
+	__s32 nr_tasks;
+} __attribute__((aligned(8)));
+
+struct ckpt_pids {
+	__s32 vpid;
+	__s32 vppid;
+	__s32 vtgid;
+	__s32 vpgid;
+	__s32 vsid;
+} __attribute__((aligned(8)));
 
 /* task data */
 struct ckpt_hdr_task {
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 046bdc4..c3399b3 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -22,6 +22,7 @@ struct ckpt_ctx {
 
 	ktime_t ktime_begin;	/* checkpoint start time */
 
+	int root_init;				/* [container] root init ? */
 	pid_t root_pid;				/* [container] root pid */
 	struct task_struct *root_task;		/* [container] root task */
 	struct nsproxy *root_nsproxy;		/* [container] root nsproxy */
@@ -34,6 +35,9 @@ struct ckpt_ctx {
 	struct file *file;	/* input/output file */
 	int total;		/* total read/written */
 
+	struct task_struct **tasks_arr;	/* array of all tasks in container */
+	int nr_tasks;			/* size of tasks array */
+
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
 };
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 58be760..3046e2c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -198,6 +198,10 @@ int sysctl_legacy_va_layout;
 extern int prove_locking;
 extern int lock_stat;
 
+#ifdef CONFIG_CHECKPOINT
+extern int ckpt_unpriv_allowed;
+#endif
+
 /* The default sysctl tables: */
 
 static struct ctl_table root_table[] = {
@@ -990,6 +994,19 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_CHECKPOINT
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "ckpt_unpriv_allowed",
+		.data		= &ckpt_unpriv_allowed,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 
 /*
  * NOTE: do not add new entries to this table unless you have read
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 25/80] c/r: checkpoint multiple processes
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Checkpointing of multiple processes works by recording the tasks tree
structure below a given "root" task. The root task is expected to be a
container init, and then an entire container is checkpointed. However,
passing CHECKPOINT_SUBTREE to checkpoint(2) relaxes this requirement
and allows checkpointing a subtree of processes from the root task.

For a given root task, do a DFS scan of the task tree and collect the
tasks into an array (keeping a reference to each task). Using DFS
simplifies the recreation of tasks either in user space or kernel space.
For each task collected, test if it can be checkpointed, and save its
pid, tgid, pgid, sid, and ppid.

The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.

Whether checkpoints and restarts require CAP_SYS_ADMIN is determined
by the sysctl 'ckpt_unpriv_allowed'. If 1 (the default), unprivileged
users may checkpoint and restart, and the regular permission checks are
relied upon to prevent privilege escalation. If 0, CAP_SYS_ADMIN is
required, which prevents unprivileged users from exploiting any
privilege escalation bugs.

The logic is suitable for creation of processes during restart either
in userspace or by the kernel.

Currently we ignore threads and zombies.

Changelog[v18]:
  - Replace some EAGAIN with EBUSY
  - Add a few more ckpt_write_err()s
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
Changelog[v16]:
  - CHECKPOINT_SUBTREE flag allows subtree (not whole container)
  - sysctl variable 'ckpt_unpriv_allowed' controls needed privileges
Changelog[v14]:
  - Refuse non-self checkpoint if target task isn't frozen
  - Refuse checkpoint (for now) if task is ptraced
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Check retval of ckpt_tree_count_tasks() in ckpt_build_tree()
  - Discard 'h.parent' field
  - Check whether calls to ckpt_hbuf_get() fail
  - Disallow threads or siblings to container init
Changelog[v13]:
  - Release tasklist_lock in error path in ckpt_tree_count_tasks()
  - Use separate index for 'tasks_arr' and 'hh' in ckpt_write_pids()
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c          |  295 ++++++++++++++++++++++++++++++++++++--
 checkpoint/restart.c             |    2 +-
 checkpoint/sys.c                 |   33 ++++-
 include/linux/checkpoint.h       |    6 +
 include/linux/checkpoint_hdr.h   |   16 ++-
 include/linux/checkpoint_types.h |    4 +
 kernel/sysctl.c                  |   17 +++
 7 files changed, 355 insertions(+), 18 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 554400c..fc02436 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -356,8 +356,27 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* dump all tasks in ctx->tasks_arr[] */
+static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
+{
+	int n, ret = 0;
+
+	for (n = 0; n < ctx->nr_tasks; n++) {
+		ckpt_debug("dumping task #%d\n", n);
+		ret = checkpoint_task(ctx, ctx->tasks_arr[n]);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+
 static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
+	struct task_struct *root = ctx->root_task;
+
+	ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
+
 	if (t->state == TASK_DEAD) {
 		__ckpt_write_err(ctx, "TE", "task state EXIT_DEAD\n", -EBUSY);
 		return -EBUSY;
@@ -380,15 +399,258 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -EBUSY;
 	}
 
+	/*
+	 * FIX: for now, disallow siblings of container init created
+	 * via CLONE_PARENT (unclear if they will remain possible)
+	 */
+	if (ctx->root_init && t != root && t->tgid != root->tgid &&
+	    t->real_parent == root->real_parent) {
+		__ckpt_write_err(ctx, "TE", "task is sibling of root", -EINVAL);
+		return -EINVAL;
+	}
+
+	/* FIX: change this when namespaces are added */
+	if (task_nsproxy(t) != ctx->root_nsproxy)
+		return -EPERM;
+
+	return 0;
+}
+
+#define CKPT_HDR_PIDS_CHUNK	256
+
+static int checkpoint_pids(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pids *h;
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	struct task_struct **tasks_arr;
+	int nr_tasks, n, pos = 0, ret = 0;
+
+	ns = ctx->root_nsproxy->pid_ns;
+	tasks_arr = ctx->tasks_arr;
+	nr_tasks = ctx->nr_tasks;
+	BUG_ON(nr_tasks <= 0);
+
+	ret = ckpt_write_obj_type(ctx, NULL,
+				  sizeof(*h) * nr_tasks,
+				  CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+
+	h = ckpt_hdr_get(ctx, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
+	if (!h)
+		return -ENOMEM;
+
+	do {
+		rcu_read_lock();
+		for (n = 0; n < min(nr_tasks, CKPT_HDR_PIDS_CHUNK); n++) {
+			task = tasks_arr[pos];
+
+			h[n].vpid = task_pid_nr_ns(task, ns);
+			h[n].vtgid = task_tgid_nr_ns(task, ns);
+			h[n].vpgid = task_pgrp_nr_ns(task, ns);
+			h[n].vsid = task_session_nr_ns(task, ns);
+			h[n].vppid = task_tgid_nr_ns(task->real_parent, ns);
+			ckpt_debug("task[%d]: vpid %d vtgid %d parent %d\n",
+				   pos, h[n].vpid, h[n].vtgid, h[n].vppid);
+			pos++;
+		}
+		rcu_read_unlock();
+
+		n = min(nr_tasks, CKPT_HDR_PIDS_CHUNK);
+		ret = ckpt_kwrite(ctx, h, n * sizeof(*h));
+		if (ret < 0)
+			break;
+
+		nr_tasks -= n;
+	} while (nr_tasks > 0);
+
+	_ckpt_hdr_put(ctx, h, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
+	return ret;
+}
+
+/* count number of tasks in tree (and optionally fill ctx->tasks_arr) */
+static int tree_count_tasks(struct ckpt_ctx *ctx)
+{
+	struct task_struct *root;
+	struct task_struct *task;
+	struct task_struct *parent;
+	struct task_struct **tasks_arr = ctx->tasks_arr;
+	int nr_tasks = ctx->nr_tasks;
+	int nr = 0;
+	int ret;
+
+	read_lock(&tasklist_lock);
+
+	/* we hold the lock, so root_task->real_parent can't change */
+	task = ctx->root_task;
+	if (ctx->root_init) {
+		/* container-init: start from container parent */
+		parent = task->real_parent;
+		root = parent;
+	} else {
+		/* non-container-init: start from root task and down */
+		parent = NULL;
+		root = task;
+	}
+
+	/* count tasks via DFS scan of the tree */
+	while (1) {
+		ctx->tsk = task;  /* (for ckpt_write_err) */
+
+		/* is this task cool ? */
+		ret = may_checkpoint_task(ctx, task);
+		if (ret < 0) {
+			nr = ret;
+			break;
+		}
+		if (tasks_arr) {
+			/* unlikely... but if so then try again later */
+			if (nr == nr_tasks) {
+				nr = -EBUSY; /* cleanup in ckpt_ctx_free() */
+				break;
+			}
+			tasks_arr[nr] = task;
+			get_task_struct(task);
+		}
+		nr++;
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+		if (task == root)
+			break;
+	}
+	ctx->tsk = NULL;
+
+	read_unlock(&tasklist_lock);
+
+	if (nr < 0)
+		ckpt_write_err(ctx, "", NULL);
+	return nr;
+}
+
+/*
+ * build_tree - scan the tasks tree in DFS order and fill in array
+ * @ctx: checkpoint context
+ *
+ * Using DFS order simplifies the restart logic to re-create the tasks.
+ *
+ * On success, ctx->tasks_arr will be allocated and populated with all
+ * tasks (reference taken), and ctx->nr_tasks will hold the total count.
+ * The array is cleaned up by ckpt_ctx_free().
+ */
+static int build_tree(struct ckpt_ctx *ctx)
+{
+	int n, m;
+
+	/* count tasks (no side effects) */
+	n = tree_count_tasks(ctx);
+	if (n < 0)
+		return n;
+
+	ctx->nr_tasks = n;
+	ctx->tasks_arr = kzalloc(n * sizeof(*ctx->tasks_arr), GFP_KERNEL);
+	if (!ctx->tasks_arr)
+		return -ENOMEM;
+
+	/* count again (now will fill array) */
+	m = tree_count_tasks(ctx);
+
+	/* unlikely, but ... (cleanup in ckpt_ctx_free) */
+	if (m < 0)
+		return m;
+	else if (m != n)
+		return -EBUSY;
+
 	return 0;
 }
 
+/* dump the array that describes the tasks tree */
+static int checkpoint_tree(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tree *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TREE);
+	if (!h)
+		return -ENOMEM;
+
+	h->nr_tasks = ctx->nr_tasks;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	ret = checkpoint_pids(ctx);
+	return ret;
+}
+
+static struct task_struct *get_freezer_task(struct task_struct *root_task)
+{
+	struct task_struct *p;
+
+	/*
+	 * For the duration of checkpoint we deep-freeze all tasks.
+	 * Normally do it through the root task's freezer cgroup.
+	 * However, if the root task is also the current task (doing
+	 * self-checkpoint) we can't freeze ourselves. In this case,
+	 * choose the next available (non-dead) task instead. We'll
+	 * use its freezer cgroup to verify that all tasks belong to
+	 * the same cgroup.
+	 */
+
+	if (root_task != current) {
+		get_task_struct(root_task);
+		return root_task;
+	}
+
+	/* search among threads, then children */
+	read_lock(&tasklist_lock);
+
+	for (p = next_thread(root_task); p != root_task; p = next_thread(p)) {
+		if (p->state == TASK_DEAD)
+			continue;
+		if (!in_same_cgroup_freezer(p, root_task))
+			goto out;
+	}
+
+	list_for_each_entry(p, &root_task->children, sibling) {
+		if (p->state == TASK_DEAD)
+			continue;
+		if (!in_same_cgroup_freezer(p, root_task))
+			goto out;
+	}
+
+	p = NULL;
+ out:
+	read_unlock(&tasklist_lock);
+	if (p)
+		get_task_struct(p);
+	return p;
+}
+
 /* setup checkpoint-specific parts of ctx */
 static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
-	int ret;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -420,18 +682,14 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		ctx->root_nsproxy = nsproxy;
 
 	/* root freezer */
-	ctx->root_freezer = task;
-	get_task_struct(task);
+	ctx->root_freezer = get_freezer_task(task);
 
-	ret = may_checkpoint_task(ctx, task);
-	if (ret) {
-		ckpt_write_err(ctx, "", NULL);
-		put_task_struct(task);
-		put_task_struct(task);
-		put_nsproxy(nsproxy);
-		ctx->root_nsproxy = NULL;
-		ctx->root_task = NULL;
-		return ret;
+	/* container init ? */
+	ctx->root_init = is_container_init(task);
+
+	if (!(ctx->uflags & CHECKPOINT_SUBTREE) && !ctx->root_init) {
+		ckpt_write_err(ctx, "E", "not container init", -EINVAL);
+		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
 	return 0;
@@ -447,14 +705,23 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 
 	if (ctx->root_freezer) {
 		ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
-		if (ret < 0)
+		if (ret < 0) {
+			ckpt_write_err(ctx, "E", "freezer cgroup failed", ret);
 			return ret;
+		}
 	}
 
+	ret = build_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = checkpoint_write_header(ctx);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_task(ctx, ctx->root_task);
+	ret = checkpoint_tree(ctx);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_all_tasks(ctx);
 	if (ret < 0)
 		goto out;
 	ret = checkpoint_write_tail(ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index fdad264..3f22403 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -364,7 +364,7 @@ static int restore_read_header(struct ckpt_ctx *ctx)
 	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
 	    h->patch != ((LINUX_VERSION_CODE) & 0xff))
 		goto out;
-	if (h->uflags)
+	if (h->uflags & ~CHECKPOINT_USER_FLAGS)
 		goto out;
 
 	ret = check_kernel_const(&h->constants);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index b37bc8c..cc94775 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -23,6 +23,14 @@
 #include <linux/checkpoint.h>
 
 /*
+ * ckpt_unpriv_allowed - sysctl controlled. If 0, do not allow checkpoints
+ * or restarts unless the caller has CAP_SYS_ADMIN (prevents unprivileged
+ * users from exploiting any privilege escalation bugs). If 1, regular
+ * permission checks are intended to do the job.
+ */
+int ckpt_unpriv_allowed = 1;	/* default: allow */
+
+/*
  * Helpers to write(read) from(to) kernel space to(from) the checkpoint
  * image file descriptor (similar to how a core-dump is performed).
  *
@@ -166,11 +174,27 @@ void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
  * restart operation, and persists until the operation is completed.
  */
 
+static void task_arr_free(struct ckpt_ctx *ctx)
+{
+	int n;
+
+	for (n = 0; n < ctx->nr_tasks; n++) {
+		if (ctx->tasks_arr[n]) {
+			put_task_struct(ctx->tasks_arr[n]);
+			ctx->tasks_arr[n] = NULL;
+		}
+	}
+	kfree(ctx->tasks_arr);
+}
+
 static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
 
+	if (ctx->tasks_arr)
+		task_arr_free(ctx);
+
 	if (ctx->root_nsproxy)
 		put_nsproxy(ctx->root_nsproxy);
 	if (ctx->root_task)
@@ -220,10 +244,12 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
 	struct ckpt_ctx *ctx;
 	long ret;
 
-	/* no flags for now */
-	if (flags)
+	if (flags & ~CHECKPOINT_USER_FLAGS)
 		return -EINVAL;
 
+	if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	if (pid == 0)
 		pid = task_pid_vnr(current);
 	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_CHECKPOINT);
@@ -257,6 +283,9 @@ SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 	if (flags)
 		return -EINVAL;
 
+	if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 14c0a7f..a4650bb 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -12,6 +12,9 @@
 
 #define CHECKPOINT_VERSION  2
 
+/* checkpoint user flags */
+#define CHECKPOINT_SUBTREE	0x1
+
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
 
@@ -26,6 +29,9 @@
 #define CKPT_CTX_CHECKPOINT	(1 << CKPT_CTX_CHECKPOINT_BIT)
 #define CKPT_CTX_RESTART	(1 << CKPT_CTX_RESTART_BIT)
 
+/* ckpt_ctx: uflags */
+#define CHECKPOINT_USER_FLAGS		CHECKPOINT_SUBTREE
+
 
 extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
 extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b72c59c..26e10fb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -51,7 +51,8 @@ enum {
 	CKPT_HDR_BUFFER,
 	CKPT_HDR_STRING,
 
-	CKPT_HDR_TASK = 101,
+	CKPT_HDR_TREE = 101,
+	CKPT_HDR_TASK,
 	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
@@ -110,6 +111,19 @@ struct ckpt_hdr_tail {
 	__u64 magic;
 } __attribute__((aligned(8)));
 
+/* task tree */
+struct ckpt_hdr_tree {
+	struct ckpt_hdr h;
+	__s32 nr_tasks;
+} __attribute__((aligned(8)));
+
+struct ckpt_pids {
+	__s32 vpid;
+	__s32 vppid;
+	__s32 vtgid;
+	__s32 vpgid;
+	__s32 vsid;
+} __attribute__((aligned(8)));
 
 /* task data */
 struct ckpt_hdr_task {
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 046bdc4..c3399b3 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -22,6 +22,7 @@ struct ckpt_ctx {
 
 	ktime_t ktime_begin;	/* checkpoint start time */
 
+	int root_init;				/* [container] root init ? */
 	pid_t root_pid;				/* [container] root pid */
 	struct task_struct *root_task;		/* [container] root task */
 	struct nsproxy *root_nsproxy;		/* [container] root nsproxy */
@@ -34,6 +35,9 @@ struct ckpt_ctx {
 	struct file *file;	/* input/output file */
 	int total;		/* total read/written */
 
+	struct task_struct **tasks_arr;	/* array of all tasks in container */
+	int nr_tasks;			/* size of tasks array */
+
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
 };
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 58be760..3046e2c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -198,6 +198,10 @@ int sysctl_legacy_va_layout;
 extern int prove_locking;
 extern int lock_stat;
 
+#ifdef CONFIG_CHECKPOINT
+extern int ckpt_unpriv_allowed;
+#endif
+
 /* The default sysctl tables: */
 
 static struct ctl_table root_table[] = {
@@ -990,6 +994,19 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_CHECKPOINT
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "ckpt_unpriv_allowed",
+		.data		= &ckpt_unpriv_allowed,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 
 /*
  * NOTE: do not add new entries to this table unless you have read
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 25/80] c/r: checkpoint multiple processes
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Checkpointing of multiple processes works by recording the tasks tree
structure below a given "root" task. The root task is expected to be a
container init, and then an entire container is checkpointed. However,
passing CHECKPOINT_SUBTREE to checkpoint(2) relaxes this requirement
and allows to checkpoint a subtree of processes from the root task.

For a given root task, do a DFS scan of the tasks tree and collect them
into an array (keeping a reference to each task). Using DFS simplifies
the recreation of tasks either in user space or kernel space. For each
task collected, test if it can be checkpointed, and save its pid, tgid,
and ppid.

The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.

Whether checkpoints and restarts require CAP_SYS_ADMIN is determined
by sysctl 'ckpt_unpriv_allowed': if 1, then regular permission checks
are intended to prevent privilege escalation, however if 0 it prevents
unprivileged users from exploiting any privilege escalation bugs.

The logic is suitable for creation of processes during restart either
in userspace or by the kernel.

Currently we ignore threads and zombies.

Changelog[v18]:
  - Replace some EAGAIN with EBUSY
  - Add a few more ckpt_write_err()s
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
Changelog[v16]:
  - CHECKPOINT_SUBTREE flags allows subtree (not whole container)
  - sysctl variable 'ckpt_unpriv_allowed' controls needed privileges
Changelog[v14]:
  - Refuse non-self checkpoint if target task isn't frozen
  - Refuse checkpoint (for now) if task is ptraced
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Check retval of ckpt_tree_count_tasks() in ckpt_build_tree()
  - Discard 'h.parent' field
  - Check whether calls to ckpt_hbuf_get() fail
  - Disallow threads or siblings to container init
Changelog[v13]:
  - Release tasklist_lock in error path in ckpt_tree_count_tasks()
  - Use separate index for 'tasks_arr' and 'hh' in ckpt_write_pids()
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c          |  295 ++++++++++++++++++++++++++++++++++++--
 checkpoint/restart.c             |    2 +-
 checkpoint/sys.c                 |   33 ++++-
 include/linux/checkpoint.h       |    6 +
 include/linux/checkpoint_hdr.h   |   16 ++-
 include/linux/checkpoint_types.h |    4 +
 kernel/sysctl.c                  |   17 +++
 7 files changed, 355 insertions(+), 18 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 554400c..fc02436 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -356,8 +356,27 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* dump all tasks in ctx->tasks_arr[] */
+static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
+{
+	int n, ret = 0;
+
+	for (n = 0; n < ctx->nr_tasks; n++) {
+		ckpt_debug("dumping task #%d\n", n);
+		ret = checkpoint_task(ctx, ctx->tasks_arr[n]);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+
 static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
+	struct task_struct *root = ctx->root_task;
+
+	ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
+
 	if (t->state == TASK_DEAD) {
 		__ckpt_write_err(ctx, "TE", "task state EXIT_DEAD\n", -EBUSY);
 		return -EBUSY;
@@ -380,15 +399,258 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -EBUSY;
 	}
 
+	/*
+	 * FIX: for now, disallow siblings of container init created
+	 * via CLONE_PARENT (unclear if they will remain possible)
+	 */
+	if (ctx->root_init && t != root && t->tgid != root->tgid &&
+	    t->real_parent == root->real_parent) {
+		__ckpt_write_err(ctx, "TE", "task is sibling of root", -EINVAL);
+		return -EINVAL;
+	}
+
+	/* FIX: change this when namespaces are added */
+	if (task_nsproxy(t) != ctx->root_nsproxy)
+		return -EPERM;
+
+	return 0;
+}
+
+#define CKPT_HDR_PIDS_CHUNK	256
+
+static int checkpoint_pids(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pids *h;
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	struct task_struct **tasks_arr;
+	int nr_tasks, n, pos = 0, ret = 0;
+
+	ns = ctx->root_nsproxy->pid_ns;
+	tasks_arr = ctx->tasks_arr;
+	nr_tasks = ctx->nr_tasks;
+	BUG_ON(nr_tasks <= 0);
+
+	ret = ckpt_write_obj_type(ctx, NULL,
+				  sizeof(*h) * nr_tasks,
+				  CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+
+	h = ckpt_hdr_get(ctx, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
+	if (!h)
+		return -ENOMEM;
+
+	do {
+		rcu_read_lock();
+		for (n = 0; n < min(nr_tasks, CKPT_HDR_PIDS_CHUNK); n++) {
+			task = tasks_arr[pos];
+
+			h[n].vpid = task_pid_nr_ns(task, ns);
+			h[n].vtgid = task_tgid_nr_ns(task, ns);
+			h[n].vpgid = task_pgrp_nr_ns(task, ns);
+			h[n].vsid = task_session_nr_ns(task, ns);
+			h[n].vppid = task_tgid_nr_ns(task->real_parent, ns);
+			ckpt_debug("task[%d]: vpid %d vtgid %d parent %d\n",
+				   pos, h[n].vpid, h[n].vtgid, h[n].vppid);
+			pos++;
+		}
+		rcu_read_unlock();
+
+		n = min(nr_tasks, CKPT_HDR_PIDS_CHUNK);
+		ret = ckpt_kwrite(ctx, h, n * sizeof(*h));
+		if (ret < 0)
+			break;
+
+		nr_tasks -= n;
+	} while (nr_tasks > 0);
+
+	_ckpt_hdr_put(ctx, h, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
+	return ret;
+}
+
+/* count number of tasks in tree (and optionally fill pid's in array) */
+static int tree_count_tasks(struct ckpt_ctx *ctx)
+{
+	struct task_struct *root;
+	struct task_struct *task;
+	struct task_struct *parent;
+	struct task_struct **tasks_arr = ctx->tasks_arr;
+	int nr_tasks = ctx->nr_tasks;
+	int nr = 0;
+	int ret;
+
+	read_lock(&tasklist_lock);
+
+	/* we hold the lock, so root_task->real_parent can't change */
+	task = ctx->root_task;
+	if (ctx->root_init) {
+		/* container-init: start from container parent */
+		parent = task->real_parent;
+		root = parent;
+	} else {
+		/* non-container-init: start from root task and down */
+		parent = NULL;
+		root = task;
+	}
+
+	/* count tasks via DFS scan of the tree */
+	while (1) {
+		ctx->tsk = task;  /* (for ckpt_write_err) */
+
+		/* is this task cool ? */
+		ret = may_checkpoint_task(ctx, task);
+		if (ret < 0) {
+			nr = ret;
+			break;
+		}
+		if (tasks_arr) {
+			/* unlikely... but if so then try again later */
+			if (nr == nr_tasks) {
+				nr = -EBUSY; /* cleanup in ckpt_ctx_free() */
+				break;
+			}
+			tasks_arr[nr] = task;
+			get_task_struct(task);
+		}
+		nr++;
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+		if (task == root)
+			break;
+	}
+	ctx->tsk = NULL;
+
+	read_unlock(&tasklist_lock);
+
+	if (nr < 0)
+		ckpt_write_err(ctx, "", NULL);
+	return nr;
+}
+
+/*
+ * build_tree - scan the tasks tree in DFS order and fill in array
+ * @ctx: checkpoint context
+ *
+ * Using DFS order simplifies the restart logic to re-create the tasks.
+ *
+ * On success, ctx->tasks_arr will be allocated and populated with all
+ * tasks (reference taken), and ctx->nr_tasks will hold the total count.
+ * The array is cleaned up by ckpt_ctx_free().
+ */
+static int build_tree(struct ckpt_ctx *ctx)
+{
+	int n, m;
+
+	/* count tasks (no side effects) */
+	n = tree_count_tasks(ctx);
+	if (n < 0)
+		return n;
+
+	ctx->nr_tasks = n;
+	ctx->tasks_arr = kzalloc(n * sizeof(*ctx->tasks_arr), GFP_KERNEL);
+	if (!ctx->tasks_arr)
+		return -ENOMEM;
+
+	/* count again (now will fill array) */
+	m = tree_count_tasks(ctx);
+
+	/* unlikely, but ... (cleanup in ckpt_ctx_free) */
+	if (m < 0)
+		return m;
+	else if (m != n)
+		return -EBUSY;
+
 	return 0;
 }
 
+/* dump the array that describes the tasks tree */
+static int checkpoint_tree(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tree *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TREE);
+	if (!h)
+		return -ENOMEM;
+
+	h->nr_tasks = ctx->nr_tasks;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	ret = checkpoint_pids(ctx);
+	return ret;
+}
+
+static struct task_struct *get_freezer_task(struct task_struct *root_task)
+{
+	struct task_struct *p;
+
+	/*
+	 * For the duration of checkpoint we deep-freeze all tasks.
+	 * Normally do it through the root task's freezer cgroup.
+	 * However, if the root task is also the current task (doing
+	 * self-checkpoint) we can't freeze ourselves. In this case,
+	 * choose the next available (non-dead) task instead. We'll
+	 * use its freezer cgroup to verify that all tasks belong to
+	 * the same cgroup.
+	 */
+
+	if (root_task != current) {
+		get_task_struct(root_task);
+		return root_task;
+	}
+
+	/* search among threads, then children */
+	read_lock(&tasklist_lock);
+
+	for (p = next_thread(root_task); p != root_task; p = next_thread(p)) {
+		if (p->state == TASK_DEAD)
+			continue;
+		if (!in_same_cgroup_freezer(p, root_task))
+			goto out;
+	}
+
+	list_for_each_entry(p, &root_task->children, sibling) {
+		if (p->state == TASK_DEAD)
+			continue;
+		if (!in_same_cgroup_freezer(p, root_task))
+			goto out;
+	}
+
+	p = NULL;
+ out:
+	read_unlock(&tasklist_lock);
+	if (p)
+		get_task_struct(p);
+	return p;
+}
+
 /* setup checkpoint-specific parts of ctx */
 static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
-	int ret;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -420,18 +682,14 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		ctx->root_nsproxy = nsproxy;
 
 	/* root freezer */
-	ctx->root_freezer = task;
-	geT_task_struct(task);
+	ctx->root_freezer = get_freezer_task(task);
 
-	ret = may_checkpoint_task(ctx, task);
-	if (ret) {
-		ckpt_write_err(ctx, "", NULL);
-		put_task_struct(task);
-		put_task_struct(task);
-		put_nsproxy(nsproxy);
-		ctx->root_nsproxy = NULL;
-		ctx->root_task = NULL;
-		return ret;
+	/* container init ? */
+	ctx->root_init = is_container_init(task);
+
+	if (!(ctx->uflags & CHECKPOINT_SUBTREE) && !ctx->root_init) {
+		ckpt_write_err(ctx, "E", "not container init", -EINVAL);
+		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
 	return 0;
@@ -447,14 +705,23 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 
 	if (ctx->root_freezer) {
 		ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
-		if (ret < 0)
+		if (ret < 0) {
+			ckpt_write_err(ctx, "E", "freezer cgroup failed", ret);
 			return ret;
+		}
 	}
 
+	ret = build_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = checkpoint_write_header(ctx);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_task(ctx, ctx->root_task);
+	ret = checkpoint_tree(ctx);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_all_tasks(ctx);
 	if (ret < 0)
 		goto out;
 	ret = checkpoint_write_tail(ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index fdad264..3f22403 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -364,7 +364,7 @@ static int restore_read_header(struct ckpt_ctx *ctx)
 	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
 	    h->patch != ((LINUX_VERSION_CODE) & 0xff))
 		goto out;
-	if (h->uflags)
+	if (h->uflags & ~CHECKPOINT_USER_FLAGS)
 		goto out;
 
 	ret = check_kernel_const(&h->constants);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index b37bc8c..cc94775 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -23,6 +23,14 @@
 #include <linux/checkpoint.h>
 
 /*
+ * ckpt_unpriv_allowed - sysctl controlled, do not allow checkpoints or
+ * restarts unless caller has CAP_SYS_ADMIN, if 0 (prevent unprivileged
+ * useres from expoitling any privilege escalation bugs). If it is 1,
+ * then regular permissions checks are intended to do the job.
+ */
+int ckpt_unpriv_allowed = 1;	/* default: allow */
+
+/*
  * Helpers to write(read) from(to) kernel space to(from) the checkpoint
  * image file descriptor (similar to how a core-dump is performed).
  *
@@ -166,11 +174,27 @@ void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
  * restart operation, and persists until the operation is completed.
  */
 
+static void task_arr_free(struct ckpt_ctx *ctx)
+{
+	int n;
+
+	for (n = 0; n < ctx->nr_tasks; n++) {
+		if (ctx->tasks_arr[n]) {
+			put_task_struct(ctx->tasks_arr[n]);
+			ctx->tasks_arr[n] = NULL;
+		}
+	}
+	kfree(ctx->tasks_arr);
+}
+
 static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
 
+	if (ctx->tasks_arr)
+		task_arr_free(ctx);
+
 	if (ctx->root_nsproxy)
 		put_nsproxy(ctx->root_nsproxy);
 	if (ctx->root_task)
@@ -220,10 +244,12 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
 	struct ckpt_ctx *ctx;
 	long ret;
 
-	/* no flags for now */
-	if (flags)
+	if (flags & ~CHECKPOINT_USER_FLAGS)
 		return -EINVAL;
 
+	if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	if (pid == 0)
 		pid = task_pid_vnr(current);
 	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_CHECKPOINT);
@@ -257,6 +283,9 @@ SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 	if (flags)
 		return -EINVAL;
 
+	if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 14c0a7f..a4650bb 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -12,6 +12,9 @@
 
 #define CHECKPOINT_VERSION  2
 
+/* checkpoint user flags */
+#define CHECKPOINT_SUBTREE	0x1
+
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
 
@@ -26,6 +29,9 @@
 #define CKPT_CTX_CHECKPOINT	(1 << CKPT_CTX_CHECKPOINT_BIT)
 #define CKPT_CTX_RESTART	(1 << CKPT_CTX_RESTART_BIT)
 
+/* ckpt_ctx: uflags */
+#define CHECKPOINT_USER_FLAGS		CHECKPOINT_SUBTREE
+
 
 extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
 extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b72c59c..26e10fb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -51,7 +51,8 @@ enum {
 	CKPT_HDR_BUFFER,
 	CKPT_HDR_STRING,
 
-	CKPT_HDR_TASK = 101,
+	CKPT_HDR_TREE = 101,
+	CKPT_HDR_TASK,
 	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
@@ -110,6 +111,19 @@ struct ckpt_hdr_tail {
 	__u64 magic;
 } __attribute__((aligned(8)));
 
+/* task tree */
+struct ckpt_hdr_tree {
+	struct ckpt_hdr h;
+	__s32 nr_tasks;
+} __attribute__((aligned(8)));
+
+struct ckpt_pids {
+	__s32 vpid;
+	__s32 vppid;
+	__s32 vtgid;
+	__s32 vpgid;
+	__s32 vsid;
+} __attribute__((aligned(8)));
 
 /* task data */
 struct ckpt_hdr_task {
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 046bdc4..c3399b3 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -22,6 +22,7 @@ struct ckpt_ctx {
 
 	ktime_t ktime_begin;	/* checkpoint start time */
 
+	int root_init;				/* [container] root init ? */
 	pid_t root_pid;				/* [container] root pid */
 	struct task_struct *root_task;		/* [container] root task */
 	struct nsproxy *root_nsproxy;		/* [container] root nsproxy */
@@ -34,6 +35,9 @@ struct ckpt_ctx {
 	struct file *file;	/* input/output file */
 	int total;		/* total read/written */
 
+	struct task_struct **tasks_arr;	/* array of all tasks in container */
+	int nr_tasks;			/* size of tasks array */
+
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
 };
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 58be760..3046e2c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -198,6 +198,10 @@ int sysctl_legacy_va_layout;
 extern int prove_locking;
 extern int lock_stat;
 
+#ifdef CONFIG_CHECKPOINT
+extern int ckpt_unpriv_allowed;
+#endif
+
 /* The default sysctl tables: */
 
 static struct ctl_table root_table[] = {
@@ -990,6 +994,19 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_CHECKPOINT
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "ckpt_unpriv_allowed",
+		.data		= &ckpt_unpriv_allowed,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 
 /*
  * NOTE: do not add new entries to this table unless you have read
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 26/80] c/r: restart multiple processes
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (24 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 25/80] c/r: checkpoint multiple processes Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 27/80] c/r: introduce PF_RESTARTING, and skip notification on exit Oren Laadan
                     ` (55 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task will restart
itself, in the same order in which the tasks were saved. The internals of
the syscall take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

There is one special task - the coordinator - that is not part of the
restarted hierarchy. The coordinator task allocates the restart
context (ctx) and orchestrates the restart. Thus even if a restart
fails after, or during the restore of the root task, the user
perceives a clean exit and an error message.

The coordinator task will:
 1) read header and tree, create @ctx (wake up restarting tasks)
 2) set the ->checkpoint_ctx field of itself and all descendants
 3) wait for all restarting tasks to reach sync point #1
 4) activate first restarting task (root task)
 5) wait for all other tasks to complete and reach sync point #3
 6) wake up everybody

(Note that in step #2 the coordinator assumes that the entire task
hierarchy exists by the time it enters sys_restart; this is arranged
in user space by 'mktree')

A restarting task has three sync points:
 1) wait for its ->checkpoint_ctx to be set (by the coordinator)
 2) wait for the task's turn to restore (be active)
 [...now the task restores its state...]
 3) wait for all other tasks to complete

The third sync point ensures that a task may only resume execution
after all tasks have successfully restored their state (or fail if an
error has occurred). This prevents tasks from returning to user space
prematurely, before the entire restart completes.
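
For illustration only (this is not part of the patch), the coordinator
steps and the three sync points can be mimicked in user space with a
small cooperative state machine. All sim_* names are hypothetical, and
polling stands in for the kernel wait queue and completion:

```c
#include <assert.h>

#define SIM_NR_TASKS 3

enum sim_state {
	SIM_WAIT_CTX,	/* sync point #1: ->checkpoint_ctx not set yet */
	SIM_WAIT_TURN,	/* sync point #2: waiting to become active */
	SIM_WAIT_SYNC,	/* sync point #3: restored, waiting for the rest */
	SIM_RESUMED,	/* allowed back to "user space" */
};

struct sim_ctx {
	int ctx_set;			/* coordinator published the context */
	int active;			/* index of the active task, -1 = none */
	int complete;			/* every task has restored its state */
	int nr_restored;
	int order[SIM_NR_TASKS];	/* order in which tasks restored */
};

/* Advance one task past at most one sync point; return 1 if it moved. */
static int sim_task_step(struct sim_ctx *ctx, enum sim_state *st, int index)
{
	switch (*st) {
	case SIM_WAIT_CTX:
		if (!ctx->ctx_set)
			return 0;
		*st = SIM_WAIT_TURN;
		return 1;
	case SIM_WAIT_TURN:
		if (ctx->active != index)
			return 0;
		assert(ctx->nr_restored < SIM_NR_TASKS);
		ctx->order[ctx->nr_restored++] = index;	/* "restore state" */
		ctx->active++;				/* activate next task */
		*st = SIM_WAIT_SYNC;
		return 1;
	case SIM_WAIT_SYNC:
		if (!ctx->complete)
			return 0;
		*st = SIM_RESUMED;
		return 1;
	default:
		return 0;
	}
}

/* Play the coordinator; return how many tasks resumed execution. */
static int sim_restart(struct sim_ctx *ctx)
{
	enum sim_state st[SIM_NR_TASKS];
	int i, moved, resumed = 0;

	ctx->ctx_set = ctx->complete = ctx->nr_restored = 0;
	ctx->active = -1;
	for (i = 0; i < SIM_NR_TASKS; i++)
		st[i] = SIM_WAIT_CTX;

	ctx->ctx_set = 1;	/* coordinator step 2: set the context */
	ctx->active = 0;	/* coordinator step 4: activate root task */
	do {
		moved = 0;
		/* poll in reverse: the ctx, not the scheduler, sets the order */
		for (i = SIM_NR_TASKS - 1; i >= 0; i--)
			moved |= sim_task_step(ctx, &st[i], i);
		/* coordinator steps 5 and 6: all restored, wake up everybody */
		if (ctx->active == SIM_NR_TASKS)
			ctx->complete = 1;
	} while (moved);

	for (i = 0; i < SIM_NR_TASKS; i++)
		resumed += (st[i] == SIM_RESUMED);
	return resumed;
}
```

Although the tasks are polled in reverse order, they restore in array
order, and none of them reaches SIM_RESUMED before the last one has
restored.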

If a single task wishes to restart, it can set the "RESTART_TASKSELF"
flag to restart(2) to skip the logic of the coordinator.

The root-task is a child of the coordinator, identified by the @pid
given to sys_restart() in the pid-ns of the coordinator. Restarting
tasks that aren't the coordinator should set the @pid argument of
the restart(2) syscall to zero.
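
As an illustration only (not part of the patch), the calling
convention can be summarized by a small user-space helper. The name
sketch_restart_role() and the enum are hypothetical; the flag values
are the ones this patch adds to include/linux/checkpoint.h, and the
decision mirrors how sys_restart() allocates a ctx only when @pid is
non-zero:

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

#define RESTART_TASKSELF	0x1
#define RESTART_FROZEN		0x2
#define RESTART_USER_FLAGS	(RESTART_TASKSELF | RESTART_FROZEN)

enum sketch_role {
	SKETCH_TASK,		/* @pid == 0: waits for a coordinator */
	SKETCH_COORDINATOR,	/* @pid != 0: orchestrates the restart */
	SKETCH_SELF_RESTART,	/* @pid != 0 and RESTART_TASKSELF */
};

/* Return the role implied by the arguments, or -EINVAL on bad flags. */
static int sketch_restart_role(pid_t pid, unsigned long flags)
{
	assert(pid >= 0);

	if (flags & ~(unsigned long)RESTART_USER_FLAGS)
		return -EINVAL;
	if (!pid)
		return SKETCH_TASK;
	if (flags & RESTART_TASKSELF)
		return SKETCH_SELF_RESTART;
	return SKETCH_COORDINATOR;
}
```

In this sketch RESTART_FROZEN does not change the role; it only asks
for the restarted tasks to be frozen once the restart completes.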

All tasks explicitly test for an error flag on the checkpoint context
when they wake up from sync points.  If an error occurs during the
restart of some task, it will mark the @ctx with an error flag, and
wake up the other tasks.
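
The "first error wins" idea behind the error flag can be sketched in
user space with C11 atomics. This is only an analogy: the sketch_*
names are hypothetical, and atomic_fetch_or() stands in for the
kernel's test_and_set_bit(). Both return the previous state, so only
the caller that finds the bit clear records its error code:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define SKETCH_CTX_ERROR	(1u << 3)	/* cf. CKPT_CTX_ERROR_BIT */

struct sketch_ctx {
	atomic_uint kflags;
	int first_errno;	/* meaningful once the error bit is set */
};

static void sketch_ctx_init(struct sketch_ctx *ctx)
{
	atomic_init(&ctx->kflags, 0);
	ctx->first_errno = 0;
}

/* Mark the ctx as failed; return 1 if we were the first to fail. */
static int sketch_set_error(struct sketch_ctx *ctx, int err)
{
	unsigned int old;

	assert(err < 0);
	old = atomic_fetch_or(&ctx->kflags, SKETCH_CTX_ERROR);
	if (old & SKETCH_CTX_ERROR)
		return 0;	/* somebody failed before us: keep theirs */
	ctx->first_errno = err;
	return 1;
}

static int sketch_test_error(struct sketch_ctx *ctx)
{
	return (atomic_load(&ctx->kflags) & SKETCH_CTX_ERROR) != 0;
}

/* What a coordinator would report; -EINTR if errno is not yet visible. */
static int sketch_report(struct sketch_ctx *ctx)
{
	if (!sketch_test_error(ctx))
		return 0;
	return ctx->first_errno ? ctx->first_errno : -EINTR;
}
```

Calling sketch_set_error(&ctx, -ESRCH) followed by
sketch_set_error(&ctx, -EBUSY) leaves -ESRCH as the reported error.
As in the patch, there is a short window in which the flag is set but
the error code is not yet stored, hence the -EINTR fallback.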

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (@ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.
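
The bookkeeping of the "current position" can be sketched as follows.
This is a user-space illustration, not the kernel code: the sketch_*
names are hypothetical, struct sketch_pids keeps only the vpid member
of struct ckpt_pids, and the upper-bound check in sketch_active_pid()
is an addition of the sketch:

```c
#include <assert.h>
#include <sys/types.h>

struct sketch_pids {
	pid_t vpid;		/* subset of struct ckpt_pids */
};

struct sketch_order {
	const struct sketch_pids *pids_arr;	/* saved at checkpoint */
	int nr_pids;
	int active_pid;		/* position in pids_arr, -1 before start */
};

/* pid of the task whose turn it is, or 0 if nobody is active */
static pid_t sketch_active_pid(const struct sketch_order *o)
{
	int active = o->active_pid;

	if (active < 0 || active >= o->nr_pids)
		return 0;
	return o->pids_arr[active].vpid;
}

/*
 * Advance the position, as the active task does once it is done.
 * Return the pid to wake up next, or 0 when all tasks had their turn
 * (the point at which the coordinator is notified instead).
 */
static pid_t sketch_activate_next(struct sketch_order *o)
{
	assert(o->active_pid < o->nr_pids);
	o->active_pid++;
	return sketch_active_pid(o);
}
```

Starting from active_pid == -1, the first call activates the first
entry of the array (the init or root task); each later call hands the
turn to the following entry.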

Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In this case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Changelog[v18]:
  - Fix race of prepare_descendants() with an ongoing fork()
  - Track and report the first error if restart fails
  - Tighten logic to protect against bogus pids in input
  - [Matt Helsley] Improve debug output from ckpt_notify_error()
Changelog[v17]:
  - Add uflag RESTART_FROZEN to freeze tasks after restart
  - Fix restore_retval() and use only for restarting tasks
  - Coordinator converts -ERESTART... to -EINTR
  - Coordinator marks and sets descendants' ->checkpoint_ctx
  - Coordinator properly detects errors when woken up from wait
  - Fix race where root_task could kick start too early
  - Add a sync point for restarting tasks
  - Multiple fixes to restart logic
Changelog[v14]:
  - Revert change to pr_debug(), back to ckpt_debug()
  - Discard field 'h.parent'
  - Check whether calls to ckpt_hbuf_get() fail
Changelog[v13]:
  - Clear root_task->checkpoint_ctx regardless of error condition
  - Remove unused argument 'ctx' from do_restore_task() prototype
  - Remove unused member 'pids_err' from 'struct ckpt_ctx'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/restart.c             |  481 ++++++++++++++++++++++++++++++++++++--
 checkpoint/sys.c                 |   33 ++-
 include/linux/checkpoint.h       |   32 +++
 include/linux/checkpoint_types.h |   16 ++-
 include/linux/sched.h            |    4 +
 kernel/exit.c                    |    5 +
 kernel/fork.c                    |    8 +
 7 files changed, 549 insertions(+), 30 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 3f22403..4da09b7 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -13,7 +13,10 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
+#include <linux/ptrace.h>
+#include <linux/freezer.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
 #include <asm/syscall.h>
@@ -413,6 +416,434 @@ static int restore_read_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* restore_read_tree - read the tasks tree into the checkpoint context */
+static int restore_read_tree(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tree *h;
+	int size, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TREE);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->nr_tasks <= 0)
+		goto out;
+
+	ctx->nr_pids = h->nr_tasks;
+	size = sizeof(*ctx->pids_arr) * ctx->nr_pids;
+	if (size <= 0)		/* overflow ? */
+		goto out;
+
+	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+	if (!ctx->pids_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = _ckpt_read_buffer(ctx, ctx->pids_arr, size);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static inline pid_t get_active_pid(struct ckpt_ctx *ctx)
+{
+	int active = ctx->active_pid;
+	return active >= 0 ? ctx->pids_arr[active].vpid : 0;
+}
+
+static inline int is_task_active(struct ckpt_ctx *ctx, pid_t pid)
+{
+	return get_active_pid(ctx) == pid;
+}
+
+static inline void _restore_notify_error(struct ckpt_ctx *ctx, int errno)
+{
+	ckpt_set_ctx_error(ctx, errno);
+	complete(&ctx->complete);
+}
+
+/* Need to call ckpt_debug such that it will get the correct source location */
+#define restore_notify_error(ctx, errno) \
+do { \
+	ckpt_debug("ctx root pid %d err %d", ctx->root_pid, errno); \
+	_restore_notify_error(ctx, errno); \
+} while(0)
+
+static int restore_activate_next(struct ckpt_ctx *ctx)
+{
+	struct task_struct *task;
+	int active;
+	pid_t pid;
+
+	active = ++ctx->active_pid;
+	if (active >= ctx->nr_pids) {
+		complete(&ctx->complete);
+		return 0;
+	}
+
+	pid = get_active_pid(ctx);
+	ckpt_debug("active pid %d (%d < %d)\n", pid, active, ctx->nr_pids);
+
+	rcu_read_lock();
+	task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
+	/* target task must have same restart context */
+	if (task && task->checkpoint_ctx == ctx)
+		wake_up_process(task);
+	else
+		task = NULL;
+	rcu_read_unlock();
+
+	if (!task) {
+		restore_notify_error(ctx, -ESRCH);
+		return -ESRCH;
+	}
+
+	return 0;
+}
+
+static int wait_task_active(struct ckpt_ctx *ctx)
+{
+	pid_t pid = task_pid_vnr(current);
+	int ret;
+
+	ckpt_debug("pid %d waiting\n", pid);
+	ret = wait_event_interruptible(ctx->waitq,
+				       is_task_active(ctx, pid) ||
+				       ckpt_test_ctx_error(ctx));
+	if (!ret && ckpt_test_ctx_error(ctx)) {
+		force_sig(SIGKILL, current);
+		ret = -EBUSY;
+	}
+	return ret;
+}
+
+static int wait_task_sync(struct ckpt_ctx *ctx)
+{
+	ckpt_debug("pid %d syncing\n", task_pid_vnr(current));
+	wait_event_interruptible(ctx->waitq, ckpt_test_ctx_complete(ctx));
+	if (ckpt_test_ctx_error(ctx)) {
+		force_sig(SIGKILL, current);
+		return -EBUSY;
+	}
+	return 0;
+}
+
+static int do_restore_task(void)
+{
+	DECLARE_WAIT_QUEUE_HEAD(waitq);
+	struct ckpt_ctx *ctx, *old_ctx;
+	int ret;
+
+	/*
+	 * Wait for coordinator to become visible, then grab a
+	 * reference to its restart context.
+	 */
+	ret = wait_event_interruptible(waitq, current->checkpoint_ctx);
+	if (ret < 0)
+		return ret;
+
+	ctx = xchg(&current->checkpoint_ctx, NULL);
+	if (!ctx)
+		return -EAGAIN;
+	ckpt_ctx_get(ctx);
+
+	/*
+	 * Put the @ctx back on our task_struct. If an ancestor tried
+	 * to prepare_descendants() on us (although extremely unlikely)
+	 * we will encounter the ctx that he xchg()ed there and bail.
+	 */
+	old_ctx = xchg(&current->checkpoint_ctx, ctx);
+	if (old_ctx) {
+		ckpt_debug("self-set of checkpoint_ctx failed\n");
+
+		/* alert coordinator of unexpected ctx */
+		restore_notify_error(old_ctx, -EAGAIN);
+		ckpt_ctx_put(old_ctx);
+
+		/* alert our coordinator that we bail */
+		restore_notify_error(ctx, -EAGAIN);
+		ckpt_ctx_put(ctx);
+		return -EAGAIN;
+	}
+
+	/* wait for our turn, do the restore, and tell next task in line */
+	ret = wait_task_active(ctx);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_task(ctx);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_activate_next(ctx);
+	if (ret < 0)
+		goto out;
+
+	ret = wait_task_sync(ctx);
+ out:
+	old_ctx = xchg(&current->checkpoint_ctx, NULL);
+	if (old_ctx)
+		ckpt_ctx_put(old_ctx);
+
+	/* if we're first to fail - notify others */
+	if (ret < 0 && !ckpt_test_ctx_error(ctx)) {
+		restore_notify_error(ctx, ret);
+		wake_up_all(&ctx->waitq);
+	}
+
+	ckpt_ctx_put(ctx);
+	return ret;
+}
+
+/**
+ * prepare_descendants - set ->checkpoint_ctx of all descendants
+ * @ctx: checkpoint context
+ * @root: root process for restart
+ *
+ * Called by the coordinator to set the ->checkpoint_ctx pointer of the
+ * root task and all its descendants.
+ */
+static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
+{
+	struct task_struct *leader = root;
+	struct task_struct *parent = NULL;
+	struct task_struct *task = root;
+	struct ckpt_ctx *old_ctx;
+	int nr_pids = ctx->nr_pids;
+	int ret = 0;
+
+	read_lock(&tasklist_lock);
+	while (nr_pids) {
+		ckpt_debug("consider task %d\n", task_pid_vnr(task));
+		if (task_ptrace(task) & PT_PTRACED) {
+			ret = -EBUSY;
+			break;
+		}
+		/*
+		 * Set task->checkpoint_ctx of all non-zombie descendants.
+		 * If a descendant already has a ->checkpoint_ctx, it
+		 * must be a coordinator (for a different restart ?) so
+		 * we fail.
+		 *
+		 * Note that own ancestors cannot interfere since they
+		 * won't descend past us, as own ->checkpoint_ctx must
+		 * already be set.
+		 */
+		if (!task->exit_state) {
+			ckpt_ctx_get(ctx);
+			old_ctx = xchg(&task->checkpoint_ctx, ctx);
+			if (old_ctx) {
+				ckpt_debug("bad task %d\n",task_pid_vnr(task));
+				ckpt_ctx_put(old_ctx);
+				ret = -EAGAIN;
+				break;
+			}
+			ckpt_debug("prepare task %d\n", task_pid_vnr(task));
+			wake_up_process(task);
+			nr_pids--;
+		}
+
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+		if (task == root) {
+			/* in case root task is multi-threaded */
+			root = task = next_thread(task);
+			if (root == leader)
+				break;
+		}
+	}
+	read_unlock(&tasklist_lock);
+	ckpt_debug("left %d ret %d root/task %d\n", nr_pids, ret, task == root);
+
+	/* fail unless number of processes matches */
+	if (!ret && (nr_pids || task != root))
+		ret = -ESRCH;
+
+	return ret;
+}
+
+static int wait_all_tasks_finish(struct ckpt_ctx *ctx)
+{
+	int ret;
+
+	init_completion(&ctx->complete);
+
+	BUG_ON(ctx->active_pid != -1);
+	ret = restore_activate_next(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+
+	ckpt_debug("final sync kflags %#lx\n", ctx->kflags);
+	/*
+	 * Usually when restart fails, the restarting task will first
+	 * set @ctx->errno before waking us up. In the rare event that
+	 * @ctx->errno is unset, we must have been interrupted and
+	 * then checked for an error prior to ctx->errno update...
+	 */
+	if (ckpt_test_ctx_error(ctx))
+		ret = ctx->errno ? ctx->errno : -EINTR;
+
+	return ret;
+}
+
+static struct task_struct *choose_root_task(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task, *root = NULL;
+
+	if (ctx->uflags & RESTART_TASKSELF) {
+		ctx->root_pid = pid;
+		ctx->root_task = current;
+		get_task_struct(current);
+		return current;
+	}
+
+	read_lock(&tasklist_lock);
+	list_for_each_entry(task, &current->children, sibling) {
+		if (task_pid_vnr(task) == pid) {
+			get_task_struct(task);
+			root = task;	/* loop cursor is never NULL */
+			ctx->root_pid = pid;
+			break;
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	return root;
+}
+
+/* setup restart-specific parts of ctx */
+static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct nsproxy *nsproxy;
+
+	/*
+	 * No need for explicit cleanup here, because if an error
+	 * occurs then ckpt_ctx_free() is eventually called.
+	 */
+
+	ctx->root_task = choose_root_task(ctx, pid);
+	if (!ctx->root_task)
+		return -ESRCH;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(ctx->root_task);
+	if (nsproxy) {
+		get_nsproxy(nsproxy);
+		ctx->root_nsproxy = nsproxy;
+	}
+	rcu_read_unlock();
+	if (!nsproxy)
+		return -ESRCH;
+
+	ctx->active_pid = -1;	/* see restore_activate_next, get_active_pid */
+
+	return 0;
+}
+
+static int do_restore_coord(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct ckpt_ctx *old_ctx;
+	int ret;
+
+	ret = restore_read_header(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_read_tree(ctx);
+	if (ret < 0)
+		return ret;
+
+	if ((ctx->uflags & RESTART_TASKSELF) && ctx->nr_pids != 1)
+		return -EINVAL;
+
+	ret = init_restart_ctx(ctx, pid);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Populate own ->checkpoint_ctx: if an ancestor attempts to
+	 * prepare_descendants() on us, it will fail. Furthermore,
+	 * that ancestor won't proceed deeper to interfere with our
+	 * descendants that are restarting (e.g. by xchg()ing their
+	 * ->checkpoint_ctx pointer temporarily).
+	 */
+	ckpt_ctx_get(ctx);
+	old_ctx = xchg(&current->checkpoint_ctx, ctx);
+	if (old_ctx) {
+		/*
+		 * We are a bad-behaving descendant: an ancestor must
+		 * have done prepare_descendants() on us as part of a
+		 * restart. Oh, well ... alert ancestor (coordinator)
+		 * with an error on @old_ctx.
+		 */
+		ckpt_debug("bad behaving checkpoint_ctx\n");
+		restore_notify_error(old_ctx, -EBUSY);
+		ckpt_ctx_put(old_ctx);
+		ret = -EBUSY;
+		goto out;
+	}
+
+	if (ctx->uflags & RESTART_TASKSELF) {
+		ret = restore_task(ctx);
+		if (ret < 0)
+			goto out;
+	} else {
+		/* prepare descendants' t->checkpoint_ctx point to coord */
+		ret = prepare_descendants(ctx, ctx->root_task);
+		if (ret < 0)
+			goto out;
+		/* wait for all other tasks to complete do_restore_task() */
+		ret = wait_all_tasks_finish(ctx);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_read_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	if (ctx->uflags & RESTART_FROZEN) {
+		ret = cgroup_freezer_make_frozen(ctx->root_task);
+		ckpt_debug("freezing restart tasks ... %d\n", ret);
+	}
+ out:
+	if (ret < 0)
+		ckpt_set_ctx_error(ctx, ret);
+	else
+		ckpt_set_ctx_success(ctx);
+
+	if (!(ctx->uflags & RESTART_TASKSELF))
+		wake_up_all(&ctx->waitq);
+	/*
+	 * If an ancestor attempts to prepare_descendants() on us, it
+	 * xchg()s our ->checkpoint_ctx, and frees it. Our @ctx will,
+	 * instead, point to the ctx that said ancestor placed.
+	 */
+	ctx = xchg(&current->checkpoint_ctx, NULL);
+	ckpt_ctx_put(ctx);
+
+	return ret;
+}
+
 static long restore_retval(void)
 {
 	struct pt_regs *regs = task_pt_regs(current);
@@ -461,28 +892,40 @@ static long restore_retval(void)
 	return ret;
 }
 
-/* setup restart-specific parts of ctx */
-static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
-{
-	return 0;
-}
-
 long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
 	long ret;
 
-	ret = init_restart_ctx(ctx, pid);
-	if (ret < 0)
-		return ret;
-	ret = restore_read_header(ctx);
-	if (ret < 0)
-		return ret;
-	ret = restore_task(ctx);
-	if (ret < 0)
-		return ret;
-	ret = restore_read_tail(ctx);
-	if (ret < 0)
-		return ret;
+	if (ctx)
+		ret = do_restore_coord(ctx, pid);
+	else
+		ret = do_restore_task();
 
-	return restore_retval();
+	/* restart(2) isn't idempotent: should not be auto-restarted */
+	if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+	    ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
+		ret = -EINTR;
+
+	/*
+	 * The retval that we return to the caller when all goes
+	 * well: this is either the retval from the original syscall
+	 * that was interrupted during checkpoint, or the contents of
+	 * (saved) eax if the task was in userspace.
+	 *
+	 * The coordinator (ctx!=NULL) is exempt: don't adjust its retval.
+	 * But in self-restart (where RESTART_TASKSELF), the coordinator
+	 * _itself_ is a restarting task.
+	 */
+
+	if (!ctx || (ctx->uflags & RESTART_TASKSELF)) {
+		if (ret < 0) {
+			/* partial restore is undefined: terminate */
+			ckpt_debug("restart err %ld, exiting\n", ret);
+			force_sig(SIGKILL, current);
+		} else {
+			ret = restore_retval();
+		}
+	}
+
+	return ret;
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index cc94775..c8921f0 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -189,6 +189,8 @@ static void task_arr_free(struct ckpt_ctx *ctx)
 
 static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
+	BUG_ON(atomic_read(&ctx->refcount));
+
 	if (ctx->file)
 		fput(ctx->file);
 
@@ -202,6 +204,8 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->root_freezer)
 		put_task_struct(ctx->root_freezer);
 
+	kfree(ctx->pids_arr);
+
 	kfree(ctx);
 }
 
@@ -219,17 +223,32 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	ctx->kflags = kflags;
 	ctx->ktime_begin = ktime_get();
 
+	atomic_set(&ctx->refcount, 0);
+	init_waitqueue_head(&ctx->waitq);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
 		goto err;
 
+	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
 	ckpt_ctx_free(ctx);
 	return ERR_PTR(err);
 }
 
+void ckpt_ctx_get(struct ckpt_ctx *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+void ckpt_ctx_put(struct ckpt_ctx *ctx)
+{
+	if (ctx && atomic_dec_and_test(&ctx->refcount))
+		ckpt_ctx_free(ctx);
+}
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -261,7 +280,7 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
 	if (!ret)
 		ret = ctx->crid;
 
-	ckpt_ctx_free(ctx);
+	ckpt_ctx_put(ctx);
 	return ret;
 }
 
@@ -280,24 +299,20 @@ SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 	long ret;
 
 	/* no flags for now */
-	if (flags)
+	if (flags & ~RESTART_USER_FLAGS)
 		return -EINVAL;
 
 	if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
+	if (pid)
+		ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
 
 	ret = do_restart(ctx, pid);
 
-	/* restart(2) isn't idempotent: can't restart syscall */
-	if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
-	    ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
-		ret = -EINTR;
-
-	ckpt_ctx_free(ctx);
+	ckpt_ctx_put(ctx);
 	return ret;
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index a4650bb..4227b31 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -15,6 +15,10 @@
 /* checkpoint user flags */
 #define CHECKPOINT_SUBTREE	0x1
 
+/* restart user flags */
+#define RESTART_TASKSELF	0x1
+#define RESTART_FROZEN		0x2
+
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
 
@@ -25,13 +29,19 @@
 /* ckpt_ctx: kflags */
 #define CKPT_CTX_CHECKPOINT_BIT		0
 #define CKPT_CTX_RESTART_BIT		1
+#define CKPT_CTX_SUCCESS_BIT		2
+#define CKPT_CTX_ERROR_BIT		3
 
 #define CKPT_CTX_CHECKPOINT	(1 << CKPT_CTX_CHECKPOINT_BIT)
 #define CKPT_CTX_RESTART	(1 << CKPT_CTX_RESTART_BIT)
+#define CKPT_CTX_SUCCESS	(1 << CKPT_CTX_SUCCESS_BIT)
+#define CKPT_CTX_ERROR		(1 << CKPT_CTX_ERROR_BIT)
 
 /* ckpt_ctx: uflags */
 #define CHECKPOINT_USER_FLAGS		CHECKPOINT_SUBTREE
+#define RESTART_USER_FLAGS		(RESTART_TASKSELF | RESTART_FROZEN)
 
+extern void exit_checkpoint(struct task_struct *tsk);
 
 extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
 extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
@@ -60,6 +70,28 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+/* ckpt kflags */
+#define ckpt_set_ctx_kflag(__ctx, __kflag)  \
+	set_bit(__kflag##_BIT, &(__ctx)->kflags)
+#define ckpt_test_and_set_ctx_kflag(__ctx, __kflag)  \
+	test_and_set_bit(__kflag##_BIT, &(__ctx)->kflags)
+
+#define ckpt_set_ctx_success(ctx)  ckpt_set_ctx_kflag(ctx, CKPT_CTX_SUCCESS)
+
+static inline void ckpt_set_ctx_error(struct ckpt_ctx *ctx, int errno)
+{
+	if (!ckpt_test_and_set_ctx_kflag(ctx, CKPT_CTX_ERROR))
+		ctx->errno = errno;
+}
+
+#define ckpt_test_ctx_error(ctx)  \
+	((ctx)->kflags & CKPT_CTX_ERROR)
+#define ckpt_test_ctx_complete(ctx)  \
+	((ctx)->kflags & (CKPT_CTX_SUCCESS | CKPT_CTX_ERROR))
+
+extern void ckpt_ctx_get(struct ckpt_ctx *ctx);
+extern void ckpt_ctx_put(struct ckpt_ctx *ctx);
+
 extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
 extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
 
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index c3399b3..f74deac 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -16,6 +16,7 @@
 #include <linux/nsproxy.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
+#include <linux/wait.h>
 
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
@@ -35,11 +36,22 @@ struct ckpt_ctx {
 	struct file *file;	/* input/output file */
 	int total;		/* total read/written */
 
-	struct task_struct **tasks_arr;	/* array of all tasks in container */
-	int nr_tasks;			/* size of tasks array */
+	atomic_t refcount;
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
+	int errno;		/* restart: errno that caused failure */
+
+	/* [multi-process checkpoint] */
+	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+	int nr_tasks;                   /* size of tasks array */
+
+	/* [multi-process restart] */
+	struct ckpt_pids *pids_arr;	/* array of all pids [restart] */
+	int nr_pids;			/* size of pids array */
+	int active_pid;			/* (next) position in pids array */
+	struct completion complete;	/* container root and other tasks on */
+	wait_queue_head_t waitq;	/* start, end, and restart ordering */
 };
 
 #endif /* __KERNEL__ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2fa783d..91b57db 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1480,6 +1480,9 @@ struct task_struct {
 	/* bitmask of trace recursion */
 	unsigned long trace_recursion;
 #endif /* CONFIG_TRACING */
+#ifdef CONFIG_CHECKPOINT
+	struct ckpt_ctx *checkpoint_ctx;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
@@ -1694,6 +1697,7 @@ extern cputime_t task_gtime(struct task_struct *p);
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
 #define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
+#define PF_RESTARTING	0x08000000	/* Process is restarting (c/r) */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezeable */
diff --git a/kernel/exit.c b/kernel/exit.c
index 869dc22..912b1fa 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -49,6 +49,7 @@
 #include <linux/init_task.h>
 #include <linux/perf_counter.h>
 #include <trace/events/sched.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -992,6 +993,10 @@ NORET_TYPE void do_exit(long code)
 	if (unlikely(current->pi_state_cache))
 		kfree(current->pi_state_cache);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	if (unlikely(tsk->checkpoint_ctx))
+		exit_checkpoint(tsk);
+#endif
 	/*
 	 * Make sure we are holding no locks:
 	 */
diff --git a/kernel/fork.c b/kernel/fork.c
index f5a0cef..57118e4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -62,6 +62,7 @@
 #include <linux/fs_struct.h>
 #include <linux/magic.h>
 #include <linux/perf_counter.h>
+#include <linux/checkpoint.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1185,6 +1186,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Need tasklist lock for parent etc handling! */
 	write_lock_irq(&tasklist_lock);
 
+#ifdef CONFIG_CHECKPOINT
+	/* If parent is restarting, child should be too */
+	if (unlikely(current->checkpoint_ctx)) {
+		p->checkpoint_ctx = current->checkpoint_ctx;
+		ckpt_ctx_get(p->checkpoint_ctx);
+	}
+#endif
 	/*
 	 * The task hasn't been attached yet, so its cpus_allowed mask will
 	 * not be changed, nor will its assigned CPU.
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 26/80] c/r: restart multiple processes
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task will restart
itself, in the same order in which the tasks were saved. The internals of
the syscall take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

There is one special task - the coordinator - that is not part of the
restarted hierarchy. The coordinator task allocates the restart
context (ctx) and orchestrates the restart. Thus even if a restart
fails after, or during the restore of the root task, the user
perceives a clean exit and an error message.

The coordinator task will:
 1) read header and tree, create @ctx (wake up restarting tasks)
 2) set the ->checkpoint_ctx field of itself and all descendants
 3) wait for all restarting tasks to reach sync point #1
 4) activate first restarting task (root task)
 5) wait for all other tasks to complete and reach sync point #3
 6) wake up everybody

(Note that in step #2 the coordinator assumes that the entire task
hierarchy exists by the time it enters sys_restart; this is arranged
in user space by 'mktree')

A restarting task has three sync points:
 1) wait for its ->checkpoint_ctx to be set (by the coordinator)
 2) wait for the task's turn to restore (be active)
 [...now the task restores its state...]
 3) wait for all other tasks to complete

The third sync point ensures that a task may only resume execution
after all tasks have successfully restored their state (or fail if an
error has occurred). This prevents tasks from returning to user space
prematurely, before the entire restart completes.

If a single task wishes to restart, it can set the "RESTART_TASKSELF"
flag to restart(2) to skip the logic of the coordinator.

The root-task is a child of the coordinator, identified by the @pid
given to sys_restart() in the pid-ns of the coordinator. Restarting
tasks that aren't the coordinator should set the @pid argument of
the restart(2) syscall to zero.

All tasks explicitly test for an error flag on the checkpoint context
when they wake up from sync points.  If an error occurs during the
restart of some task, it will mark the @ctx with an error flag, and
wake up the other tasks.

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (@ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.

Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In this case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Changelog[v18]:
  - Fix race of prepare_descendants() with an ongoing fork()
  - Track and report the first error if restart fails
  - Tighten logic to protect against bogus pids in input
  - [Matt Helsley] Improve debug output from ckpt_notify_error()
Changelog[v17]:
  - Add uflag RESTART_FROZEN to freeze tasks after restart
  - Fix restore_retval() and use only for restarting tasks
  - Coordinator converts -ERESTART... to -EINTR
  - Coordinator marks and sets descendants' ->checkpoint_ctx
  - Coordinator properly detects errors when woken up from wait
  - Fix race where root_task could kick start too early
  - Add a sync point for restarting tasks
  - Multiple fixes to restart logic
Changelog[v14]:
  - Revert change to pr_debug(), back to ckpt_debug()
  - Discard field 'h.parent'
  - Check whether calls to ckpt_hbuf_get() fail
Changelog[v13]:
  - Clear root_task->checkpoint_ctx regardless of error condition
  - Remove unused argument 'ctx' from do_restore_task() prototype
  - Remove unused member 'pids_err' from 'struct ckpt_ctx'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/restart.c             |  481 ++++++++++++++++++++++++++++++++++++--
 checkpoint/sys.c                 |   33 ++-
 include/linux/checkpoint.h       |   32 +++
 include/linux/checkpoint_types.h |   16 ++-
 include/linux/sched.h            |    4 +
 kernel/exit.c                    |    5 +
 kernel/fork.c                    |    8 +
 7 files changed, 549 insertions(+), 30 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 3f22403..4da09b7 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -13,7 +13,10 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
+#include <linux/ptrace.h>
+#include <linux/freezer.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
 #include <asm/syscall.h>
@@ -413,6 +416,434 @@ static int restore_read_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* restore_read_tree - read the tasks tree into the checkpoint context */
+static int restore_read_tree(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tree *h;
+	int size, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TREE);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->nr_tasks <= 0)
+		goto out;
+
+	ctx->nr_pids = h->nr_tasks;
+	size = sizeof(*ctx->pids_arr) * ctx->nr_pids;
+	if (size <= 0)		/* overflow ? */
+		goto out;
+
+	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+	if (!ctx->pids_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = _ckpt_read_buffer(ctx, ctx->pids_arr, size);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static inline pid_t get_active_pid(struct ckpt_ctx *ctx)
+{
+	int active = ctx->active_pid;
+	return active >= 0 ? ctx->pids_arr[active].vpid : 0;
+}
+
+static inline int is_task_active(struct ckpt_ctx *ctx, pid_t pid)
+{
+	return get_active_pid(ctx) == pid;
+}
+
+static inline void _restore_notify_error(struct ckpt_ctx *ctx, int errno)
+{
+	ckpt_set_ctx_error(ctx, errno);
+	complete(&ctx->complete);
+}
+
+/* Need to call ckpt_debug such that it will get the correct source location */
+#define restore_notify_error(ctx, errno) \
+do { \
+	ckpt_debug("ctx root pid %d err %d", ctx->root_pid, errno); \
+	_restore_notify_error(ctx, errno); \
+} while(0)
+
+static int restore_activate_next(struct ckpt_ctx *ctx)
+{
+	struct task_struct *task;
+	int active;
+	pid_t pid;
+
+	active = ++ctx->active_pid;
+	if (active >= ctx->nr_pids) {
+		complete(&ctx->complete);
+		return 0;
+	}
+
+	pid = get_active_pid(ctx);
+	ckpt_debug("active pid %d (%d < %d)\n", pid, active, ctx->nr_pids);
+
+	rcu_read_lock();
+	task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
+	/* target task must have same restart context */
+	if (task && task->checkpoint_ctx == ctx)
+		wake_up_process(task);
+	else
+		task = NULL;
+	rcu_read_unlock();
+
+	if (!task) {
+		restore_notify_error(ctx, -ESRCH);
+		return -ESRCH;
+	}
+
+	return 0;
+}
+
+static int wait_task_active(struct ckpt_ctx *ctx)
+{
+	pid_t pid = task_pid_vnr(current);
+	int ret;
+
+	ckpt_debug("pid %d waiting\n", pid);
+	ret = wait_event_interruptible(ctx->waitq,
+				       is_task_active(ctx, pid) ||
+				       ckpt_test_ctx_error(ctx));
+	if (!ret && ckpt_test_ctx_error(ctx)) {
+		force_sig(SIGKILL, current);
+		ret = -EBUSY;
+	}
+	return ret;
+}
+
+static int wait_task_sync(struct ckpt_ctx *ctx)
+{
+	ckpt_debug("pid %d syncing\n", task_pid_vnr(current));
+	wait_event_interruptible(ctx->waitq, ckpt_test_ctx_complete(ctx));
+	if (ckpt_test_ctx_error(ctx)) {
+		force_sig(SIGKILL, current);
+		return -EBUSY;
+	}
+	return 0;
+}
+
+static int do_restore_task(void)
+{
+	DECLARE_WAIT_QUEUE_HEAD(waitq);
+	struct ckpt_ctx *ctx, *old_ctx;
+	int ret;
+
+	/*
+	 * Wait for coordinator to become visible, then grab a
+	 * reference to its restart context.
+	 */
+	ret = wait_event_interruptible(waitq, current->checkpoint_ctx);
+	if (ret < 0)
+		return ret;
+
+	ctx = xchg(&current->checkpoint_ctx, NULL);
+	if (!ctx)
+		return -EAGAIN;
+	ckpt_ctx_get(ctx);
+
+	/*
+	 * Put the @ctx back on our task_struct. If an ancestor tried
+	 * to prepare_descendants() on us (although extremely unlikely)
+	 * we will encounter the ctx that he xchg()ed there and bail.
+	 */
+	old_ctx = xchg(&current->checkpoint_ctx, ctx);
+	if (old_ctx) {
+		ckpt_debug("self-set of checkpoint_ctx failed\n");
+
+		/* alert coordinator of unexpected ctx */
+		restore_notify_error(old_ctx, -EAGAIN);
+		ckpt_ctx_put(old_ctx);
+
+		/* alert our coordinator that we bail */
+		restore_notify_error(ctx, -EAGAIN);
+		ckpt_ctx_put(ctx);
+		return -EAGAIN;
+	}
+
+	/* wait for our turn, do the restore, and tell next task in line */
+	ret = wait_task_active(ctx);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_task(ctx);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_activate_next(ctx);
+	if (ret < 0)
+		goto out;
+
+	ret = wait_task_sync(ctx);
+ out:
+	old_ctx = xchg(&current->checkpoint_ctx, NULL);
+	if (old_ctx)
+		ckpt_ctx_put(old_ctx);
+
+	/* if we're first to fail - notify others */
+	if (ret < 0 && !ckpt_test_ctx_error(ctx)) {
+		restore_notify_error(ctx, ret);
+		wake_up_all(&ctx->waitq);
+	}
+
+	ckpt_ctx_put(ctx);
+	return ret;
+}
+
+/**
+ * prepare_descendants - set ->checkpoint_ctx of all descendants
+ * @ctx: checkpoint context
+ * @root: root process for restart
+ *
+ * Called by the coordinator to set the ->checkpoint_ctx pointer of the
+ * root task and all its descendants.
+ */
+static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
+{
+	struct task_struct *leader = root;
+	struct task_struct *parent = NULL;
+	struct task_struct *task = root;
+	struct ckpt_ctx *old_ctx;
+	int nr_pids = ctx->nr_pids;
+	int ret = 0;
+
+	read_lock(&tasklist_lock);
+	while (nr_pids) {
+		ckpt_debug("consider task %d\n", task_pid_vnr(task));
+		if (task_ptrace(task) & PT_PTRACED) {
+			ret = -EBUSY;
+			break;
+		}
+		/*
+		 * Set task->checkpoint_ctx of all non-zombie descendants.
+		 * If a descendant already has a ->checkpoint_ctx, it
+		 * must be a coordinator (for a different restart ?) so
+		 * we fail.
+		 *
+		 * Note that own ancestors cannot interfere since they
+		 * won't descend past us, as own ->checkpoint_ctx must
+		 * already be set.
+		 */
+		if (!task->exit_state) {
+			ckpt_ctx_get(ctx);
+			old_ctx = xchg(&task->checkpoint_ctx, ctx);
+			if (old_ctx) {
+				ckpt_debug("bad task %d\n",task_pid_vnr(task));
+				ckpt_ctx_put(old_ctx);
+				ret = -EAGAIN;
+				break;
+			}
+			ckpt_debug("prepare task %d\n", task_pid_vnr(task));
+			wake_up_process(task);
+			nr_pids--;
+		}
+
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+		if (task == root) {
+			/* in case root task is multi-threaded */
+			root = task = next_thread(task);
+			if (root == leader)
+				break;
+		}
+	}
+	read_unlock(&tasklist_lock);
+	ckpt_debug("left %d ret %d root/task %d\n", nr_pids, ret, task == root);
+
+	/* fail unless number of processes matches */
+	if (!ret && (nr_pids || task != root))
+		ret = -ESRCH;
+
+	return ret;
+}
+
+static int wait_all_tasks_finish(struct ckpt_ctx *ctx)
+{
+	int ret;
+
+	init_completion(&ctx->complete);
+
+	BUG_ON(ctx->active_pid != -1);
+	ret = restore_activate_next(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+
+	ckpt_debug("final sync kflags %#lx\n", ctx->kflags);
+	/*
+	 * Usually when restart fails, the restarting task will first
+	 * set @ctx->errno before waking us up. In the rare event that
+	 * @ctx->errno is unset, we must have been interrupted and
+	 * then checked for an error prior to ctx->errno update...
+	 */
+	if (ckpt_test_ctx_error(ctx))
+		ret = ctx->errno ? ctx->errno : -EINTR;
+
+	return ret;
+}
+
+static struct task_struct *choose_root_task(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task;
+
+	if (ctx->uflags & RESTART_TASKSELF) {
+		ctx->root_pid = pid;
+		ctx->root_task = current;
+		get_task_struct(current);
+		return current;
+	}
+
+	read_lock(&tasklist_lock);
+	list_for_each_entry(task, &current->children, sibling) {
+		if (task_pid_vnr(task) == pid) {
+			get_task_struct(task);
+			ctx->root_task = task;
+			ctx->root_pid = pid;
+			break;
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	return task;
+}
+
+/* setup restart-specific parts of ctx */
+static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct nsproxy *nsproxy;
+
+	/*
+	 * No need for explicit cleanup here, because if an error
+	 * occurs then ckpt_ctx_free() is eventually called.
+	 */
+
+	ctx->root_task = choose_root_task(ctx, pid);
+	if (!ctx->root_task)
+		return -ESRCH;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(ctx->root_task);
+	if (nsproxy) {
+		get_nsproxy(nsproxy);
+		ctx->root_nsproxy = nsproxy;
+	}
+	rcu_read_unlock();
+	if (!nsproxy)
+		return -ESRCH;
+
+	ctx->active_pid = -1;	/* see restore_activate_next, get_active_pid */
+
+	return 0;
+}
+
+static int do_restore_coord(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct ckpt_ctx *old_ctx;
+	int ret;
+
+	ret = restore_read_header(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_read_tree(ctx);
+	if (ret < 0)
+		return ret;
+
+	if ((ctx->uflags & RESTART_TASKSELF) && ctx->nr_pids != 1)
+		return -EINVAL;
+
+	ret = init_restart_ctx(ctx, pid);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Populate own ->checkpoint_ctx: if an ancestor attempts to
+	 * prepare_descendants() on us, it will fail. Furthermore,
+	 * that ancestor won't proceed deeper to interfere with our
+	 * descendants that are restarting (e.g. by xchg()ing their
+	 * ->checkpoint_ctx pointer temporarily).
+	 */
+	ckpt_ctx_get(ctx);
+	old_ctx = xchg(&current->checkpoint_ctx, ctx);
+	if (old_ctx) {
+		/*
+		 * We are a bad-behaving descendant: an ancestor must
+		 * have done prepare_descendants() on us as part of a
+		 * restart. Oh, well ... alert ancestor (coordinator)
+		 * with an error on @old_ctx.
+		 */
+		ckpt_debug("bad behaving checkpoint_ctx\n");
+		restore_notify_error(old_ctx, -EBUSY);
+		ckpt_ctx_put(old_ctx);
+		ret = -EBUSY;
+		goto out;
+	}
+
+	if (ctx->uflags & RESTART_TASKSELF) {
+		ret = restore_task(ctx);
+		if (ret < 0)
+			goto out;
+	} else {
+		/* prepare descendants' t->checkpoint_ctx point to coord */
+		ret = prepare_descendants(ctx, ctx->root_task);
+		if (ret < 0)
+			goto out;
+		/* wait for all other tasks to complete do_restore_task() */
+		ret = wait_all_tasks_finish(ctx);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_read_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	if (ctx->uflags & RESTART_FROZEN) {
+		ret = cgroup_freezer_make_frozen(ctx->root_task);
+		ckpt_debug("freezing restart tasks ... %d\n", ret);
+	}
+ out:
+	if (ret < 0)
+		ckpt_set_ctx_error(ctx, ret);
+	else
+		ckpt_set_ctx_success(ctx);
+
+	if (!(ctx->uflags & RESTART_TASKSELF))
+		wake_up_all(&ctx->waitq);
+	/*
+	 * If an ancestor attempts to prepare_descendants() on us, it
+	 * xchg()s our ->checkpoint_ctx, and frees it. Our @ctx will,
+	 * instead, point to the ctx that said ancestor placed.
+	 */
+	ctx = xchg(&current->checkpoint_ctx, NULL);
+	ckpt_ctx_put(ctx);
+
+	return ret;
+}
+
 static long restore_retval(void)
 {
 	struct pt_regs *regs = task_pt_regs(current);
@@ -461,28 +892,40 @@ static long restore_retval(void)
 	return ret;
 }
 
-/* setup restart-specific parts of ctx */
-static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
-{
-	return 0;
-}
-
 long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
 	long ret;
 
-	ret = init_restart_ctx(ctx, pid);
-	if (ret < 0)
-		return ret;
-	ret = restore_read_header(ctx);
-	if (ret < 0)
-		return ret;
-	ret = restore_task(ctx);
-	if (ret < 0)
-		return ret;
-	ret = restore_read_tail(ctx);
-	if (ret < 0)
-		return ret;
+	if (ctx)
+		ret = do_restore_coord(ctx, pid);
+	else
+		ret = do_restore_task();
 
-	return restore_retval();
+	/* restart(2) isn't idempotent: should not be auto-restarted */
+	if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+	    ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
+		ret = -EINTR;
+
+	/*
+	 * The retval from what we return to the caller when all goes
+	 * well: this is either the retval from the original syscall
+	 * that was interrupted during checkpoint, or the contents of
+	 * (saved) eax if the task was in userspace.
+	 *
+	 * The coordinator (ctx!=NULL) is exempt: don't adjust its retval.
+	 * But in self-restart (where RESTART_TASKSELF), the coordinator
+	 * _itself_ is a restarting task.
+	 */
+
+	if (!ctx || (ctx->uflags & RESTART_TASKSELF)) {
+		if (ret < 0) {
+			/* partial restore is undefined: terminate */
+			ckpt_debug("restart err %ld, exiting\n", ret);
+			force_sig(SIGKILL, current);
+		} else {
+			ret = restore_retval();
+		}
+	}
+
+	return ret;
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index cc94775..c8921f0 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -189,6 +189,8 @@ static void task_arr_free(struct ckpt_ctx *ctx)
 
 static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
+	BUG_ON(atomic_read(&ctx->refcount));
+
 	if (ctx->file)
 		fput(ctx->file);
 
@@ -202,6 +204,8 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->root_freezer)
 		put_task_struct(ctx->root_freezer);
 
+	kfree(ctx->pids_arr);
+
 	kfree(ctx);
 }
 
@@ -219,17 +223,32 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	ctx->kflags = kflags;
 	ctx->ktime_begin = ktime_get();
 
+	atomic_set(&ctx->refcount, 0);
+	init_waitqueue_head(&ctx->waitq);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
 		goto err;
 
+	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
 	ckpt_ctx_free(ctx);
 	return ERR_PTR(err);
 }
 
+void ckpt_ctx_get(struct ckpt_ctx *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+void ckpt_ctx_put(struct ckpt_ctx *ctx)
+{
+	if (ctx && atomic_dec_and_test(&ctx->refcount))
+		ckpt_ctx_free(ctx);
+}
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -261,7 +280,7 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
 	if (!ret)
 		ret = ctx->crid;
 
-	ckpt_ctx_free(ctx);
+	ckpt_ctx_put(ctx);
 	return ret;
 }
 
@@ -280,24 +299,20 @@ SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 	long ret;
 
 	/* no flags for now */
-	if (flags)
+	if (flags & ~RESTART_USER_FLAGS)
 		return -EINVAL;
 
 	if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
+	if (pid)
+		ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
 
 	ret = do_restart(ctx, pid);
 
-	/* restart(2) isn't idempotent: can't restart syscall */
-	if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
-	    ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
-		ret = -EINTR;
-
-	ckpt_ctx_free(ctx);
+	ckpt_ctx_put(ctx);
 	return ret;
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index a4650bb..4227b31 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -15,6 +15,10 @@
 /* checkpoint user flags */
 #define CHECKPOINT_SUBTREE	0x1
 
+/* restart user flags */
+#define RESTART_TASKSELF	0x1
+#define RESTART_FROZEN		0x2
+
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
 
@@ -25,13 +29,19 @@
 /* ckpt_ctx: kflags */
 #define CKPT_CTX_CHECKPOINT_BIT		0
 #define CKPT_CTX_RESTART_BIT		1
+#define CKPT_CTX_SUCCESS_BIT		2
+#define CKPT_CTX_ERROR_BIT		3
 
 #define CKPT_CTX_CHECKPOINT	(1 << CKPT_CTX_CHECKPOINT_BIT)
 #define CKPT_CTX_RESTART	(1 << CKPT_CTX_RESTART_BIT)
+#define CKPT_CTX_SUCCESS	(1 << CKPT_CTX_SUCCESS_BIT)
+#define CKPT_CTX_ERROR		(1 << CKPT_CTX_ERROR_BIT)
 
 /* ckpt_ctx: uflags */
 #define CHECKPOINT_USER_FLAGS		CHECKPOINT_SUBTREE
+#define RESTART_USER_FLAGS		(RESTART_TASKSELF | RESTART_FROZEN)
 
+extern void exit_checkpoint(struct task_struct *tsk);
 
 extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
 extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
@@ -60,6 +70,28 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+/* ckpt kflags */
+#define ckpt_set_ctx_kflag(__ctx, __kflag)  \
+	set_bit(__kflag##_BIT, &(__ctx)->kflags)
+#define ckpt_test_and_set_ctx_kflag(__ctx, __kflag)  \
+	test_and_set_bit(__kflag##_BIT, &(__ctx)->kflags)
+
+#define ckpt_set_ctx_success(ctx)  ckpt_set_ctx_kflag(ctx, CKPT_CTX_SUCCESS)
+
+static inline void ckpt_set_ctx_error(struct ckpt_ctx *ctx, int errno)
+{
+	if (!ckpt_test_and_set_ctx_kflag(ctx, CKPT_CTX_ERROR))
+		ctx->errno = errno;
+}
+
+#define ckpt_test_ctx_error(ctx)  \
+	((ctx)->kflags & CKPT_CTX_ERROR)
+#define ckpt_test_ctx_complete(ctx)  \
+	((ctx)->kflags & (CKPT_CTX_SUCCESS | CKPT_CTX_ERROR))
+
+extern void ckpt_ctx_get(struct ckpt_ctx *ctx);
+extern void ckpt_ctx_put(struct ckpt_ctx *ctx);
+
 extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
 extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
 
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index c3399b3..f74deac 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -16,6 +16,7 @@
 #include <linux/nsproxy.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
+#include <linux/wait.h>
 
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
@@ -35,11 +36,22 @@ struct ckpt_ctx {
 	struct file *file;	/* input/output file */
 	int total;		/* total read/written */
 
-	struct task_struct **tasks_arr;	/* array of all tasks in container */
-	int nr_tasks;			/* size of tasks array */
+	atomic_t refcount;
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
+	int errno;		/* restart: errno that caused failure */
+
+	/* [multi-process checkpoint] */
+	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+	int nr_tasks;                   /* size of tasks array */
+
+	/* [multi-process restart] */
+	struct ckpt_pids *pids_arr;	/* array of all pids [restart] */
+	int nr_pids;			/* size of pids array */
+	int active_pid;			/* (next) position in pids array */
+	struct completion complete;	/* container root and other tasks on */
+	wait_queue_head_t waitq;	/* start, end, and restart ordering */
 };
 
 #endif /* __KERNEL__ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2fa783d..91b57db 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1480,6 +1480,9 @@ struct task_struct {
 	/* bitmask of trace recursion */
 	unsigned long trace_recursion;
 #endif /* CONFIG_TRACING */
+#ifdef CONFIG_CHECKPOINT
+	struct ckpt_ctx *checkpoint_ctx;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
@@ -1694,6 +1697,7 @@ extern cputime_t task_gtime(struct task_struct *p);
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
 #define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
+#define PF_RESTARTING	0x08000000	/* Process is restarting (c/r) */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezeable */
diff --git a/kernel/exit.c b/kernel/exit.c
index 869dc22..912b1fa 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -49,6 +49,7 @@
 #include <linux/init_task.h>
 #include <linux/perf_counter.h>
 #include <trace/events/sched.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -992,6 +993,10 @@ NORET_TYPE void do_exit(long code)
 	if (unlikely(current->pi_state_cache))
 		kfree(current->pi_state_cache);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	if (unlikely(tsk->checkpoint_ctx))
+		exit_checkpoint(tsk);
+#endif
 	/*
 	 * Make sure we are holding no locks:
 	 */
diff --git a/kernel/fork.c b/kernel/fork.c
index f5a0cef..57118e4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -62,6 +62,7 @@
 #include <linux/fs_struct.h>
 #include <linux/magic.h>
 #include <linux/perf_counter.h>
+#include <linux/checkpoint.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1185,6 +1186,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Need tasklist lock for parent etc handling! */
 	write_lock_irq(&tasklist_lock);
 
+#ifdef CONFIG_CHECKPOINT
+	/* If parent is restarting, child should be too */
+	if (unlikely(current->checkpoint_ctx)) {
+		p->checkpoint_ctx = current->checkpoint_ctx;
+		ckpt_ctx_get(p->checkpoint_ctx);
+	}
+#endif
 	/*
 	 * The task hasn't been attached yet, so its cpus_allowed mask will
 	 * not be changed, nor will its assigned CPU.
-- 
1.6.0.4



* [PATCH v18 26/80] c/r: restart multiple processes
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task will restart
itself in the same order that they were saved. The internals of the
syscall will take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

There is one special task - the coordinator - that is not part of the
restarted hierarchy. The coordinator task allocates the restart
context (ctx) and orchestrates the restart. Thus even if a restart
fails after, or during the restore of the root task, the user
perceives a clean exit and an error message.

The coordinator task will:
 1) read header and tree, create @ctx (wake up restarting tasks)
 2) set the ->checkpoint_ctx field of itself and all descendants
 3) wait for all restarting tasks to reach sync point #1
 4) activate first restarting task (root task)
 5) wait for all other tasks to complete and reach sync point #3
 6) wake up everybody

(Note that in step #2 the coordinator assumes that the entire task
hierarchy exists by the time it enters sys_restart; this is arranged
in user space by 'mktree')
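
Step #2 can be pictured with a small userspace sketch of a depth-first
walk that marks a root and all of its descendants and fails unless the
number of marked nodes matches the expected count. This is an
illustration under simplifying assumptions, not the kernel code: the
demo_node structure and demo_* functions are hypothetical, plain pointers
replace the kernel's children/sibling lists, and there is no locking,
ptrace check, or thread handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task in the hierarchy. */
struct demo_node {
	struct demo_node *parent;
	struct demo_node *first_child;
	struct demo_node *next_sibling;
	int marked;	/* stands in for a non-NULL ->checkpoint_ctx */
};

/* Link child under parent (newest child goes first). */
static void demo_add_child(struct demo_node *parent, struct demo_node *child)
{
	child->parent = parent;
	child->next_sibling = parent->first_child;
	parent->first_child = child;
}

/*
 * Mark root and all of its descendants, walking depth-first without
 * recursion. Returns 0 if exactly 'expected' nodes were marked, or -1
 * if the tree is smaller or larger than expected, or if a node was
 * already marked (e.g. it belongs to another restart).
 */
static int demo_prepare_descendants(struct demo_node *root, int expected)
{
	struct demo_node *node = root;
	int left = expected;

	assert(root != NULL);
	for (;;) {
		if (node->marked || left == 0)
			return -1;
		node->marked = 1;
		left--;

		/* if the node has children, proceed with the first child */
		if (node->first_child) {
			node = node->first_child;
			continue;
		}
		/* climb until a sibling is found or we are back at root */
		while (node != root && !node->next_sibling)
			node = node->parent;
		if (node == root)
			break;
		node = node->next_sibling;
	}
	/* fail unless the number of nodes matches */
	return left == 0 ? 0 : -1;
}
```

As in the description above, the walk assumes the whole hierarchy already
exists when it starts; a count mismatch in either direction is reported
as an error rather than repaired.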

A restarting task has three sync points:
 1) wait for its ->checkpoint_ctx to be set (by the coordinator)
 2) wait for the task's turn to restore (be active)
 [...now the task restores its state...]
 3) wait for all other tasks to complete

The third sync point ensures that a task may only resume execution
after all tasks have successfully restored their state (or fail if an
error has occurred). This prevents tasks from returning to user space
prematurely, before the entire restart completes.

If a single task wishes to restart, it can set the "RESTART_TASKSELF"
flag to restart(2) to skip the logic of the coordinator.

The root-task is a child of the coordinator, identified by the @pid
given to sys_restart() in the pid-ns of the coordinator. Restarting
tasks that aren't the coordinator should set the @pid argument of
the restart(2) syscall to zero.

All tasks explicitly test for an error flag on the checkpoint context
when they wake up from sync points.  If an error occurs during the
restart of some task, it will mark the @ctx with an error flag, and
wake up the other tasks.
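
One way to picture the error flag, together with the changelog item
"Track and report the first error", is the following userspace sketch
using C11 atomics. It is an assumption-laden illustration rather than the
patch's implementation: demo_err_ctx and the demo_* helpers are
hypothetical names, and atomic_exchange() stands in for the kernel's
test_and_set_bit().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the error state kept in the restart context. */
struct demo_err_ctx {
	atomic_int error;	/* 0 = no error; stands in for CKPT_CTX_ERROR */
	int first_errno;	/* meaningful only once 'error' is set */
};

static void demo_err_init(struct demo_err_ctx *ctx)
{
	atomic_init(&ctx->error, 0);
	ctx->first_errno = 0;
}

/*
 * Flag an error on the context. Only the first caller records its error
 * code; later callers leave it untouched. Returns 1 if this call was the
 * first to flag an error, 0 otherwise.
 *
 * Note the small window between setting the flag and storing the code:
 * an observer may see the flag before first_errno is written.
 */
static int demo_set_error(struct demo_err_ctx *ctx, int err)
{
	assert(err < 0);	/* errors are negative errno values */
	if (atomic_exchange(&ctx->error, 1) == 0) {
		ctx->first_errno = err;
		return 1;
	}
	return 0;
}

/* Checked by every task when it wakes up from a sync point. */
static int demo_test_error(struct demo_err_ctx *ctx)
{
	return atomic_load(&ctx->error);
}
```

If one task fails with -ESRCH and a second task then fails with -EBUSY,
the context keeps -ESRCH, so the caller sees the original cause rather
than a follow-on failure.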

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (@ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.

Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In this case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Changelog[v18]:
  - Fix race of prepare_descendants() with an ongoing fork()
  - Track and report the first error if restart fails
  - Tighten logic to protect against bogus pids in input
  - [Matt Helsley] Improve debug output from ckpt_notify_error()
Changelog[v17]:
  - Add uflag RESTART_FROZEN to freeze tasks after restart
  - Fix restore_retval() and use only for restarting tasks
  - Coordinator converts -ERESTART... to -EINTR
  - Coordinator marks and sets descendants' ->checkpoint_ctx
  - Coordinator properly detects errors when woken up from wait
  - Fix race where root_task could kick start too early
  - Add a sync point for restarting tasks
  - Multiple fixes to restart logic
Changelog[v14]:
  - Revert change to pr_debug(), back to ckpt_debug()
  - Discard field 'h.parent'
  - Check whether calls to ckpt_hbuf_get() fail
Changelog[v13]:
  - Clear root_task->checkpoint_ctx regardless of error condition
  - Remove unused argument 'ctx' from do_restore_task() prototype
  - Remove unused member 'pids_err' from 'struct ckpt_ctx'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/restart.c             |  481 ++++++++++++++++++++++++++++++++++++--
 checkpoint/sys.c                 |   33 ++-
 include/linux/checkpoint.h       |   32 +++
 include/linux/checkpoint_types.h |   16 ++-
 include/linux/sched.h            |    4 +
 kernel/exit.c                    |    5 +
 kernel/fork.c                    |    8 +
 7 files changed, 549 insertions(+), 30 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 3f22403..4da09b7 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -13,7 +13,10 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
+#include <linux/ptrace.h>
+#include <linux/freezer.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
 #include <asm/syscall.h>
@@ -413,6 +416,434 @@ static int restore_read_tail(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* restore_read_tree - read the tasks tree into the checkpoint context */
+static int restore_read_tree(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tree *h;
+	int size, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TREE);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->nr_tasks <= 0)
+		goto out;
+
+	ctx->nr_pids = h->nr_tasks;
+	size = sizeof(*ctx->pids_arr) * ctx->nr_pids;
+	if (size <= 0)		/* overflow ? */
+		goto out;
+
+	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+	if (!ctx->pids_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = _ckpt_read_buffer(ctx, ctx->pids_arr, size);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static inline pid_t get_active_pid(struct ckpt_ctx *ctx)
+{
+	int active = ctx->active_pid;
+	return active >= 0 ? ctx->pids_arr[active].vpid : 0;
+}
+
+static inline int is_task_active(struct ckpt_ctx *ctx, pid_t pid)
+{
+	return get_active_pid(ctx) == pid;
+}
+
+static inline void _restore_notify_error(struct ckpt_ctx *ctx, int errno)
+{
+	ckpt_set_ctx_error(ctx, errno);
+	complete(&ctx->complete);
+}
+
+/* Need to call ckpt_debug such that it will get the correct source location */
+#define restore_notify_error(ctx, errno) \
+do { \
+	ckpt_debug("ctx root pid %d err %d", ctx->root_pid, errno); \
+	_restore_notify_error(ctx, errno); \
+} while(0)
+
+static int restore_activate_next(struct ckpt_ctx *ctx)
+{
+	struct task_struct *task;
+	int active;
+	pid_t pid;
+
+	active = ++ctx->active_pid;
+	if (active >= ctx->nr_pids) {
+		complete(&ctx->complete);
+		return 0;
+	}
+
+	pid = get_active_pid(ctx);
+	ckpt_debug("active pid %d (%d < %d)\n", pid, active, ctx->nr_pids);
+
+	rcu_read_lock();
+	task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
+	/* target task must have same restart context */
+	if (task && task->checkpoint_ctx == ctx)
+		wake_up_process(task);
+	else
+		task = NULL;
+	rcu_read_unlock();
+
+	if (!task) {
+		restore_notify_error(ctx, -ESRCH);
+		return -ESRCH;
+	}
+
+	return 0;
+}
+
+static int wait_task_active(struct ckpt_ctx *ctx)
+{
+	pid_t pid = task_pid_vnr(current);
+	int ret;
+
+	ckpt_debug("pid %d waiting\n", pid);
+	ret = wait_event_interruptible(ctx->waitq,
+				       is_task_active(ctx, pid) ||
+				       ckpt_test_ctx_error(ctx));
+	if (!ret && ckpt_test_ctx_error(ctx)) {
+		force_sig(SIGKILL, current);
+		ret = -EBUSY;
+	}
+	return ret;
+}
+
+static int wait_task_sync(struct ckpt_ctx *ctx)
+{
+	ckpt_debug("pid %d syncing\n", task_pid_vnr(current));
+	wait_event_interruptible(ctx->waitq, ckpt_test_ctx_complete(ctx));
+	if (ckpt_test_ctx_error(ctx)) {
+		force_sig(SIGKILL, current);
+		return -EBUSY;
+	}
+	return 0;
+}
+
+static int do_restore_task(void)
+{
+	DECLARE_WAIT_QUEUE_HEAD(waitq);
+	struct ckpt_ctx *ctx, *old_ctx;
+	int ret;
+
+	/*
+	 * Wait for coordinator to become visible, then grab a
+	 * reference to its restart context.
+	 */
+	ret = wait_event_interruptible(waitq, current->checkpoint_ctx);
+	if (ret < 0)
+		return ret;
+
+	ctx = xchg(&current->checkpoint_ctx, NULL);
+	if (!ctx)
+		return -EAGAIN;
+	ckpt_ctx_get(ctx);
+
+	/*
+	 * Put the @ctx back on our task_struct. If an ancestor tried
+	 * to prepare_descendants() on us (although extremely unlikely)
+	 * we will encounter the ctx that it xchg()ed there and bail.
+	 */
+	old_ctx = xchg(&current->checkpoint_ctx, ctx);
+	if (old_ctx) {
+		ckpt_debug("self-set of checkpoint_ctx failed\n");
+
+		/* alert coordinator of unexpected ctx */
+		restore_notify_error(old_ctx, -EAGAIN);
+		ckpt_ctx_put(old_ctx);
+
+		/* alert our coordinator that we bail */
+		restore_notify_error(ctx, -EAGAIN);
+		ckpt_ctx_put(ctx);
+		return -EAGAIN;
+	}
+
+	/* wait for our turn, do the restore, and tell next task in line */
+	ret = wait_task_active(ctx);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_task(ctx);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_activate_next(ctx);
+	if (ret < 0)
+		goto out;
+
+	ret = wait_task_sync(ctx);
+ out:
+	old_ctx = xchg(&current->checkpoint_ctx, NULL);
+	if (old_ctx)
+		ckpt_ctx_put(old_ctx);
+
+	/* if we're first to fail - notify others */
+	if (ret < 0 && !ckpt_test_ctx_error(ctx)) {
+		restore_notify_error(ctx, ret);
+		wake_up_all(&ctx->waitq);
+	}
+
+	ckpt_ctx_put(ctx);
+	return ret;
+}
+
+/**
+ * prepare_descendants - set ->checkpoint_ctx of all descendants
+ * @ctx: checkpoint context
+ * @root: root process for restart
+ *
+ * Called by the coordinator to set the ->checkpoint_ctx pointer of the
+ * root task and all its descendants.
+ */
+static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
+{
+	struct task_struct *leader = root;
+	struct task_struct *parent = NULL;
+	struct task_struct *task = root;
+	struct ckpt_ctx *old_ctx;
+	int nr_pids = ctx->nr_pids;
+	int ret = 0;
+
+	read_lock(&tasklist_lock);
+	while (nr_pids) {
+		ckpt_debug("consider task %d\n", task_pid_vnr(task));
+		if (task_ptrace(task) & PT_PTRACED) {
+			ret = -EBUSY;
+			break;
+		}
+		/*
+		 * Set task->checkpoint_ctx of all non-zombie descendants.
+		 * If a descendant already has a ->checkpoint_ctx, it
+		 * must be a coordinator (for a different restart ?) so
+		 * we fail.
+		 *
+		 * Note that our own ancestors cannot interfere since they
+		 * won't descend past us, as our own ->checkpoint_ctx must
+		 * already be set.
+		 */
+		if (!task->exit_state) {
+			ckpt_ctx_get(ctx);
+			old_ctx = xchg(&task->checkpoint_ctx, ctx);
+			if (old_ctx) {
+				ckpt_debug("bad task %d\n",task_pid_vnr(task));
+				ckpt_ctx_put(old_ctx);
+				ret = -EAGAIN;
+				break;
+			}
+			ckpt_debug("prepare task %d\n", task_pid_vnr(task));
+			wake_up_process(task);
+			nr_pids--;
+		}
+
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+		if (task == root) {
+			/* in case root task is multi-threaded */
+			root = task = next_thread(task);
+			if (root == leader)
+				break;
+		}
+	}
+	read_unlock(&tasklist_lock);
+	ckpt_debug("left %d ret %d root/task %d\n", nr_pids, ret, task == root);
+
+	/* fail unless number of processes matches */
+	if (!ret && (nr_pids || task != root))
+		ret = -ESRCH;
+
+	return ret;
+}
+
+static int wait_all_tasks_finish(struct ckpt_ctx *ctx)
+{
+	int ret;
+
+	init_completion(&ctx->complete);
+
+	BUG_ON(ctx->active_pid != -1);
+	ret = restore_activate_next(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+
+	ckpt_debug("final sync kflags %#lx\n", ctx->kflags);
+	/*
+	 * Usually when restart fails, the restarting task will first
+	 * set @ctx->errno before waking us up. In the rare event that
+	 * @ctx->errno is unset, we must have been interrupted and
+	 * then checked for an error prior to ctx->errno update...
+	 */
+	if (ckpt_test_ctx_error(ctx))
+		ret = ctx->errno ? ctx->errno : -EINTR;
+
+	return ret;
+}
+
+static struct task_struct *choose_root_task(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task;
+
+	if (ctx->uflags & RESTART_TASKSELF) {
+		ctx->root_pid = pid;
+		ctx->root_task = current;
+		get_task_struct(current);
+		return current;
+	}
+
+	read_lock(&tasklist_lock);
+	list_for_each_entry(task, &current->children, sibling) {
+		if (task_pid_vnr(task) == pid) {
+			get_task_struct(task);
+			ctx->root_task = task;
+			ctx->root_pid = pid;
+			break;
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	return task;
+}
+
+/* setup restart-specific parts of ctx */
+static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct nsproxy *nsproxy;
+
+	/*
+	 * No need for explicit cleanup here, because if an error
+	 * occurs then ckpt_ctx_free() is eventually called.
+	 */
+
+	ctx->root_task = choose_root_task(ctx, pid);
+	if (!ctx->root_task)
+		return -ESRCH;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(ctx->root_task);
+	if (nsproxy) {
+		get_nsproxy(nsproxy);
+		ctx->root_nsproxy = nsproxy;
+	}
+	rcu_read_unlock();
+	if (!nsproxy)
+		return -ESRCH;
+
+	ctx->active_pid = -1;	/* see restore_activate_next, get_active_pid */
+
+	return 0;
+}
+
+static int do_restore_coord(struct ckpt_ctx *ctx, pid_t pid)
+{
+	struct ckpt_ctx *old_ctx;
+	int ret;
+
+	ret = restore_read_header(ctx);
+	if (ret < 0)
+		return ret;
+	ret = restore_read_tree(ctx);
+	if (ret < 0)
+		return ret;
+
+	if ((ctx->uflags & RESTART_TASKSELF) && ctx->nr_pids != 1)
+		return -EINVAL;
+
+	ret = init_restart_ctx(ctx, pid);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Populate own ->checkpoint_ctx: if an ancestor attempts to
+	 * prepare_descendants() on us, it will fail. Furthermore,
+	 * that ancestor won't proceed deeper to interfere with our
+	 * descendants that are restarting (e.g. by xchg()ing their
+	 * ->checkpoint_ctx pointer temporarily).
+	 */
+	ckpt_ctx_get(ctx);
+	old_ctx = xchg(&current->checkpoint_ctx, ctx);
+	if (old_ctx) {
+		/*
+		 * We are a bad-behaving descendant: an ancestor must
+		 * have done prepare_descendants() on us as part of a
+		 * restart. Oh, well ... alert ancestor (coordinator)
+		 * with an error on @old_ctx.
+		 */
+		ckpt_debug("bad behaving checkpoint_ctx\n");
+		restore_notify_error(old_ctx, -EBUSY);
+		ckpt_ctx_put(old_ctx);
+		ret = -EBUSY;
+		goto out;
+	}
+
+	if (ctx->uflags & RESTART_TASKSELF) {
+		ret = restore_task(ctx);
+		if (ret < 0)
+			goto out;
+	} else {
+		/* prepare descendants' t->checkpoint_ctx point to coord */
+		ret = prepare_descendants(ctx, ctx->root_task);
+		if (ret < 0)
+			goto out;
+		/* wait for all other tasks to complete do_restore_task() */
+		ret = wait_all_tasks_finish(ctx);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_read_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	if (ctx->uflags & RESTART_FROZEN) {
+		ret = cgroup_freezer_make_frozen(ctx->root_task);
+		ckpt_debug("freezing restart tasks ... %d\n", ret);
+	}
+ out:
+	if (ret < 0)
+		ckpt_set_ctx_error(ctx, ret);
+	else
+		ckpt_set_ctx_success(ctx);
+
+	if (!(ctx->uflags & RESTART_TASKSELF))
+		wake_up_all(&ctx->waitq);
+	/*
+	 * If an ancestor attempts to prepare_descendants() on us, it
+	 * xchg()s our ->checkpoint_ctx, and frees it. Our @ctx will,
+	 * instead, point to the ctx that said ancestor placed.
+	 */
+	ctx = xchg(&current->checkpoint_ctx, NULL);
+	ckpt_ctx_put(ctx);
+
+	return ret;
+}
+
 static long restore_retval(void)
 {
 	struct pt_regs *regs = task_pt_regs(current);
@@ -461,28 +892,40 @@ static long restore_retval(void)
 	return ret;
 }
 
-/* setup restart-specific parts of ctx */
-static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
-{
-	return 0;
-}
-
 long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 {
 	long ret;
 
-	ret = init_restart_ctx(ctx, pid);
-	if (ret < 0)
-		return ret;
-	ret = restore_read_header(ctx);
-	if (ret < 0)
-		return ret;
-	ret = restore_task(ctx);
-	if (ret < 0)
-		return ret;
-	ret = restore_read_tail(ctx);
-	if (ret < 0)
-		return ret;
+	if (ctx)
+		ret = do_restore_coord(ctx, pid);
+	else
+		ret = do_restore_task();
 
-	return restore_retval();
+	/* restart(2) isn't idempotent: should not be auto-restarted */
+	if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+	    ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
+		ret = -EINTR;
+
+	/*
+	 * The retval that we return to the caller when all goes
+	 * well: this is either the retval from the original syscall
+	 * that was interrupted during checkpoint, or the contents of
+	 * (saved) eax if the task was in userspace.
+	 *
+	 * The coordinator (ctx!=NULL) is exempt: don't adjust its retval.
+	 * But in self-restart (where RESTART_TASKSELF), the coordinator
+	 * _itself_ is a restarting task.
+	 */
+
+	if (!ctx || (ctx->uflags & RESTART_TASKSELF)) {
+		if (ret < 0) {
+			/* partial restore is undefined: terminate */
+			ckpt_debug("restart err %ld, exiting\n", ret);
+			force_sig(SIGKILL, current);
+		} else {
+			ret = restore_retval();
+		}
+	}
+
+	return ret;
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index cc94775..c8921f0 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -189,6 +189,8 @@ static void task_arr_free(struct ckpt_ctx *ctx)
 
 static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
+	BUG_ON(atomic_read(&ctx->refcount));
+
 	if (ctx->file)
 		fput(ctx->file);
 
@@ -202,6 +204,8 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->root_freezer)
 		put_task_struct(ctx->root_freezer);
 
+	kfree(ctx->pids_arr);
+
 	kfree(ctx);
 }
 
@@ -219,17 +223,32 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	ctx->kflags = kflags;
 	ctx->ktime_begin = ktime_get();
 
+	atomic_set(&ctx->refcount, 0);
+	init_waitqueue_head(&ctx->waitq);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
 		goto err;
 
+	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
 	ckpt_ctx_free(ctx);
 	return ERR_PTR(err);
 }
 
+void ckpt_ctx_get(struct ckpt_ctx *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+void ckpt_ctx_put(struct ckpt_ctx *ctx)
+{
+	if (ctx && atomic_dec_and_test(&ctx->refcount))
+		ckpt_ctx_free(ctx);
+}
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -261,7 +280,7 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
 	if (!ret)
 		ret = ctx->crid;
 
-	ckpt_ctx_free(ctx);
+	ckpt_ctx_put(ctx);
 	return ret;
 }
 
@@ -280,24 +299,20 @@ SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 	long ret;
 
 	/* no flags for now */
-	if (flags)
+	if (flags & ~RESTART_USER_FLAGS)
 		return -EINVAL;
 
 	if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
+	if (pid)
+		ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
 
 	ret = do_restart(ctx, pid);
 
-	/* restart(2) isn't idempotent: can't restart syscall */
-	if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
-	    ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
-		ret = -EINTR;
-
-	ckpt_ctx_free(ctx);
+	ckpt_ctx_put(ctx);
 	return ret;
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index a4650bb..4227b31 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -15,6 +15,10 @@
 /* checkpoint user flags */
 #define CHECKPOINT_SUBTREE	0x1
 
+/* restart user flags */
+#define RESTART_TASKSELF	0x1
+#define RESTART_FROZEN		0x2
+
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
 
@@ -25,13 +29,19 @@
 /* ckpt_ctx: kflags */
 #define CKPT_CTX_CHECKPOINT_BIT		0
 #define CKPT_CTX_RESTART_BIT		1
+#define CKPT_CTX_SUCCESS_BIT		2
+#define CKPT_CTX_ERROR_BIT		3
 
 #define CKPT_CTX_CHECKPOINT	(1 << CKPT_CTX_CHECKPOINT_BIT)
 #define CKPT_CTX_RESTART	(1 << CKPT_CTX_RESTART_BIT)
+#define CKPT_CTX_SUCCESS	(1 << CKPT_CTX_SUCCESS_BIT)
+#define CKPT_CTX_ERROR		(1 << CKPT_CTX_ERROR_BIT)
 
 /* ckpt_ctx: uflags */
 #define CHECKPOINT_USER_FLAGS		CHECKPOINT_SUBTREE
+#define RESTART_USER_FLAGS		(RESTART_TASKSELF | RESTART_FROZEN)
 
+extern void exit_checkpoint(struct task_struct *tsk);
 
 extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
 extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
@@ -60,6 +70,28 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+/* ckpt kflags */
+#define ckpt_set_ctx_kflag(__ctx, __kflag)  \
+	set_bit(__kflag##_BIT, &(__ctx)->kflags)
+#define ckpt_test_and_set_ctx_kflag(__ctx, __kflag)  \
+	test_and_set_bit(__kflag##_BIT, &(__ctx)->kflags)
+
+#define ckpt_set_ctx_success(ctx)  ckpt_set_ctx_kflag(ctx, CKPT_CTX_SUCCESS)
+
+static inline void ckpt_set_ctx_error(struct ckpt_ctx *ctx, int errno)
+{
+	if (!ckpt_test_and_set_ctx_kflag(ctx, CKPT_CTX_ERROR))
+		ctx->errno = errno;
+}
+
+#define ckpt_test_ctx_error(ctx)  \
+	((ctx)->kflags & CKPT_CTX_ERROR)
+#define ckpt_test_ctx_complete(ctx)  \
+	((ctx)->kflags & (CKPT_CTX_SUCCESS | CKPT_CTX_ERROR))
+
+extern void ckpt_ctx_get(struct ckpt_ctx *ctx);
+extern void ckpt_ctx_put(struct ckpt_ctx *ctx);
+
 extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
 extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
 
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index c3399b3..f74deac 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -16,6 +16,7 @@
 #include <linux/nsproxy.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
+#include <linux/wait.h>
 
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
@@ -35,11 +36,22 @@ struct ckpt_ctx {
 	struct file *file;	/* input/output file */
 	int total;		/* total read/written */
 
-	struct task_struct **tasks_arr;	/* array of all tasks in container */
-	int nr_tasks;			/* size of tasks array */
+	atomic_t refcount;
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
+	int errno;		/* restart: errno that caused failure */
+
+	/* [multi-process checkpoint] */
+	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+	int nr_tasks;                   /* size of tasks array */
+
+	/* [multi-process restart] */
+	struct ckpt_pids *pids_arr;	/* array of all pids [restart] */
+	int nr_pids;			/* size of pids array */
+	int active_pid;			/* (next) position in pids array */
+	struct completion complete;	/* container root and other tasks on */
+	wait_queue_head_t waitq;	/* start, end, and restart ordering */
 };
 
 #endif /* __KERNEL__ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2fa783d..91b57db 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1480,6 +1480,9 @@ struct task_struct {
 	/* bitmask of trace recursion */
 	unsigned long trace_recursion;
 #endif /* CONFIG_TRACING */
+#ifdef CONFIG_CHECKPOINT
+	struct ckpt_ctx *checkpoint_ctx;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
@@ -1694,6 +1697,7 @@ extern cputime_t task_gtime(struct task_struct *p);
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
 #define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
+#define PF_RESTARTING	0x08000000	/* Process is restarting (c/r) */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezeable */
diff --git a/kernel/exit.c b/kernel/exit.c
index 869dc22..912b1fa 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -49,6 +49,7 @@
 #include <linux/init_task.h>
 #include <linux/perf_counter.h>
 #include <trace/events/sched.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -992,6 +993,10 @@ NORET_TYPE void do_exit(long code)
 	if (unlikely(current->pi_state_cache))
 		kfree(current->pi_state_cache);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	if (unlikely(tsk->checkpoint_ctx))
+		exit_checkpoint(tsk);
+#endif
 	/*
 	 * Make sure we are holding no locks:
 	 */
diff --git a/kernel/fork.c b/kernel/fork.c
index f5a0cef..57118e4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -62,6 +62,7 @@
 #include <linux/fs_struct.h>
 #include <linux/magic.h>
 #include <linux/perf_counter.h>
+#include <linux/checkpoint.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1185,6 +1186,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	/* Need tasklist lock for parent etc handling! */
 	write_lock_irq(&tasklist_lock);
 
+#ifdef CONFIG_CHECKPOINT
+	/* If parent is restarting, child should be too */
+	if (unlikely(current->checkpoint_ctx)) {
+		p->checkpoint_ctx = current->checkpoint_ctx;
+		ckpt_ctx_get(p->checkpoint_ctx);
+	}
+#endif
 	/*
 	 * The task hasn't been attached yet, so its cpus_allowed mask will
 	 * not be changed, nor will its assigned CPU.
-- 
1.6.0.4
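
The comment "if we're first to fail - notify others" in
do_restore_task() suggests that only the first failure should be
recorded in the shared context. A minimal user-space sketch of that
"first error wins" idea, using C11 atomics instead of the kernel's
bitops, might look like the following. The names sketch_ctx,
sketch_set_error() and SKETCH_ERROR_BIT are hypothetical and are not
part of the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_ERROR_BIT 0x8u   /* hypothetical flag bit */

struct sketch_ctx {
	atomic_uint kflags;   /* status flags shared by all tasks */
	int first_errno;      /* written only by the first failing task */
};

/*
 * Record @err only if no earlier failure was recorded.
 * atomic_fetch_or() returns the previous flags, so exactly one
 * caller observes the bit as clear. Returns true for that caller.
 */
static bool sketch_set_error(struct sketch_ctx *ctx, int err)
{
	unsigned int old = atomic_fetch_or(&ctx->kflags, SKETCH_ERROR_BIT);

	if (old & SKETCH_ERROR_BIT)
		return false;         /* someone else failed first */
	ctx->first_errno = err;
	return true;
}
```

A reader that sees the error bit set may still race with the store to
first_errno, which is presumably why wait_all_tasks_finish() falls
back to -EINTR when ctx->errno is still zero.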


* [PATCH v18 27/80] c/r: introduce PF_RESTARTING, and skip notification on exit
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (25 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 26/80] c/r: restart " Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 28/80] c/r: support for zombie processes Oren Laadan
                     ` (54 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

To restore zombies we will create a task that, on its turn to
run, calls do_exit(). Unlike normal tasks that exit, we need to
prevent notification side effects that send signals to other
processes, e.g. parent (SIGCHLD) or child tasks (per child's request).

There are three main cases for such notifications:

1) do_notify_parent(): parent of a process is notified about a change
 in status (e.g. become zombie, reparent, etc). If the parent ignores
 SIGCHLD, then mark child for immediate release (skip zombie).

2) kill_orphaned_pgrp(): a process group that becomes orphaned will
 signal stopped jobs (HUP then CONT).

3) reparent_thread(): children of a process are signaled (per request)
 with p->pdeath_signal.
Remember that restoring signal state (for any restarting task) must
complete _before_ it is allowed to resume execution, and not during
the resume. Otherwise, a running task may send a signal to another
task that hasn't restored yet, so the new signal will be lost
soon after.

I considered two possible ways to address this:

1. Add another sync point to restart: all tasks will first restore
their state without signals (all signals blocked), and zombies call
do_exit(). A sync point then will ensure that all zombies are gone and
their effects done. Then all tasks restore their signal state (and
mask), and sync (new point) again. Only then they may resume
execution.
The main disadvantage is the added complexity and inefficiency,
for no good reason.

2. Introduce PF_RESTARTING: mark all restarting tasks with a new flag,
and teach the above three notifications to skip sending the signal if
this flag is set.
The main advantage is simplicity and completeness. Also, such a flag
may be useful later on. This is the method implemented.
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/exit.c   |    7 ++++++-
 kernel/signal.c |    4 ++++
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 912b1fa..41ac4cf 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -299,6 +299,10 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
 	struct pid *pgrp = task_pgrp(tsk);
 	struct task_struct *ignored_task = tsk;
 
+	/* restarting zombie doesn't trigger signals */
+	if (tsk->flags & PF_RESTARTING)
+		return;
+
 	if (!parent)
 		 /* exit: our father is in a different pgrp than
 		  * we are and we were the only connection outside.
@@ -739,7 +743,8 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 static void reparent_thread(struct task_struct *father, struct task_struct *p,
 				struct list_head *dead)
 {
-	if (p->pdeath_signal)
+	/* restarting zombie doesn't trigger signals */
+	if (p->pdeath_signal && !(p->flags & PF_RESTARTING))
 		group_send_sig_info(p->pdeath_signal, SEND_SIG_NOINFO, p);
 
 	list_move_tail(&p->sibling, &p->real_parent->children);
diff --git a/kernel/signal.c b/kernel/signal.c
index 64c5dee..ea217b0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1413,6 +1413,10 @@ int do_notify_parent(struct task_struct *tsk, int sig)
 	BUG_ON(!task_ptrace(tsk) &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 
+	/* restarting zombie doesn't notify parent */
+	if (tsk->flags & PF_RESTARTING)
+		return ret;
+
 	info.si_signo = sig;
 	info.si_errno = 0;
 	/*
-- 
1.6.0.4


* [PATCH v18 28/80] c/r: support for zombie processes
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (26 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 27/80] c/r: introduce PF_RESTARTING, and skip notification on exit Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 29/80] c/r: Save and restore the [compat_]robust_list member of the task struct Oren Laadan
                     ` (53 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

During checkpoint, a zombie process need only save p->comm,
p->state, p->exit_state, and p->exit_code.

During restart, zombie processes are created like all other
processes. They validate the saved exit_code, restore p->comm
and p->exit_code, and then call do_exit() instead of waking
up the next task in line.

Before calling do_exit(), they place the @ctx in p->checkpoint_ctx,
so that only at exit time do they wake up the next task in line
and drop the reference to the @ctx.

This provides the guarantee that when the coordinator's wait
completes, all normal tasks have completed their restart, and all
zombie tasks are already zombified (as opposed to perhaps only
becoming a zombie).
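
The ordering described above can be sketched as a single-threaded
user-space simulation. It is a simplified model, not the kernel
code: struct sim_task, struct sim_ctx, activate_next() and
run_restart() are hypothetical names, and real tasks run
concurrently and sleep on a wait queue rather than being stepped
through in a loop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct sim_task {
	bool is_zombie;   /* saved exit_state was a zombie */
	bool restored;    /* state has been restored */
	bool exited;      /* zombie has finished exiting */
};

struct sim_ctx {
	struct sim_task *tasks;
	size_t nr_tasks;
	size_t active;    /* index of the task whose turn it is */
	bool complete;    /* models the coordinator's completion */
};

/* Hand the turn to the next task, or signal the coordinator. */
static void activate_next(struct sim_ctx *ctx)
{
	if (++ctx->active >= ctx->nr_tasks)
		ctx->complete = true;
}

/*
 * Run every task's turn in order. A zombie hands off only from its
 * exit path, so it has exited before the next task starts.
 */
static void run_restart(struct sim_ctx *ctx)
{
	if (ctx->nr_tasks == 0) {
		ctx->complete = true;
		return;
	}
	while (!ctx->complete) {
		struct sim_task *t = &ctx->tasks[ctx->active];

		t->restored = true;
		if (t->is_zombie)
			t->exited = true;   /* exit first ... */
		activate_next(ctx);         /* ... then hand off */
	}
}

/* True if, at completion, every zombie has already exited. */
static bool all_zombies_exited(const struct sim_ctx *ctx)
{
	for (size_t i = 0; i < ctx->nr_tasks; i++)
		if (ctx->tasks[i].is_zombie && !ctx->tasks[i].exited)
			return false;
	return true;
}
```

Because the last hand-off is what sets the completion, the
coordinator cannot observe completion while any zombie in the chain
is still on its way to exiting.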

Changelog[v18]:
  - Fix leak of ckpt_ctx when restoring zombie tasks
  - Add a few more ckpt_write_err()s
Changelog[v17]:
  - Validate t->exit_signal for both threads and leader
  - Skip zombies in most of may_checkpoint_task()
  - Save/restore t->pdeath_signal
  - Validate ->exit_signal and ->pdeath_signal

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c        |   10 ++++--
 checkpoint/process.c           |   69 +++++++++++++++++++++++++++++++++++-----
 checkpoint/restart.c           |   41 +++++++++++++++++++++--
 include/linux/checkpoint.h     |    1 +
 include/linux/checkpoint_hdr.h |    1 +
 5 files changed, 107 insertions(+), 15 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index fc02436..93d7860 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -377,7 +377,7 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
 
-	if (t->state == TASK_DEAD) {
+	if (t->exit_state == EXIT_DEAD) {
 		__ckpt_write_err(ctx, "TE", "task state EXIT_DEAD\n", -EBUSY);
 		return -EBUSY;
 	}
@@ -387,6 +387,10 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -EPERM;
 	}
 
+	/* zombies are cool (and also don't have nsproxy, below...) */
+	if (t->exit_state)
+		return 0;
+
 	/* verify that all tasks belongs to same freezer cgroup */
 	if (t != current && !in_same_cgroup_freezer(t, ctx->root_freezer)) {
 		__ckpt_write_err(ctx, "TE", "unfrozen or wrong cgroup", -EBUSY);
@@ -403,8 +407,8 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	 * FIX: for now, disallow siblings of container init created
 	 * via CLONE_PARENT (unclear if they will remain possible)
 	 */
-	if (ctx->root_init && t != root && t->tgid != root->tgid &&
-	    t->real_parent == root->real_parent) {
+	if (ctx->root_init && t != root &&
+	    t->real_parent == root->real_parent && t->tgid != root->tgid) {
 		__ckpt_write_err(ctx, "TE", "task is sibling of root", -EINVAL);
 		return -EINVAL;
 	}
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 330c8d4..62ae72d 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -35,12 +35,18 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	h->state = t->state;
 	h->exit_state = t->exit_state;
 	h->exit_code = t->exit_code;
-	h->exit_signal = t->exit_signal;
 
-	h->set_child_tid = (unsigned long) t->set_child_tid;
-	h->clear_child_tid = (unsigned long) t->clear_child_tid;
+	if (t->exit_state) {
+		/* zombie - skip remaining state */
+		BUG_ON(t->exit_state != EXIT_ZOMBIE);
+	} else {
+		/* FIXME: save remaining relevant task_struct fields */
+		h->exit_signal = t->exit_signal;
+		h->pdeath_signal = t->pdeath_signal;
 
-	/* FIXME: save remaining relevant task_struct fields */
+		h->set_child_tid = (unsigned long) t->set_child_tid;
+		h->clear_child_tid = (unsigned long) t->clear_child_tid;
+	}
 
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
@@ -172,6 +178,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	if (ret < 0)
 		goto out;
+
+	/* zombie - we're done here */
+	if (t->exit_state)
+		return 0;
+
 	ret = checkpoint_thread(ctx, t);
 	ckpt_debug("thread %d\n", ret);
 	if (ret < 0)
@@ -191,6 +202,19 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
  * Restart
  */
 
+static inline int valid_exit_code(int exit_code)
+{
+	if (exit_code >= 0x10000)
+		return 0;
+	if (exit_code & 0xff) {
+		if (exit_code & ~0xff)
+			return 0;
+		if (!valid_signal(exit_code & 0xff))
+			return 0;
+	}
+	return 1;
+}
+
 /* read the task_struct into the current task */
 static int restore_task_struct(struct ckpt_ctx *ctx)
 {
@@ -202,15 +226,39 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 	if (IS_ERR(h))
 		return PTR_ERR(h);
 
+	ret = -EINVAL;
+	if (h->state == TASK_DEAD) {
+		if (h->exit_state != EXIT_ZOMBIE)
+			goto out;
+		if (!valid_exit_code(h->exit_code))
+			goto out;
+		t->exit_code = h->exit_code;
+	} else {
+		if (h->exit_code)
+			goto out;
+		if ((thread_group_leader(t) && !valid_signal(h->exit_signal)) ||
+		    (!thread_group_leader(t) && h->exit_signal != -1))
+			goto out;
+		if (!valid_signal(h->pdeath_signal))
+			goto out;
+
+		/* FIXME: restore remaining relevant task_struct fields */
+		t->exit_signal = h->exit_signal;
+		t->pdeath_signal = h->pdeath_signal;
+
+		t->set_child_tid =
+			(int __user *) (unsigned long) h->set_child_tid;
+		t->clear_child_tid =
+			(int __user *) (unsigned long) h->clear_child_tid;
+	}
+
 	memset(t->comm, 0, TASK_COMM_LEN);
 	ret = _ckpt_read_string(ctx, t->comm, TASK_COMM_LEN);
 	if (ret < 0)
 		goto out;
 
-	t->set_child_tid = (int __user *) (unsigned long) h->set_child_tid;
-	t->clear_child_tid = (int __user *) (unsigned long) h->clear_child_tid;
-
-	/* FIXME: restore remaining relevant task_struct fields */
+	/* return 1 for zombie, 0 otherwise */
+	ret = (h->state == TASK_DEAD ? 1 : 0);
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
@@ -330,6 +378,11 @@ int restore_task(struct ckpt_ctx *ctx)
 	ckpt_debug("task %d\n", ret);
 	if (ret < 0)
 		goto out;
+
+	/* zombie - we're done here */
+	if (ret)
+		goto out;
+
 	ret = restore_thread(ctx);
 	ckpt_debug("thread %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 4da09b7..d43eec7 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -473,17 +473,14 @@ do { \
 static int restore_activate_next(struct ckpt_ctx *ctx)
 {
 	struct task_struct *task;
-	int active;
 	pid_t pid;
 
-	active = ++ctx->active_pid;
-	if (active >= ctx->nr_pids) {
+	if (++ctx->active_pid >= ctx->nr_pids) {
 		complete(&ctx->complete);
 		return 0;
 	}
 
 	pid = get_active_pid(ctx);
-	ckpt_debug("active pid %d (%d < %d)\n", pid, active, ctx->nr_pids);
 
 	rcu_read_lock();
 	task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
@@ -511,6 +508,8 @@ static int wait_task_active(struct ckpt_ctx *ctx)
 	ret = wait_event_interruptible(ctx->waitq,
 				       is_task_active(ctx, pid) ||
 				       ckpt_test_ctx_error(ctx));
+	ckpt_debug("active %d < %d (ret %d)\n",
+		   ctx->active_pid, ctx->nr_pids, ret);
 	if (!ret && ckpt_test_ctx_error(ctx)) {
 		force_sig(SIGKILL, current);
 		ret = -EBUSY;
@@ -567,6 +566,8 @@ static int do_restore_task(void)
 		return -EAGAIN;
 	}
 
+	current->flags |= PF_RESTARTING;
+
 	/* wait for our turn, do the restore, and tell next task in line */
 	ret = wait_task_active(ctx);
 	if (ret < 0)
@@ -576,6 +577,16 @@ static int do_restore_task(void)
 	if (ret < 0)
 		goto out;
 
+	/*
+	 * zombie: we're done here; do_exit() will notice the @ctx on
+	 * our current->checkpoint_ctx (and our PF_RESTARTING) - it
+	 * will call restore_activate_next() and release the @ctx.
+	 */
+	if (ret) {
+		ckpt_ctx_put(ctx);
+		do_exit(current->exit_code);
+	}
+
 	ret = restore_activate_next(ctx);
 	if (ret < 0)
 		goto out;
@@ -592,6 +603,7 @@ static int do_restore_task(void)
 		wake_up_all(&ctx->waitq);
 	}
 
+	current->flags &= ~PF_RESTARTING;
 	ckpt_ctx_put(ctx);
 	return ret;
 }
@@ -929,3 +941,24 @@ long do_restart(struct ckpt_ctx *ctx, pid_t pid)
 
 	return ret;
 }
+
+/**
+ * exit_checkpoint - callback from do_exit to cleanup checkpoint state
+ * @tsk: terminating task
+ */
+void exit_checkpoint(struct task_struct *tsk)
+{
+	struct ckpt_ctx *ctx;
+
+	/* no one else will touch this, because @tsk is dead already */
+	ctx = xchg(&tsk->checkpoint_ctx, NULL);
+
+	/* restarting zombies will activate next task in restart */
+	if (tsk->flags & PF_RESTARTING) {
+		BUG_ON(ctx->active_pid == -1);
+		if (restore_activate_next(ctx) < 0)
+			pr_warning("c/r: [%d] failed zombie exit\n", tsk->pid);
+	}
+
+	ckpt_ctx_put(ctx);
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 4227b31..5c02d9b 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -96,6 +96,7 @@ extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
 extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
 
 /* task */
+extern int ckpt_activate_next(struct ckpt_ctx *ctx);
 extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task(struct ckpt_ctx *ctx);
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 26e10fb..8ae3bbe 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -132,6 +132,7 @@ struct ckpt_hdr_task {
 	__u32 exit_state;
 	__u32 exit_code;
 	__u32 exit_signal;
+	__u32 pdeath_signal;
 
 	__u64 set_child_tid;
 	__u64 clear_child_tid;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 29/80] c/r: Save and restore the [compat_]robust_list member of the task struct
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (27 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 28/80] c/r: support for zombie processes Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 30/80] c/r: infrastructure for shared objects Oren Laadan
                     ` (52 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

From: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

These lists record which futexes the task holds. To keep the overhead of
robust futexes low, the list is kept in userspace. When the task exits, the
kernel carefully walks these lists to recover held futexes that
other tasks may be attempting to acquire with FUTEX_WAIT.

Because they point to userspace memory that is saved/restored by
checkpoint/restart, saving the list pointers themselves is safe.

While saving the pointers is safe during checkpoint, restart is tricky
because the robust futex ABI contains provisions for changes based on
checking the size of the list head. So we need to save the length of
the list head too, in order to make sure that the kernel used during
restart is capable of handling that ABI. Since there is only one ABI
supported at the moment, taking the list head's size is simple. Should
the ABI change, we will need to use the same size as specified during
sys_set_robust_list(), and hence some new means of determining the length
of this userspace structure in sys_checkpoint would be required.

Rather than rewrite the logic that checks and handles the ABI, we reuse
sys_set_robust_list() by factoring out the body of the function and
calling it during restart.

Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
[orenl@cs.columbia.edu: move save/restore code to checkpoint/process.c]
---
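Note on the length check: the save/restore round trip described
above can be sketched in user space as follows. All model_* names
are hypothetical and exist only for this illustration; the struct
layout follows struct robust_list_head. Unlike the patch, whose
restore helper returns void and discards the result of
do_set_robust_list(), the sketch propagates the error so that the
effect of the size check is visible.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirrors of the robust list structures. */
struct model_robust_list {
	struct model_robust_list *next;
};

struct model_robust_list_head {
	struct model_robust_list list;
	long futex_offset;
	struct model_robust_list *list_op_pending;
};

struct model_task {
	struct model_robust_list_head *robust_list;
};

/* The two fields this patch adds to the task header (native case). */
struct model_hdr {
	uint64_t robust_futex_list;     /* saved user pointer */
	uint32_t robust_futex_head_len; /* saved size of the list head */
};

/* Mirrors the factored-out setter: refuse an unknown head size. */
static long model_set_robust_list(struct model_task *t,
				  struct model_robust_list_head *head,
				  size_t len)
{
	if (len != sizeof(*head))
		return -EINVAL;
	t->robust_list = head;
	return 0;
}

/* Checkpoint side: record the pointer and the size of the head. */
static void model_save(struct model_hdr *h, const struct model_task *t)
{
	h->robust_futex_list = (uint64_t)(uintptr_t)t->robust_list;
	h->robust_futex_head_len = sizeof(*t->robust_list);
}

/* Restart side: reuse the setter so the size check is applied. */
static long model_restore(struct model_task *t, const struct model_hdr *h)
{
	struct model_robust_list_head *head;

	if (!h->robust_futex_list)
		return 0; /* no list was registered at checkpoint */
	head = (struct model_robust_list_head *)(uintptr_t)h->robust_futex_list;
	return model_set_robust_list(t, head, h->robust_futex_head_len);
}
```

An image whose saved head length differs from what the restarting
side expects is refused with -EINVAL, and the task's list pointer is
left untouched.
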
 checkpoint/process.c           |   49 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    5 ++++
 include/linux/compat.h         |    3 +-
 include/linux/futex.h          |    1 +
 kernel/futex.c                 |   19 +++++++++-----
 kernel/futex_compat.c          |   13 ++++++++--
 6 files changed, 79 insertions(+), 11 deletions(-)

diff --git a/checkpoint/process.c b/checkpoint/process.c
index 62ae72d..2580b31 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -14,10 +14,57 @@
 #include <linux/sched.h>
 #include <linux/posix-timers.h>
 #include <linux/futex.h>
+#include <linux/compat.h>
 #include <linux/poll.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+
+#ifdef CONFIG_FUTEX
+static void save_task_robust_futex_list(struct ckpt_hdr_task *h,
+					struct task_struct *t)
+{
+	/*
+	 * These are __user pointers and thus can be saved without
+	 * the objhash.
+	 */
+	h->robust_futex_list = (unsigned long)t->robust_list;
+	h->robust_futex_head_len = sizeof(*t->robust_list);
+#ifdef CONFIG_COMPAT
+	h->compat_robust_futex_list = ptr_to_compat(t->compat_robust_list);
+	h->compat_robust_futex_head_len = sizeof(*t->compat_robust_list);
+#endif
+}
+
+static void restore_task_robust_futex_list(struct ckpt_hdr_task *h)
+{
+	/* Since we restore the memory map the address remains the same and
+	 * this is safe. This is the same as [compat_]sys_set_robust_list() */
+	if (h->robust_futex_list) {
+		struct robust_list_head __user *rfl;
+		rfl = (void __user *)(unsigned long) h->robust_futex_list;
+		do_set_robust_list(rfl, h->robust_futex_head_len);
+	}
+#ifdef CONFIG_COMPAT
+	if (h->compat_robust_futex_list) {
+		struct compat_robust_list_head __user *crfl;
+		crfl = compat_ptr(h->compat_robust_futex_list);
+		do_compat_set_robust_list(crfl, h->compat_robust_futex_head_len);
+	}
+#endif
+}
+#else /* !CONFIG_FUTEX */
+static inline void save_task_robust_futex_list(struct ckpt_hdr_task *h,
+					       struct task_struct *t)
+{
+}
+
+static inline void restore_task_robust_futex_list(struct ckpt_hdr_task *h)
+{
+}
+#endif /* CONFIG_FUTEX */
+
+
 /***********************************************************************
  * Checkpoint
  */
@@ -46,6 +93,7 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 
 		h->set_child_tid = (unsigned long) t->set_child_tid;
 		h->clear_child_tid = (unsigned long) t->clear_child_tid;
+		save_task_robust_futex_list(h, t);
 	}
 
 	ret = ckpt_write_obj(ctx, &h->h);
@@ -250,6 +298,7 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 			(int __user *) (unsigned long) h->set_child_tid;
 		t->clear_child_tid =
 			(int __user *) (unsigned long) h->clear_child_tid;
+		restore_task_robust_futex_list(h);
 	}
 
 	memset(t->comm, 0, TASK_COMM_LEN);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 8ae3bbe..b56fe71 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -136,6 +136,11 @@ struct ckpt_hdr_task {
 
 	__u64 set_child_tid;
 	__u64 clear_child_tid;
+
+	__u32 compat_robust_futex_head_len;
+	__u32 compat_robust_futex_list; /* a compat __user ptr */
+	__u32 robust_futex_head_len;
+	__u64 robust_futex_list; /* a __user ptr */
 } __attribute__((aligned(8)));
 
 /* restart blocks */
diff --git a/include/linux/compat.h b/include/linux/compat.h
index af931ee..f444cf0 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -165,7 +165,8 @@ struct compat_robust_list_head {
 };
 
 extern void compat_exit_robust_list(struct task_struct *curr);
-
+extern long do_compat_set_robust_list(struct compat_robust_list_head __user *head,
+				      compat_size_t len);
 asmlinkage long
 compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
 			   compat_size_t len);
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 4326f81..2e126a9 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -185,6 +185,7 @@ union futex_key {
 #define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = NULL } }
 
 #ifdef CONFIG_FUTEX
+extern long do_set_robust_list(struct robust_list_head __user *head, size_t len);
 extern void exit_robust_list(struct task_struct *curr);
 extern void exit_pi_state_list(struct task_struct *curr);
 extern int futex_cmpxchg_enabled;
diff --git a/kernel/futex.c b/kernel/futex.c
index def86c8..1ef7c8e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2278,13 +2278,7 @@ out:
  * the list. There can only be one such pending lock.
  */
 
-/**
- * sys_set_robust_list - set the robust-futex list head of a task
- * @head: pointer to the list-head
- * @len: length of the list-head, as userspace expects
- */
-SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
-		size_t, len)
+long do_set_robust_list(struct robust_list_head __user *head, size_t len)
 {
 	if (!futex_cmpxchg_enabled)
 		return -ENOSYS;
@@ -2300,6 +2294,17 @@ SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
 }
 
 /**
+ * sys_set_robust_list - set the robust-futex list head of a task
+ * @head: pointer to the list-head
+ * @len: length of the list-head, as userspace expects
+ */
+SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
+		size_t, len)
+{
+	return do_set_robust_list(head, len);
+}
+
+/**
  * sys_get_robust_list - get the robust-futex list head of a task
  * @pid: pid of the process [zero for current task]
  * @head_ptr: pointer to a list-head pointer, the kernel fills it in
diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index 2357165..5e1a169 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -114,9 +114,9 @@ void compat_exit_robust_list(struct task_struct *curr)
 	}
 }
 
-asmlinkage long
-compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
-			   compat_size_t len)
+long
+do_compat_set_robust_list(struct compat_robust_list_head __user *head,
+			  compat_size_t len)
 {
 	if (!futex_cmpxchg_enabled)
 		return -ENOSYS;
@@ -130,6 +130,13 @@ compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
 }
 
 asmlinkage long
+compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
+			   compat_size_t len)
+{
+	return do_compat_set_robust_list(head, len);
+}
+
+asmlinkage long
 compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr,
 			   compat_size_t __user *len_ptr)
 {
-- 
1.6.0.4


* [PATCH v18 29/80] c/r: Save and restore the [compat_]robust_list member of the task struct
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Matt Helsley

From: Matt Helsley <matthltc@us.ibm.com>

These lists record which futexes the task holds. To keep the overhead of
robust futexes low, the list is kept in userspace. When the task exits, the
kernel carefully walks these lists to recover held futexes that
other tasks may be attempting to acquire with FUTEX_WAIT.

Because they point to userspace memory that is saved/restored by
checkpoint/restart, saving the list pointers themselves is safe.
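
As a rough userspace illustration of the point above (not code from this
patch): a user pointer can be recorded in the image as a fixed-width
integer and cast back on restart, because the restored memory map places
the same data at the same address. All demo_* names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for the image fields that describe the list. */
struct demo_task_image {
	uint64_t robust_futex_list;	/* user address, widened to 64 bits */
	uint32_t robust_futex_head_len;	/* size of the list head it points to */
};

/* Hypothetical stand-in for the userspace list head. */
struct demo_list_head {
	void *next;
	long futex_offset;
	void *pending;
};

/* Checkpoint side: record the address itself, not the pointed-to data. */
static void demo_save(struct demo_task_image *img, struct demo_list_head *head)
{
	img->robust_futex_list = (uint64_t)(uintptr_t)head;
	img->robust_futex_head_len = (uint32_t)sizeof(*head);
}

/* Restart side: the address is meaningful again once memory is restored. */
static struct demo_list_head *demo_restore(const struct demo_task_image *img)
{
	return (struct demo_list_head *)(uintptr_t)img->robust_futex_list;
}
```

Nothing is dereferenced on either side; the list contents travel with the
task's memory, which is why no objhash entry is needed for them.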

While saving the pointers is safe during checkpoint, restart is tricky
because the robust futex ABI contains provisions for changes based on
checking the size of the list head. So we need to save the length of
the list head too, in order to make sure that the kernel used during
restart is capable of handling that ABI. Since there is only one ABI
supported at the moment, taking the list head's size is simple. Should
the ABI change, we will need to use the same size as specified during
sys_set_robust_list(), and hence some new means of determining the length
of this userspace structure in sys_checkpoint would be required.

Rather than rewrite the logic that checks and handles the ABI, we reuse
sys_set_robust_list() by factoring out the body of the function and
calling it during restart.
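
The factoring described above can be sketched in userspace as follows.
This is only a model: the demo_* names are hypothetical, and the size
comparison is an assumption about the part of the function body that the
diff context does not show.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_robust_list_head {
	void *next;
	long futex_offset;
	void *pending;
};

struct demo_task {
	struct demo_robust_list_head *robust_list;
};

static int demo_futex_cmpxchg_enabled = 1;

/* Shared body: callable from the syscall wrapper and from restart. */
static long demo_do_set_robust_list(struct demo_task *t,
				    struct demo_robust_list_head *head,
				    size_t len)
{
	assert(t != NULL);
	if (!demo_futex_cmpxchg_enabled)
		return -ENOSYS;
	/* Assumed ABI check: only one list-head size is understood. */
	if (len != sizeof(*head))
		return -EINVAL;
	t->robust_list = head;
	return 0;
}

/* Syscall-style wrapper: nothing but a call into the shared body. */
static long demo_sys_set_robust_list(struct demo_task *t,
				     struct demo_robust_list_head *head,
				     size_t len)
{
	return demo_do_set_robust_list(t, head, len);
}

/* Restart-style caller: replays the saved pointer and saved length. */
static long demo_restore_robust_list(struct demo_task *t,
				     uint64_t saved_ptr, uint32_t saved_len)
{
	if (!saved_ptr)
		return 0;	/* the task never registered a list */
	return demo_do_set_robust_list(t,
		(struct demo_robust_list_head *)(uintptr_t)saved_ptr,
		saved_len);
}
```

Unlike restore_task_robust_futex_list() in this patch, which discards the
return value, the sketch propagates it so that a length mismatch is
visible to the caller.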

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
[orenl@cs.columbia.edu: move save/restore code to checkpoint/process.c]
---
 checkpoint/process.c           |   49 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    5 ++++
 include/linux/compat.h         |    3 +-
 include/linux/futex.h          |    1 +
 kernel/futex.c                 |   19 +++++++++-----
 kernel/futex_compat.c          |   13 ++++++++--
 6 files changed, 79 insertions(+), 11 deletions(-)

diff --git a/checkpoint/process.c b/checkpoint/process.c
index 62ae72d..2580b31 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -14,10 +14,57 @@
 #include <linux/sched.h>
 #include <linux/posix-timers.h>
 #include <linux/futex.h>
+#include <linux/compat.h>
 #include <linux/poll.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+
+#ifdef CONFIG_FUTEX
+static void save_task_robust_futex_list(struct ckpt_hdr_task *h,
+					struct task_struct *t)
+{
+	/*
+	 * These are __user pointers and thus can be saved without
+	 * the objhash.
+	 */
+	h->robust_futex_list = (unsigned long)t->robust_list;
+	h->robust_futex_head_len = sizeof(*t->robust_list);
+#ifdef CONFIG_COMPAT
+	h->compat_robust_futex_list = ptr_to_compat(t->compat_robust_list);
+	h->compat_robust_futex_head_len = sizeof(*t->compat_robust_list);
+#endif
+}
+
+static void restore_task_robust_futex_list(struct ckpt_hdr_task *h)
+{
+	/* Since we restore the memory map the address remains the same and
+	 * this is safe. This is the same as [compat_]sys_set_robust_list() */
+	if (h->robust_futex_list) {
+		struct robust_list_head __user *rfl;
+		rfl = (void __user *)(unsigned long) h->robust_futex_list;
+		do_set_robust_list(rfl, h->robust_futex_head_len);
+	}
+#ifdef CONFIG_COMPAT
+	if (h->compat_robust_futex_list) {
+		struct compat_robust_list_head __user *crfl;
+		crfl = compat_ptr(h->compat_robust_futex_list);
+		do_compat_set_robust_list(crfl, h->compat_robust_futex_head_len);
+	}
+#endif
+}
+#else /* !CONFIG_FUTEX */
+static inline void save_task_robust_futex_list(struct ckpt_hdr_task *h,
+					       struct task_struct *t)
+{
+}
+
+static inline void restore_task_robust_futex_list(struct ckpt_hdr_task *h)
+{
+}
+#endif /* CONFIG_FUTEX */
+
+
 /***********************************************************************
  * Checkpoint
  */
@@ -46,6 +93,7 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 
 		h->set_child_tid = (unsigned long) t->set_child_tid;
 		h->clear_child_tid = (unsigned long) t->clear_child_tid;
+		save_task_robust_futex_list(h, t);
 	}
 
 	ret = ckpt_write_obj(ctx, &h->h);
@@ -250,6 +298,7 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 			(int __user *) (unsigned long) h->set_child_tid;
 		t->clear_child_tid =
 			(int __user *) (unsigned long) h->clear_child_tid;
+		restore_task_robust_futex_list(h);
 	}
 
 	memset(t->comm, 0, TASK_COMM_LEN);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 8ae3bbe..b56fe71 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -136,6 +136,11 @@ struct ckpt_hdr_task {
 
 	__u64 set_child_tid;
 	__u64 clear_child_tid;
+
+	__u32 compat_robust_futex_head_len;
+	__u32 compat_robust_futex_list; /* a compat __user ptr */
+	__u32 robust_futex_head_len;
+	__u64 robust_futex_list; /* a __user ptr */
 } __attribute__((aligned(8)));
 
 /* restart blocks */
diff --git a/include/linux/compat.h b/include/linux/compat.h
index af931ee..f444cf0 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -165,7 +165,8 @@ struct compat_robust_list_head {
 };
 
 extern void compat_exit_robust_list(struct task_struct *curr);
-
+extern long do_compat_set_robust_list(struct compat_robust_list_head __user *head,
+				      compat_size_t len);
 asmlinkage long
 compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
 			   compat_size_t len);
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 4326f81..2e126a9 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -185,6 +185,7 @@ union futex_key {
 #define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = NULL } }
 
 #ifdef CONFIG_FUTEX
+extern long do_set_robust_list(struct robust_list_head __user *head, size_t len);
 extern void exit_robust_list(struct task_struct *curr);
 extern void exit_pi_state_list(struct task_struct *curr);
 extern int futex_cmpxchg_enabled;
diff --git a/kernel/futex.c b/kernel/futex.c
index def86c8..1ef7c8e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2278,13 +2278,7 @@ out:
  * the list. There can only be one such pending lock.
  */
 
-/**
- * sys_set_robust_list - set the robust-futex list head of a task
- * @head: pointer to the list-head
- * @len: length of the list-head, as userspace expects
- */
-SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
-		size_t, len)
+long do_set_robust_list(struct robust_list_head __user *head, size_t len)
 {
 	if (!futex_cmpxchg_enabled)
 		return -ENOSYS;
@@ -2300,6 +2294,17 @@ SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
 }
 
 /**
+ * sys_set_robust_list - set the robust-futex list head of a task
+ * @head: pointer to the list-head
+ * @len: length of the list-head, as userspace expects
+ */
+SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
+		size_t, len)
+{
+	return do_set_robust_list(head, len);
+}
+
+/**
  * sys_get_robust_list - get the robust-futex list head of a task
  * @pid: pid of the process [zero for current task]
  * @head_ptr: pointer to a list-head pointer, the kernel fills it in
diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index 2357165..5e1a169 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -114,9 +114,9 @@ void compat_exit_robust_list(struct task_struct *curr)
 	}
 }
 
-asmlinkage long
-compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
-			   compat_size_t len)
+long
+do_compat_set_robust_list(struct compat_robust_list_head __user *head,
+			  compat_size_t len)
 {
 	if (!futex_cmpxchg_enabled)
 		return -ENOSYS;
@@ -130,6 +130,13 @@ compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
 }
 
 asmlinkage long
+compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
+			   compat_size_t len)
+{
+	return do_compat_set_robust_list(head, len);
+}
+
+asmlinkage long
 compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr,
 			   compat_size_t __user *len_ptr)
 {
-- 
1.6.0.4



* [PATCH v18 30/80] c/r: infrastructure for shared objects
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (28 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 29/80] c/r: Save and restore the [compat_]robust_list member of the task struct Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 31/80] c/r: detect resource leaks for whole-container checkpoint Oren Laadan
                     ` (51 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by the address of the kernel
object). From then on the object will be found in the hash and only its
identifier is saved.
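
A minimal userspace sketch of the checkpoint-side behaviour described
above, assuming a small chained hash keyed by the object's address. The
demo_* names and the multiplicative hash are hypothetical stand-ins, not
the code in objhash.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_HASH_NBITS	4
#define DEMO_HASH_TOTAL	(1u << DEMO_HASH_NBITS)

struct demo_obj {
	int objref;		/* identifier written to the image */
	const void *ptr;	/* address of the live object (hash key) */
	struct demo_obj *next;	/* bucket chain */
};

struct demo_obj_hash {
	struct demo_obj *head[DEMO_HASH_TOTAL];
	int next_free_objref;
};

static void demo_hash_init(struct demo_obj_hash *h)
{
	unsigned int i;

	for (i = 0; i < DEMO_HASH_TOTAL; i++)
		h->head[i] = NULL;
	h->next_free_objref = 1;	/* 0 is kept for "not found" */
}

/* Multiplicative hash of the pointer value; a stand-in for hash_long(). */
static unsigned int demo_hash_ptr(const void *ptr)
{
	uint64_t v = (uint64_t)(uintptr_t)ptr;

	return (unsigned int)((v * 0x9E3779B97F4A7C15ULL) >>
			      (64 - DEMO_HASH_NBITS));
}

/*
 * Return the objref for @ptr, adding it on first encounter.
 * *first is 1 when the caller must dump the full state, 0 when
 * writing the objref alone is enough. Returns -1 if out of memory.
 */
static int demo_lookup_add(struct demo_obj_hash *h, const void *ptr,
			   int *first)
{
	unsigned int i = demo_hash_ptr(ptr);
	struct demo_obj *obj;

	assert(ptr != NULL);
	for (obj = h->head[i]; obj; obj = obj->next) {
		if (obj->ptr == ptr) {
			*first = 0;
			return obj->objref;
		}
	}

	obj = malloc(sizeof(*obj));
	if (!obj)
		return -1;
	obj->ptr = ptr;
	obj->objref = h->next_free_objref++;
	obj->next = h->head[i];
	h->head[i] = obj;
	*first = 1;
	return obj->objref;
}

static void demo_hash_free(struct demo_obj_hash *h)
{
	unsigned int i;

	for (i = 0; i < DEMO_HASH_TOTAL; i++) {
		while (h->head[i]) {
			struct demo_obj *obj = h->head[i];

			h->head[i] = obj->next;
			free(obj);
		}
	}
}
```

Two tasks sharing one object would both call demo_lookup_add() with the
same address; only the first call reports a first encounter, so the state
is written once and later references carry just the objref.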

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.
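
The restart side can be modelled the same way, this time keyed by the
identifier. The sketch below is an assumption-laden illustration:
demo_restore_state() is a hypothetical placeholder for "read the state
and create the object", and the modulo bucket choice is a simplification.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_HASH_TOTAL	16

struct demo_obj {
	int objref;		/* identifier read from the image (hash key) */
	void *ptr;		/* object re-created during restart */
	struct demo_obj *next;
};

struct demo_obj_hash {
	struct demo_obj *head[DEMO_HASH_TOTAL];
};

static struct demo_obj *demo_find_by_objref(struct demo_obj_hash *h,
					    int objref)
{
	struct demo_obj *obj = h->head[(unsigned int)objref % DEMO_HASH_TOTAL];

	for (; obj; obj = obj->next)
		if (obj->objref == objref)
			return obj;
	return NULL;
}

/* Hypothetical "read state, build object" step; counts its invocations. */
static int demo_restore_calls;

static void *demo_restore_state(void)
{
	demo_restore_calls++;
	return calloc(1, sizeof(int));
}

/*
 * Return the object for @objref. The first reference creates it and
 * records it in the hash; later references reuse the recorded object.
 */
static void *demo_restart_obj(struct demo_obj_hash *h, int objref)
{
	unsigned int i = (unsigned int)objref % DEMO_HASH_TOTAL;
	struct demo_obj *obj;

	assert(objref > 0);
	obj = demo_find_by_objref(h, objref);
	if (obj)
		return obj->ptr;

	obj = malloc(sizeof(*obj));
	if (!obj)
		return NULL;
	obj->ptr = demo_restore_state();
	if (!obj->ptr) {
		free(obj);
		return NULL;
	}
	obj->objref = objref;
	obj->next = h->head[i];
	h->head[i] = obj;
	return obj->ptr;
}
```

Because lookups go by objref rather than by address, the restored object
may live at a different kernel address than it did at checkpoint time.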

The hash is "one-way": objects added to it are never deleted until the
hash is discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.

The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.
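
The lifetime rule above (grab a reference on insertion, drop it only when
the whole hash is discarded) can be sketched like this. The demo_* types,
the plain integer refcount and the single linked list are hypothetical
simplifications of the per-type ops table.

```c
#include <assert.h>
#include <stdlib.h>

/* Per-type callbacks, loosely modelled on the grab/drop pair. */
struct demo_ops {
	int (*ref_grab)(void *ptr);
	void (*ref_drop)(void *ptr);
};

/* Hypothetical refcounted object type. */
struct demo_file {
	int refcount;
};

static int demo_file_grab(void *ptr)
{
	((struct demo_file *)ptr)->refcount++;
	return 0;
}

static void demo_file_drop(void *ptr)
{
	((struct demo_file *)ptr)->refcount--;
}

static const struct demo_ops demo_file_ops = {
	.ref_grab = demo_file_grab,
	.ref_drop = demo_file_drop,
};

struct demo_obj {
	void *ptr;
	const struct demo_ops *ops;
	struct demo_obj *next;
};

/* One list instead of buckets, to keep the sketch short. */
struct demo_obj_hash {
	struct demo_obj *all;
};

/* Insert: take a reference first; only a referenced object is linked in. */
static int demo_hash_add(struct demo_obj_hash *h, void *ptr,
			 const struct demo_ops *ops)
{
	struct demo_obj *obj;
	int ret;

	assert(ptr != NULL);
	obj = malloc(sizeof(*obj));
	if (!obj)
		return -1;
	obj->ptr = ptr;
	obj->ops = ops;

	ret = ops->ref_grab(ptr);
	if (ret < 0) {
		free(obj);
		return ret;
	}
	obj->next = h->all;
	h->all = obj;
	return 0;
}

/* Discard: the only place where entries are removed and references dropped. */
static void demo_hash_discard(struct demo_obj_hash *h)
{
	while (h->all) {
		struct demo_obj *obj = h->all;

		h->all = obj->next;
		obj->ops->ref_drop(obj->ptr);
		free(obj);
	}
}
```

With no per-entry delete operation, every pointer stored in the hash stays
pinned for as long as the hash exists, which is what makes it safe to use.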

Changelog[v18]:
  - Add ckpt_obj_reserve()
  - Change ref_drop() to accept a @lastref argument (useful for cleanup)
  - Disallow multiple objects with same objref in restart
  - Allow _ckpt_read_obj_type() to read object header only (w/o payload)
Changelog[v17]:
  - Add ckpt_obj->flags with CKPT_OBJ_CHECKPOINTED flag
  - Add prototype of ckpt_obj_lookup
  - Complain on attempt to add NULL ptr to objhash
  - Prepare for 'leaks detection'
Changelog[v16]:
  - Introduce ckpt_obj_lookup() to find an object by its ptr
Changelog[v14]:
  - Introduce 'struct ckpt_obj_ops' to better modularize shared objs.
  - Replace long 'switch' statements with table lookups and callbacks.
  - Introduce checkpoint_obj() and restart_obj() helpers
  - Shared objects now dumped/saved right before they are referenced
  - Cleanup interface of shared objects
Changelog[v13]:
  - Use hash_long() with 'unsigned long' cast to support 64bit archs
    (Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>)
Changelog[v11]:
  - Doc: be explicit about grabbing a reference and object lifetime
Changelog[v4]:
  - Fix calculation of hash table size
Changelog[v3]:
  - Use standard hlist_... for hash table

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile              |    1 +
 checkpoint/objhash.c             |  439 ++++++++++++++++++++++++++++++++++++++
 checkpoint/restart.c             |   42 ++++
 checkpoint/sys.c                 |    6 +
 include/linux/checkpoint.h       |   18 ++
 include/linux/checkpoint_hdr.h   |   14 ++
 include/linux/checkpoint_types.h |    2 +
 7 files changed, 522 insertions(+), 0 deletions(-)
 create mode 100644 checkpoint/objhash.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 99364cc..5aa6a75 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -4,6 +4,7 @@
 
 obj-$(CONFIG_CHECKPOINT) += \
 	sys.o \
+	objhash.o \
 	checkpoint.o \
 	restart.o \
 	process.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..e7f5393
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,439 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DOBJ
+
+#include <linux/kernel.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+struct ckpt_obj;
+struct ckpt_obj_ops;
+
+/* object operations */
+struct ckpt_obj_ops {
+	char *obj_name;
+	enum obj_type obj_type;
+	void (*ref_drop)(void *ptr, int lastref);
+	int (*ref_grab)(void *ptr);
+	int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
+	void *(*restore)(struct ckpt_ctx *ctx);
+};
+
+struct ckpt_obj {
+	int objref;
+	int flags;
+	void *ptr;
+	struct ckpt_obj_ops *ops;
+	struct hlist_node hash;
+};
+
+/* object internal flags */
+#define CKPT_OBJ_CHECKPOINTED		0x1   /* object already checkpointed */
+
+struct ckpt_obj_hash {
+	struct hlist_head *head;
+	int next_free_objref;
+};
+
+/* helper grab/drop functions: */
+
+static void obj_no_drop(void *ptr, int lastref)
+{
+	return;
+}
+
+static int obj_no_grab(void *ptr)
+{
+	return 0;
+}
+
+static struct ckpt_obj_ops ckpt_obj_ops[] = {
+	/* ignored object */
+	{
+		.obj_name = "IGNORED",
+		.obj_type = CKPT_OBJ_IGNORE,
+		.ref_drop = obj_no_drop,
+		.ref_grab = obj_no_grab,
+	},
+};
+
+
+#define CKPT_OBJ_HASH_NBITS  10
+#define CKPT_OBJ_HASH_TOTAL  (1UL << CKPT_OBJ_HASH_NBITS)
+
+static void obj_hash_clear(struct ckpt_obj_hash *obj_hash)
+{
+	struct hlist_head *h = obj_hash->head;
+	struct hlist_node *n, *t;
+	struct ckpt_obj *obj;
+	int i;
+
+	for (i = 0; i < CKPT_OBJ_HASH_TOTAL; i++) {
+		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+			obj->ops->ref_drop(obj->ptr, 1);
+			kfree(obj);
+		}
+	}
+}
+
+void ckpt_obj_hash_free(struct ckpt_ctx *ctx)
+{
+	struct ckpt_obj_hash *obj_hash = ctx->obj_hash;
+
+	if (obj_hash) {
+		obj_hash_clear(obj_hash);
+		kfree(obj_hash->head);
+		kfree(ctx->obj_hash);
+		ctx->obj_hash = NULL;
+	}
+}
+
+int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx)
+{
+	struct ckpt_obj_hash *obj_hash;
+	struct hlist_head *head;
+
+	obj_hash = kzalloc(sizeof(*obj_hash), GFP_KERNEL);
+	if (!obj_hash)
+		return -ENOMEM;
+	head = kzalloc(CKPT_OBJ_HASH_TOTAL * sizeof(*head), GFP_KERNEL);
+	if (!head) {
+		kfree(obj_hash);
+		return -ENOMEM;
+	}
+
+	obj_hash->head = head;
+	obj_hash->next_free_objref = 1;
+
+	ctx->obj_hash = obj_hash;
+	return 0;
+}
+
+static struct ckpt_obj *obj_find_by_ptr(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct ckpt_obj *obj;
+
+	h = &ctx->obj_hash->head[hash_long((unsigned long) ptr,
+					   CKPT_OBJ_HASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->ptr == ptr)
+			return obj;
+	return NULL;
+}
+
+static struct ckpt_obj *obj_find_by_objref(struct ckpt_ctx *ctx, int objref)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct ckpt_obj *obj;
+
+	h = &ctx->obj_hash->head[hash_long((unsigned long) objref,
+					   CKPT_OBJ_HASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->objref == objref)
+			return obj;
+	return NULL;
+}
+
+static inline int obj_alloc_objref(struct ckpt_ctx *ctx)
+{
+	return ctx->obj_hash->next_free_objref++;
+}
+
+/**
+ * obj_new - add an object to the obj_hash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: object unique id
+ * @type: object type
+ *
+ * Add the object to the obj_hash. If @objref is zero, assign a unique
+ * object id and use @ptr as a hash key [checkpoint]. Else use @objref
+ * as a key [restart].
+ */
+static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void *ptr,
+				int objref, enum obj_type type)
+{
+	struct ckpt_obj_ops *ops = &ckpt_obj_ops[type];
+	struct ckpt_obj *obj;
+	int i, ret;
+
+	/* explicitly disallow null pointers */
+	BUG_ON(!ptr);
+	/* make sure we don't change this accidentally */
+	BUG_ON(ops->obj_type != type);
+
+	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return ERR_PTR(-ENOMEM);
+
+	obj->ptr = ptr;
+	obj->ops = ops;
+
+	if (!objref) {
+		/* use @obj->ptr to index, assign objref (checkpoint) */
+		obj->objref = obj_alloc_objref(ctx);
+		i = hash_long((unsigned long) ptr, CKPT_OBJ_HASH_NBITS);
+	} else {
+		/* use @obj->objref to index (restart) */
+		obj->objref = objref;
+		i = hash_long((unsigned long) objref, CKPT_OBJ_HASH_NBITS);
+	}
+
+	ret = ops->ref_grab(obj->ptr);
+	if (ret < 0) {
+		kfree(obj);
+		obj = ERR_PTR(ret);
+	} else {
+		hlist_add_head(&obj->hash, &ctx->obj_hash->head[i]);
+	}
+
+	return obj;
+}
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * obj_lookup_add - lookup object and add if not in objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ * @first: [output] first encounter (added to table)
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it isn't
+ * already found there, add the object, and allocate a unique object
+ * id. Grab a reference to every object that is added, and maintain the
+ * reference until the entire hash is freed.
+ */
+static struct ckpt_obj *obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+				       enum obj_type type, int *first)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_ptr(ctx, ptr);
+	if (!obj) {
+		obj = obj_new(ctx, ptr, 0, type);
+		*first = 1;
+	} else {
+		BUG_ON(obj->ops->obj_type != type);
+		*first = 0;
+	}
+	return obj;
+}
+
+/**
+ * ckpt_obj_lookup - lookup object (by pointer) in objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * [used during checkpoint].
+ * Return: objref (or zero if not found)
+ */
+int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_ptr(ctx, ptr);
+	BUG_ON(obj && obj->ops->obj_type != type);
+	if (obj)
+		ckpt_debug("%s objref %d\n", obj->ops->obj_name, obj->objref);
+	return obj ? obj->objref : 0;
+}
+
+/**
+ * ckpt_obj_lookup_add - lookup object and add if not in objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ * @first: [output] first encounter (added to table)
+ *
+ * [used during checkpoint].
+ * Return: objref
+ */
+int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+			enum obj_type type, int *first)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_lookup_add(ctx, ptr, type, first);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+	ckpt_debug("%s objref %d first %d\n",
+		   obj->ops->obj_name, obj->objref, *first);
+	obj->flags |= CKPT_OBJ_CHECKPOINTED;
+	return obj->objref;
+}
+
+/**
+ * ckpt_obj_reserve - reserve an objref
+ * @ctx: checkpoint context
+ *
+ * The reserved objref will not be used for subsequent objects. This
+ * gives an objref that can be safely used during restart without a
+ * matching object in checkpoint.  [used during checkpoint].
+ */
+int ckpt_obj_reserve(struct ckpt_ctx *ctx)
+{
+	return obj_alloc_objref(ctx);
+}
+
+/**
+ * checkpoint_obj - if not already in hash, add object and checkpoint
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * Use obj_lookup_add() to lookup (and possibly add) the object to the
+ * hash table. If the CKPT_OBJ_CHECKPOINTED flag isn't set, then also
+ * save the object's state using its ops->checkpoint().
+ *
+ * [This is used during checkpoint].
+ * Returns: objref
+ */
+int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+	struct ckpt_hdr_objref *h;
+	struct ckpt_obj *obj;
+	int new, ret = 0;
+
+	obj = obj_lookup_add(ctx, ptr, type, &new);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+
+	if (!(obj->flags & CKPT_OBJ_CHECKPOINTED)) {
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_OBJREF);
+		if (!h)
+			return -ENOMEM;
+
+		h->objtype = type;
+		h->objref = obj->objref;
+		ret = ckpt_write_obj(ctx, &h->h);
+		ckpt_hdr_put(ctx, h);
+
+		if (ret < 0)
+			return ret;
+
+		/* invoke callback to actually dump the state */
+		if (obj->ops->checkpoint)
+			ret = obj->ops->checkpoint(ctx, ptr);
+
+		obj->flags |= CKPT_OBJ_CHECKPOINTED;
+	}
+	return (ret < 0 ? ret : obj->objref);
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * restore_obj - read in and restore a (first seen) shared object
+ * @ctx: checkpoint context
+ * @h: ckpt_hdr of shared object
+ *
+ * Read in the header payload (struct ckpt_hdr_objref). Lookup the
+ * object to verify it isn't there.  Then restore the object's state
+ * and add it to the objhash. No need to explicitly grab a reference -
+ * we hold the initial instance of this object. (Object maintained
+ * until the entire hash is freed).
+ *
+ * [This is used during restart].
+ */
+int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h)
+{
+	struct ckpt_obj_ops *ops;
+	struct ckpt_obj *obj;
+	void *ptr = NULL;
+
+	ckpt_debug("len %d ref %d type %d\n", h->h.len, h->objref, h->objtype);
+	if (h->objtype >= CKPT_OBJ_MAX)
+		return -EINVAL;
+
+	ops = &ckpt_obj_ops[h->objtype];
+	BUG_ON(ops->obj_type != h->objtype);
+
+	if (ops->restore)
+		ptr = ops->restore(ctx);
+	if (IS_ERR(ptr))
+		return PTR_ERR(ptr);
+
+	if (obj_find_by_objref(ctx, h->objref))
+		obj = ERR_PTR(-EINVAL);
+	else
+		obj = obj_new(ctx, ptr, h->objref, h->objtype);
+	/*
+	 * Drop an extra reference to the object returned by ops->restore:
+	 * On success, this clears the extra reference taken by obj_new(),
+	 * and on failure, this cleans up the object itself.
+	 */
+	ops->ref_drop(ptr, 0);
+	if (IS_ERR(obj)) {
+		ops->ref_drop(ptr, 1);
+		return PTR_ERR(obj);
+	}
+	return obj->objref;
+}
+
+/**
+ * ckpt_obj_insert - add an object with a given objref to obj_hash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object id
+ * @type: object type
+ *
+ * Add the object pointed to by @ptr and identified by unique object id
+ * @objref to the hash table (indexed by @objref).  Grab a reference to
+ * every object added, and maintain it until the entire hash is freed.
+ *
+ * [This is used during restart].
+ */
+int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr,
+		    int objref, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+
+	if (obj_find_by_objref(ctx, objref))
+		return -EINVAL;
+	obj = obj_new(ctx, ptr, objref, type);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+	ckpt_debug("%s objref %d\n", obj->ops->obj_name, objref);
+	return obj->objref;
+}
+
+/**
+ * ckpt_obj_fetch - fetch an object by its identifier
+ * @ctx: checkpoint context
+ * @objref: object id
+ * @type: object type
+ *
+ * Look up the object identified by @objref in the hash table. Return
+ * an error if it is not found or if its type does not match @type.
+ *
+ * [This is used during restart].
+ */
+void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_objref(ctx, objref);
+	if (!obj)
+		return ERR_PTR(-EINVAL);
+	ckpt_debug("%s ref %d\n", obj->ops->obj_name, obj->objref);
+	return (obj->ops->obj_type == type ? obj->ptr : ERR_PTR(-ENOMSG));
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index d43eec7..73db44a 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -47,6 +47,34 @@ static int _ckpt_read_err(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
 }
 
 /**
+ * _ckpt_read_objref - dispatch handling of a shared object
+ * @ctx: checkpoint context
+ * @hh: object descriptor
+ */
+static int _ckpt_read_objref(struct ckpt_ctx *ctx, struct ckpt_hdr *hh)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get(ctx, hh->len);
+	if (!h)
+		return -ENOMEM;
+
+	*h = *hh;	/* yay ! */
+
+	_ckpt_debug(CKPT_DOBJ, "shared len %d type %d\n", h->len, h->type);
+	ret = ckpt_kread(ctx, (h + 1), hh->len - sizeof(struct ckpt_hdr));
+	if (ret < 0)
+		goto out;
+
+	ret = restore_obj(ctx, (struct ckpt_hdr_objref *) h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+
+/**
  * _ckpt_read_obj - read an object (ckpt_hdr followed by payload)
  * @ctx: checkpoint context
  * @h: desired ckpt_hdr
@@ -75,6 +103,11 @@ static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
 		if (ret < 0)
 			return ret;
 		goto again;
+	} else if (h->type == CKPT_HDR_OBJREF) {
+		ret = _ckpt_read_objref(ctx, h);
+		if (ret < 0)
+			return ret;
+		goto again;
 	}
 
 	/* if len specified, enforce, else if maximum specified, enforce */
@@ -164,6 +197,7 @@ static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
 	struct ckpt_hdr *h;
 	int ret;
 
+ again:
 	ret = ckpt_kread(ctx, &hh, sizeof(hh));
 	if (ret < 0)
 		return ERR_PTR(ret);
@@ -171,6 +205,14 @@ static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
 		    hh.type, hh.len, len, max);
 	if (hh.len < sizeof(*h))
 		return ERR_PTR(-EINVAL);
+
+	if (hh.type == CKPT_HDR_OBJREF) {
+		ret = _ckpt_read_objref(ctx, &hh);
+		if (ret < 0)
+			return ERR_PTR(ret);
+		goto again;
+	}
+
 	/* if len specified, enforce, else if maximum specified, enforce */
 	if ((len && hh.len != len) || (!len && max && hh.len > max))
 		return ERR_PTR(-EINVAL);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index c8921f0..d16d48f 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -194,6 +194,8 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->file)
 		fput(ctx->file);
 
+	ckpt_obj_hash_free(ctx);
+
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
 
@@ -231,6 +233,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (!ctx->file)
 		goto err;
 
+	err = -ENOMEM;
+	if (ckpt_obj_hash_alloc(ctx) < 0)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 5c02d9b..1315884 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -89,6 +89,23 @@ static inline void ckpt_set_ctx_error(struct ckpt_ctx *ctx, int errno)
 #define ckpt_test_ctx_complete(ctx)  \
 	((ctx)->kflags & (CKPT_CTX_SUCCESS | CKPT_CTX_ERROR))
 
+/* obj_hash */
+extern void ckpt_obj_hash_free(struct ckpt_ctx *ctx);
+extern int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx);
+
+extern int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h);
+extern int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr,
+			  enum obj_type type);
+extern int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr,
+			   enum obj_type type);
+extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+			       enum obj_type type, int *first);
+extern void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref,
+			    enum obj_type type);
+extern int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr, int objref,
+			   enum obj_type type);
+extern int ckpt_obj_reserve(struct ckpt_ctx *ctx);
+
 extern void ckpt_ctx_get(struct ckpt_ctx *ctx);
 extern void ckpt_ctx_put(struct ckpt_ctx *ctx);
 
@@ -122,6 +139,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DBASE	0x1		/* anything */
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
+#define CKPT_DOBJ	0x8		/* shared objects */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b56fe71..7a4015b 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -50,6 +50,7 @@ enum {
 	CKPT_HDR_HEADER_ARCH,
 	CKPT_HDR_BUFFER,
 	CKPT_HDR_STRING,
+	CKPT_HDR_OBJREF,
 
 	CKPT_HDR_TREE = 101,
 	CKPT_HDR_TASK,
@@ -69,6 +70,19 @@ enum {
 	CKPT_ARCH_X86_32 = 1,
 };
 
+/* shared objects (objref) */
+struct ckpt_hdr_objref {
+	struct ckpt_hdr h;
+	__u32 objtype;
+	__s32 objref;
+} __attribute__((aligned(8)));
+
+/* shared object types */
+enum obj_type {
+	CKPT_OBJ_IGNORE = 0,
+	CKPT_OBJ_MAX
+};
+
 /* kernel constants */
 struct ckpt_const {
 	/* task */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index f74deac..f11fd07 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -38,6 +38,8 @@ struct ckpt_ctx {
 
 	atomic_t refcount;
 
+	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
 	int errno;		/* restart: errno that caused failure */
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 30/80] c/r: infrastructure for shared objects
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its kernel address, i.e. the
object pointer). From then on the object will be found in the hash and only its identifier
is saved.
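
The checkpoint-side rule above can be sketched in user space as follows.
This is only an illustration, not the kernel code from this patch: the
sk_* names, the tiny bucket count and the pointer hash are assumptions
made for the example (the patch itself uses hlist and hash_long()).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_NBITS	4
#define SK_NBUCKETS	(1u << SK_NBITS)

/* one entry per shared object seen so far (hypothetical layout) */
struct sk_obj {
	int objref;		/* unique id written to the image */
	const void *ptr;	/* lookup key during checkpoint */
	struct sk_obj *next;	/* bucket chain */
};

struct sk_hash {
	struct sk_obj *bucket[SK_NBUCKETS];
	int next_free_objref;	/* starts at 1, so 0 can mean "not found" */
};

static void sk_hash_init(struct sk_hash *h)
{
	unsigned int i;

	for (i = 0; i < SK_NBUCKETS; i++)
		h->bucket[i] = NULL;
	h->next_free_objref = 1;
}

/* toy hash of the object address; stands in for hash_long() */
static unsigned int sk_hash_ptr(const void *ptr)
{
	return (unsigned int)(((uintptr_t)ptr >> 4) & (SK_NBUCKETS - 1));
}

/*
 * Look up @ptr; on first encounter add it and assign a fresh objref.
 * Returns the objref (> 0), or -1 if the allocation fails.
 */
static int sk_lookup_add(struct sk_hash *h, const void *ptr, int *first)
{
	unsigned int i = sk_hash_ptr(ptr);
	struct sk_obj *obj;

	for (obj = h->bucket[i]; obj; obj = obj->next) {
		if (obj->ptr == ptr) {
			*first = 0;	/* seen before: caller saves only the id */
			return obj->objref;
		}
	}

	obj = malloc(sizeof(*obj));
	if (!obj)
		return -1;
	obj->ptr = ptr;
	obj->objref = h->next_free_objref++;
	obj->next = h->bucket[i];
	h->bucket[i] = obj;
	*first = 1;		/* first encounter: caller dumps the state */
	return obj->objref;
}
```

A caller would dump the full object state only when *first is set, and
write just the returned objref otherwise.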

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.
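
The restart-side lookup can be sketched the same way. Again this is a
user-space illustration under assumptions: the rs_* names, the modulo
hash and the error out-parameter are made up for the example, while the
error values mirror the ones visible in the patch (-EINVAL for a
duplicate or unknown objref, -ENOMSG for a type mismatch).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define RS_NBUCKETS	16

/* one restored object, keyed by the objref read from the image */
struct rs_obj {
	int objref;
	int type;
	void *ptr;
	struct rs_obj *next;
};

struct rs_hash {
	struct rs_obj *bucket[RS_NBUCKETS];
};

static struct rs_obj *rs_find(struct rs_hash *h, int objref)
{
	struct rs_obj *obj;

	for (obj = h->bucket[(unsigned int)objref % RS_NBUCKETS]; obj;
	     obj = obj->next)
		if (obj->objref == objref)
			return obj;
	return NULL;
}

/* add a newly restored object; a second object with the same id is refused */
static int rs_insert(struct rs_hash *h, void *ptr, int objref, int type)
{
	unsigned int i = (unsigned int)objref % RS_NBUCKETS;
	struct rs_obj *obj;

	if (rs_find(h, objref))
		return -EINVAL;
	obj = malloc(sizeof(*obj));
	if (!obj)
		return -ENOMEM;
	obj->objref = objref;
	obj->type = type;
	obj->ptr = ptr;
	obj->next = h->bucket[i];
	h->bucket[i] = obj;
	return objref;
}

/* fetch by id; returns NULL and sets *err if missing or of the wrong type */
static void *rs_fetch(struct rs_hash *h, int objref, int type, int *err)
{
	struct rs_obj *obj = rs_find(h, objref);

	if (!obj) {
		*err = -EINVAL;
		return NULL;
	}
	if (obj->type != type) {
		*err = -ENOMSG;
		return NULL;
	}
	*err = 0;
	return obj->ptr;
}
```

The point of the sketch is the key switch: during restart the same
table is indexed by identifier rather than by address.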

The hash is "one-way": objects added to it are never deleted until the
hash is discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.

The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.
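
A minimal user-space sketch of that reference rule is shown below. It
is an assumption-laden illustration, not the patch's implementation:
the demo_* names and the plain integer refcount are hypothetical, and
the drop callback omits the @lastref argument that the patch's
ref_drop() takes.

```c
#include <assert.h>
#include <stdlib.h>

/* per-type callbacks, loosely modelled on struct ckpt_obj_ops */
struct demo_ops {
	int (*ref_grab)(void *ptr);
	void (*ref_drop)(void *ptr);
};

/* a stand-in for some refcounted kernel object */
struct demo_file {
	int refcount;
};

static int demo_file_grab(void *ptr)
{
	((struct demo_file *)ptr)->refcount++;
	return 0;
}

static void demo_file_drop(void *ptr)
{
	((struct demo_file *)ptr)->refcount--;
}

static const struct demo_ops demo_file_ops = {
	.ref_grab = demo_file_grab,
	.ref_drop = demo_file_drop,
};

struct demo_entry {
	void *ptr;
	const struct demo_ops *ops;
	struct demo_entry *next;
};

struct demo_table {
	struct demo_entry *head;
};

/* adding an object takes a reference that the table then owns */
static int demo_add(struct demo_table *t, void *ptr,
		    const struct demo_ops *ops)
{
	struct demo_entry *e = malloc(sizeof(*e));
	int ret;

	if (!e)
		return -1;
	ret = ops->ref_grab(ptr);
	if (ret < 0) {
		free(e);
		return ret;
	}
	e->ptr = ptr;
	e->ops = ops;
	e->next = t->head;
	t->head = e;
	return 0;
}

/* references are released only when the whole table is torn down */
static void demo_free(struct demo_table *t)
{
	struct demo_entry *e = t->head, *next;

	for (; e; e = next) {
		next = e->next;
		e->ops->ref_drop(e->ptr);
		free(e);
	}
	t->head = NULL;
}
```

Because entries are never removed individually, an object found in the
table is pinned for as long as the table exists.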

Changelog[v18]:
  - Add ckpt_obj_reserve()
  - Change ref_drop() to accept a @lastref argument (useful for cleanup)
  - Disallow multiple objects with same objref in restart
  - Allow _ckpt_read_obj_type() to read object header only (w/o payload)
Changelog[v17]:
  - Add ckpt_obj->flags with CKPT_OBJ_CHECKPOINTED flag
  - Add prototype of ckpt_obj_lookup
  - Complain on attempt to add NULL ptr to objhash
  - Prepare for 'leaks detection'
Changelog[v16]:
  - Introduce ckpt_obj_lookup() to find an object by its ptr
Changelog[v14]:
  - Introduce 'struct ckpt_obj_ops' to better modularize shared objs.
  - Replace long 'switch' statements with table lookups and callbacks.
  - Introduce checkpoint_obj() and restart_obj() helpers
  - Shared objects now dumped/saved right before they are referenced
  - Cleanup interface of shared objects
Changelog[v13]:
  - Use hash_long() with 'unsigned long' cast to support 64bit archs
    (Nathan Lynch <ntl@pobox.com>)
Changelog[v11]:
  - Doc: be explicit about grabbing a reference and object lifetime
Changelog[v4]:
  - Fix calculation of hash table size
Changelog[v3]:
  - Use standard hlist_... for hash table

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile              |    1 +
 checkpoint/objhash.c             |  439 ++++++++++++++++++++++++++++++++++++++
 checkpoint/restart.c             |   42 ++++
 checkpoint/sys.c                 |    6 +
 include/linux/checkpoint.h       |   18 ++
 include/linux/checkpoint_hdr.h   |   14 ++
 include/linux/checkpoint_types.h |    2 +
 7 files changed, 522 insertions(+), 0 deletions(-)
 create mode 100644 checkpoint/objhash.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 99364cc..5aa6a75 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -4,6 +4,7 @@
 
 obj-$(CONFIG_CHECKPOINT) += \
 	sys.o \
+	objhash.o \
 	checkpoint.o \
 	restart.o \
 	process.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..e7f5393
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,439 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DOBJ
+
+#include <linux/kernel.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+struct ckpt_obj;
+struct ckpt_obj_ops;
+
+/* object operations */
+struct ckpt_obj_ops {
+	char *obj_name;
+	enum obj_type obj_type;
+	void (*ref_drop)(void *ptr, int lastref);
+	int (*ref_grab)(void *ptr);
+	int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
+	void *(*restore)(struct ckpt_ctx *ctx);
+};
+
+struct ckpt_obj {
+	int objref;
+	int flags;
+	void *ptr;
+	struct ckpt_obj_ops *ops;
+	struct hlist_node hash;
+};
+
+/* object internal flags */
+#define CKPT_OBJ_CHECKPOINTED		0x1   /* object already checkpointed */
+
+struct ckpt_obj_hash {
+	struct hlist_head *head;
+	int next_free_objref;
+};
+
+/* helper grab/drop functions: */
+
+static void obj_no_drop(void *ptr, int lastref)
+{
+	return;
+}
+
+static int obj_no_grab(void *ptr)
+{
+	return 0;
+}
+
+static struct ckpt_obj_ops ckpt_obj_ops[] = {
+	/* ignored object */
+	{
+		.obj_name = "IGNORED",
+		.obj_type = CKPT_OBJ_IGNORE,
+		.ref_drop = obj_no_drop,
+		.ref_grab = obj_no_grab,
+	},
+};
+
+
+#define CKPT_OBJ_HASH_NBITS  10
+#define CKPT_OBJ_HASH_TOTAL  (1UL << CKPT_OBJ_HASH_NBITS)
+
+static void obj_hash_clear(struct ckpt_obj_hash *obj_hash)
+{
+	struct hlist_head *h = obj_hash->head;
+	struct hlist_node *n, *t;
+	struct ckpt_obj *obj;
+	int i;
+
+	for (i = 0; i < CKPT_OBJ_HASH_TOTAL; i++) {
+		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+			obj->ops->ref_drop(obj->ptr, 1);
+			kfree(obj);
+		}
+	}
+}
+
+void ckpt_obj_hash_free(struct ckpt_ctx *ctx)
+{
+	struct ckpt_obj_hash *obj_hash = ctx->obj_hash;
+
+	if (obj_hash) {
+		obj_hash_clear(obj_hash);
+		kfree(obj_hash->head);
+		kfree(ctx->obj_hash);
+		ctx->obj_hash = NULL;
+	}
+}
+
+int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx)
+{
+	struct ckpt_obj_hash *obj_hash;
+	struct hlist_head *head;
+
+	obj_hash = kzalloc(sizeof(*obj_hash), GFP_KERNEL);
+	if (!obj_hash)
+		return -ENOMEM;
+	head = kzalloc(CKPT_OBJ_HASH_TOTAL * sizeof(*head), GFP_KERNEL);
+	if (!head) {
+		kfree(obj_hash);
+		return -ENOMEM;
+	}
+
+	obj_hash->head = head;
+	obj_hash->next_free_objref = 1;
+
+	ctx->obj_hash = obj_hash;
+	return 0;
+}
+
+static struct ckpt_obj *obj_find_by_ptr(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct ckpt_obj *obj;
+
+	h = &ctx->obj_hash->head[hash_long((unsigned long) ptr,
+					   CKPT_OBJ_HASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->ptr == ptr)
+			return obj;
+	return NULL;
+}
+
+static struct ckpt_obj *obj_find_by_objref(struct ckpt_ctx *ctx, int objref)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct ckpt_obj *obj;
+
+	h = &ctx->obj_hash->head[hash_long((unsigned long) objref,
+					   CKPT_OBJ_HASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->objref == objref)
+			return obj;
+	return NULL;
+}
+
+static inline int obj_alloc_objref(struct ckpt_ctx *ctx)
+{
+	return ctx->obj_hash->next_free_objref++;
+}
+
+/**
+ * obj_new - add an object to the obj_hash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: object unique id
+ * @type: object type (selects the object operations)
+ *
+ * Add the object to the obj_hash. If @objref is zero, assign a unique
+ * object id and use @ptr as a hash key [checkpoint]. Else use @objref
+ * as a key [restart].
+ */
+static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void *ptr,
+				int objref, enum obj_type type)
+{
+	struct ckpt_obj_ops *ops = &ckpt_obj_ops[type];
+	struct ckpt_obj *obj;
+	int i, ret;
+
+	/* explicitly disallow null pointers */
+	BUG_ON(!ptr);
+	/* make sure we don't change this accidentally */
+	BUG_ON(ops->obj_type != type);
+
+	obj = kzalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return ERR_PTR(-ENOMEM);
+
+	obj->ptr = ptr;
+	obj->ops = ops;
+
+	if (!objref) {
+		/* use @obj->ptr to index, assign objref (checkpoint) */
+		obj->objref = obj_alloc_objref(ctx);
+		i = hash_long((unsigned long) ptr, CKPT_OBJ_HASH_NBITS);
+	} else {
+		/* use @obj->objref to index (restart) */
+		obj->objref = objref;
+		i = hash_long((unsigned long) objref, CKPT_OBJ_HASH_NBITS);
+	}
+
+	ret = ops->ref_grab(obj->ptr);
+	if (ret < 0) {
+		kfree(obj);
+		obj = ERR_PTR(ret);
+	} else {
+		hlist_add_head(&obj->hash, &ctx->obj_hash->head[i]);
+	}
+
+	return obj;
+}
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * obj_lookup_add - lookup object and add if not in objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ * @first: [output] first encounter (added to table)
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it isn't
+ * already found there, add the object, and allocate a unique object
+ * id. Grab a reference to every object that is added, and maintain the
+ * reference until the entire hash is freed.
+ */
+static struct ckpt_obj *obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+				       enum obj_type type, int *first)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_ptr(ctx, ptr);
+	if (!obj) {
+		obj = obj_new(ctx, ptr, 0, type);
+		*first = 1;
+	} else {
+		BUG_ON(obj->ops->obj_type != type);
+		*first = 0;
+	}
+	return obj;
+}
+
+/**
+ * ckpt_obj_lookup - lookup object (by pointer) in objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * [used during checkpoint].
+ * Return: objref (or zero if not found)
+ */
+int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_ptr(ctx, ptr);
+	BUG_ON(obj && obj->ops->obj_type != type);
+	if (obj)
+		ckpt_debug("%s objref %d\n", obj->ops->obj_name, obj->objref);
+	return obj ? obj->objref : 0;
+}
+
+/**
+ * ckpt_obj_lookup_add - lookup object and add if not in objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ * @first: [output] first encounter (added to table)
+ *
+ * [used during checkpoint].
+ * Return: objref
+ */
+int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+			enum obj_type type, int *first)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_lookup_add(ctx, ptr, type, first);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+	ckpt_debug("%s objref %d first %d\n",
+		   obj->ops->obj_name, obj->objref, *first);
+	obj->flags |= CKPT_OBJ_CHECKPOINTED;
+	return obj->objref;
+}
+
+/**
+ * ckpt_obj_reserve - reserve an objref
+ * @ctx: checkpoint context
+ *
+ * The reserved objref will not be used for subsequent objects. This
+ * gives an objref that can be safely used during restart without a
+ * matching object in checkpoint.  [used during checkpoint].
+ */
+int ckpt_obj_reserve(struct ckpt_ctx *ctx)
+{
+	return obj_alloc_objref(ctx);
+}
+
+/**
+ * checkpoint_obj - if not already in hash, add object and checkpoint
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * Use obj_lookup_add() to lookup (and possibly add) the object to the
+ * hash table. If the CKPT_OBJ_CHECKPOINTED flag isn't set, then also
+ * save the object's state using its ops->checkpoint().
+ *
+ * [This is used during checkpoint].
+ * Returns: objref
+ */
+int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+	struct ckpt_hdr_objref *h;
+	struct ckpt_obj *obj;
+	int new, ret = 0;
+
+	obj = obj_lookup_add(ctx, ptr, type, &new);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+
+	if (!(obj->flags & CKPT_OBJ_CHECKPOINTED)) {
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_OBJREF);
+		if (!h)
+			return -ENOMEM;
+
+		h->objtype = type;
+		h->objref = obj->objref;
+		ret = ckpt_write_obj(ctx, &h->h);
+		ckpt_hdr_put(ctx, h);
+
+		if (ret < 0)
+			return ret;
+
+		/* invoke callback to actually dump the state */
+		if (obj->ops->checkpoint)
+			ret = obj->ops->checkpoint(ctx, ptr);
+
+		obj->flags |= CKPT_OBJ_CHECKPOINTED;
+	}
+	return (ret < 0 ? ret : obj->objref);
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * restore_obj - read in and restore a (first seen) shared object
+ * @ctx: checkpoint context
+ * @h: ckpt_hdr of shared object
+ *
+ * Read in the header payload (struct ckpt_hdr_objref). Lookup the
+ * object to verify it isn't there.  Then restore the object's state
+ * and add it to the objhash. No need to explicitly grab a reference -
+ * we hold the initial instance of this object. (Object maintained
+ * until the entire hash is free).
+ *
+ * [This is used during restart].
+ */
+int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h)
+{
+	struct ckpt_obj_ops *ops;
+	struct ckpt_obj *obj;
+	void *ptr = NULL;
+
+	ckpt_debug("len %d ref %d type %d\n", h->h.len, h->objref, h->objtype);
+	if (h->objtype >= CKPT_OBJ_MAX)
+		return -EINVAL;
+
+	ops = &ckpt_obj_ops[h->objtype];
+	BUG_ON(ops->obj_type != h->objtype);
+
+	if (ops->restore)
+		ptr = ops->restore(ctx);
+	if (IS_ERR(ptr))
+		return PTR_ERR(ptr);
+
+	if (obj_find_by_objref(ctx, h->objref))
+		obj = ERR_PTR(-EINVAL);
+	else
+		obj = obj_new(ctx, ptr, h->objref, h->objtype);
+	/*
+	 * Drop an extra reference to the object returned by ops->restore:
+	 * On success, this clears the extra reference taken by obj_new(),
+	 * and on failure, this cleans up the object itself.
+	 */
+	ops->ref_drop(ptr, 0);
+	if (IS_ERR(obj)) {
+		ops->ref_drop(ptr, 1);
+		return PTR_ERR(obj);
+	}
+	return obj->objref;
+}
+
+/**
+ * ckpt_obj_insert - add an object with a given objref to obj_hash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object id
+ * @type: object type
+ *
+ * Add the object pointed to by @ptr and identified by unique object id
+ * @objref to the hash table (indexed by @objref).  Grab a reference to
+ * every object added, and maintain it until the entire hash is freed.
+ *
+ * [This is used during restart].
+ */
+int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr,
+		    int objref, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+
+	if (obj_find_by_objref(ctx, objref))
+		return -EINVAL;
+	obj = obj_new(ctx, ptr, objref, type);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+	ckpt_debug("%s objref %d\n", obj->ops->obj_name, objref);
+	return obj->objref;
+}
+
+/**
+ * ckpt_obj_fetch - fetch an object by its identifier
+ * @ctx: checkpoint context
+ * @objref: object id
+ * @type: object type
+ *
+ * Look up the object identified by @objref in the hash table. Return
+ * an error if not found.
+ *
+ * [This is used during restart].
+ */
+void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_objref(ctx, objref);
+	if (!obj)
+		return ERR_PTR(-EINVAL);
+	ckpt_debug("%s ref %d\n", obj->ops->obj_name, obj->objref);
+	return (obj->ops->obj_type == type ? obj->ptr : ERR_PTR(-ENOMSG));
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index d43eec7..73db44a 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -47,6 +47,34 @@ static int _ckpt_read_err(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
 }
 
 /**
+ * _ckpt_read_objref - dispatch handling of a shared object
+ * @ctx: checkpoint context
+ * @hh: object descriptor
+ */
+static int _ckpt_read_objref(struct ckpt_ctx *ctx, struct ckpt_hdr *hh)
+{
+	struct ckpt_hdr *h;
+	int ret;
+
+	h = ckpt_hdr_get(ctx, hh->len);
+	if (!h)
+		return -ENOMEM;
+
+	*h = *hh;	/* yay ! */
+
+	_ckpt_debug(CKPT_DOBJ, "shared len %d type %d\n", h->len, h->type);
+	ret = ckpt_kread(ctx, (h + 1), hh->len - sizeof(struct ckpt_hdr));
+	if (ret < 0)
+		goto out;
+
+	ret = restore_obj(ctx, (struct ckpt_hdr_objref *) h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+
+/**
  * _ckpt_read_obj - read an object (ckpt_hdr followed by payload)
  * @ctx: checkpoint context
  * @h: desired ckpt_hdr
@@ -75,6 +103,11 @@ static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
 		if (ret < 0)
 			return ret;
 		goto again;
+	} else if (h->type == CKPT_HDR_OBJREF) {
+		ret = _ckpt_read_objref(ctx, h);
+		if (ret < 0)
+			return ret;
+		goto again;
 	}
 
 	/* if len specified, enforce, else if maximum specified, enforce */
@@ -164,6 +197,7 @@ static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
 	struct ckpt_hdr *h;
 	int ret;
 
+ again:
 	ret = ckpt_kread(ctx, &hh, sizeof(hh));
 	if (ret < 0)
 		return ERR_PTR(ret);
@@ -171,6 +205,14 @@ static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
 		    hh.type, hh.len, len, max);
 	if (hh.len < sizeof(*h))
 		return ERR_PTR(-EINVAL);
+
+	if (hh.type == CKPT_HDR_OBJREF) {
+		ret = _ckpt_read_objref(ctx, &hh);
+		if (ret < 0)
+			return ERR_PTR(ret);
+		goto again;
+	}
+
 	/* if len specified, enforce, else if maximum specified, enforce */
 	if ((len && hh.len != len) || (!len && max && hh.len > max))
 		return ERR_PTR(-EINVAL);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index c8921f0..d16d48f 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -194,6 +194,8 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->file)
 		fput(ctx->file);
 
+	ckpt_obj_hash_free(ctx);
+
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
 
@@ -231,6 +233,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (!ctx->file)
 		goto err;
 
+	err = -ENOMEM;
+	if (ckpt_obj_hash_alloc(ctx) < 0)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 5c02d9b..1315884 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -89,6 +89,23 @@ static inline void ckpt_set_ctx_error(struct ckpt_ctx *ctx, int errno)
 #define ckpt_test_ctx_complete(ctx)  \
 	((ctx)->kflags & (CKPT_CTX_SUCCESS | CKPT_CTX_ERROR))
 
+/* obj_hash */
+extern void ckpt_obj_hash_free(struct ckpt_ctx *ctx);
+extern int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx);
+
+extern int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h);
+extern int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr,
+			  enum obj_type type);
+extern int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr,
+			   enum obj_type type);
+extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+			       enum obj_type type, int *first);
+extern void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref,
+			    enum obj_type type);
+extern int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr, int objref,
+			   enum obj_type type);
+extern int ckpt_obj_reserve(struct ckpt_ctx *ctx);
+
 extern void ckpt_ctx_get(struct ckpt_ctx *ctx);
 extern void ckpt_ctx_put(struct ckpt_ctx *ctx);
 
@@ -122,6 +139,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DBASE	0x1		/* anything */
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
+#define CKPT_DOBJ	0x8		/* shared objects */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b56fe71..7a4015b 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -50,6 +50,7 @@ enum {
 	CKPT_HDR_HEADER_ARCH,
 	CKPT_HDR_BUFFER,
 	CKPT_HDR_STRING,
+	CKPT_HDR_OBJREF,
 
 	CKPT_HDR_TREE = 101,
 	CKPT_HDR_TASK,
@@ -69,6 +70,19 @@ enum {
 	CKPT_ARCH_X86_32 = 1,
 };
 
+/* shared objects (objref) */
+struct ckpt_hdr_objref {
+	struct ckpt_hdr h;
+	__u32 objtype;
+	__s32 objref;
+} __attribute__((aligned(8)));
+
+/* shared objects types */
+enum obj_type {
+	CKPT_OBJ_IGNORE = 0,
+	CKPT_OBJ_MAX
+};
+
 /* kernel constants */
 struct ckpt_const {
 	/* task */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index f74deac..f11fd07 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -38,6 +38,8 @@ struct ckpt_ctx {
 
 	atomic_t refcount;
 
+	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
 	int errno;		/* restart: errno that caused failure */
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 31/80] c/r: detect resource leaks for whole-container checkpoint
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (29 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 30/80] c/r: infrastructure for shared objects Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 32/80] deferqueue: generic queue to defer work Oren Laadan
                     ` (50 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
checkpoint, return an error code if the actual objects' counts are
higher, indicating leaks (references to the objects from a task not
being checkpointed).

The comparison of the objhash user counts to object refcounts as a
basis for checking for leaks comes from Alexey's OpenVZ-based c/r
patchset.

"Leak detection" occurs _before_ any real state is saved, as a
pre-step. This prevents races in which sharing with the outside world
ceases before the leak test takes place, thus protecting the
checkpoint image from inconsistencies.

Once leak testing concludes, checkpoint will proceed. Because objects
are already in the objhash, checkpoint_obj() cannot distinguish
between the first and subsequent encounters. This is solved with a
flag (CKPT_OBJ_CHECKPOINTED) per object.

Two additional checks take place during checkpoint: one for objects
that were created, and one for objects that were destroyed, while the
leak-detection pre-step took place. (By the time these checks run, part
of the checkpoint image has already been written out to disk, so they
are purely advisory.)
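
The counting scheme can be illustrated with a small user-space sketch.
It is not the patch's code: the sketch_* names are hypothetical, and a
plain array stands in for the objhash. As in the patch, a new entry
starts at two users (the first referencing task plus the reference the
table itself takes), each later encounter adds one, and the container
is considered leak-free only if every count equals the object's real
reference count.

```c
/* User-space sketch only; every sketch_* name here is hypothetical. */
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX 8

/* Stand-in for a shared kernel object with a reference count. */
struct sketch_res {
	int refcount;
};

struct sketch_entry {
	struct sketch_res *res;
	int users;		/* references seen from inside the container */
};

struct sketch_table {
	struct sketch_entry e[SKETCH_MAX];
	size_t n;
};

/*
 * Collect one reference held by a task inside the container.
 * Returns 1 if the object is new, 0 if already known, -1 if full.
 */
static int sketch_collect(struct sketch_table *t, struct sketch_res *res)
{
	for (size_t i = 0; i < t->n; i++) {
		if (t->e[i].res == res) {
			t->e[i].users++;
			return 0;
		}
	}
	if (t->n == SKETCH_MAX)
		return -1;

	res->refcount++;		/* the table pins the object */
	t->e[t->n].res = res;
	t->e[t->n].users = 2;		/* first user + the table's own ref */
	t->n++;
	return 1;
}

/* Returns 1 if all counts match (no leak), 0 otherwise. */
static int sketch_contained(const struct sketch_table *t)
{
	for (size_t i = 0; i < t->n; i++) {
		if (t->e[i].res->refcount != t->e[i].users)
			return 0;	/* someone outside holds a reference */
	}
	return 1;
}
```

For example, an object referenced by two tasks that are both collected
passes the test, while an object with two references of which only one
is collected is reported as leaked.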

Changelog[v18]:
  - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
  - Replace some EAGAIN with EBUSY
  - Add a few more ckpt_write_err()s
  - Introduce CKPT_OBJ_VISITED
  - ckpt_obj_collect() returns objref for new objects, 0 otherwise
  - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
  - Introduce ckpt_obj_visit() to mark objects as visited
  - Set the CHECKPOINTED flag on objects before calling checkpoint
Changelog[v17]:
  - Leak detection is performed in two-steps
  - Detect reverse-leaks (objects disappearing unexpectedly)
  - Skip reverse-leak detection if ops->ref_users isn't defined

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c    |   41 ++++++++++
 checkpoint/objhash.c       |  188 +++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/process.c       |    5 +
 include/linux/checkpoint.h |    7 ++
 4 files changed, 237 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 93d7860..c21646d 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -473,6 +473,24 @@ static int checkpoint_pids(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int collect_objects(struct ckpt_ctx *ctx)
+{
+	int n, ret = 0;
+
+	for (n = 0; n < ctx->nr_tasks; n++) {
+		ckpt_debug("dumping task #%d\n", n);
+		ret = ckpt_collect_task(ctx, ctx->tasks_arr[n]);
+		if (ret < 0) {
+			ctx->tsk = ctx->tasks_arr[n];
+			ckpt_write_err(ctx, "TE", "collect failed", ret);
+			ctx->tsk = NULL;
+			break;
+		}
+	}
+
+	return ret;
+}
+
 /* count number of tasks in tree (and optionally fill pid's in array) */
 static int tree_count_tasks(struct ckpt_ctx *ctx)
 {
@@ -719,6 +737,21 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		goto out;
 
+	if (!(ctx->uflags & CHECKPOINT_SUBTREE)) {
+		/*
+		 * Verify that all objects are contained (no leaks):
+		 * First collect them all into the objhash while counting users
+		 * and then compare to the objects' real user counts.
+		 */
+		ret = collect_objects(ctx);
+		if (ret < 0)
+			goto out;
+		if (!ckpt_obj_contained(ctx)) {
+			ret = -EBUSY;
+			goto out;
+		}
+	}
+
 	ret = checkpoint_write_header(ctx);
 	if (ret < 0)
 		goto out;
@@ -728,6 +761,14 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	ret = checkpoint_all_tasks(ctx);
 	if (ret < 0)
 		goto out;
+
+	/* verify that all objects were indeed visited */
+	if (!ckpt_obj_visited(ctx)) {
+		ckpt_write_err(ctx, "E", "leak: unvisited", -EBUSY);
+		ret = -EBUSY;
+		goto out;
+	}
+
 	ret = checkpoint_write_tail(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index e7f5393..dd1f3e5 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -25,27 +25,32 @@ struct ckpt_obj_ops {
 	enum obj_type obj_type;
 	void (*ref_drop)(void *ptr, int lastref);
 	int (*ref_grab)(void *ptr);
+	int (*ref_users)(void *ptr);
 	int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
 	void *(*restore)(struct ckpt_ctx *ctx);
 };
 
 struct ckpt_obj {
+	int users;
 	int objref;
 	int flags;
 	void *ptr;
 	struct ckpt_obj_ops *ops;
 	struct hlist_node hash;
+	struct hlist_node next;
 };
 
 /* object internal flags */
 #define CKPT_OBJ_CHECKPOINTED		0x1   /* object already checkpointed */
+#define CKPT_OBJ_VISITED		0x2   /* object already visited */
 
 struct ckpt_obj_hash {
 	struct hlist_head *head;
+	struct hlist_head list;
 	int next_free_objref;
 };
 
-/* helper grab/drop functions: */
+/* helper grab/drop/users functions */
 
 static void obj_no_drop(void *ptr, int lastref)
 {
@@ -114,6 +119,7 @@ int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx)
 
 	obj_hash->head = head;
 	obj_hash->next_free_objref = 1;
+	INIT_HLIST_HEAD(&obj_hash->list);
 
 	ctx->obj_hash = obj_hash;
 	return 0;
@@ -181,6 +187,7 @@ static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void *ptr,
 
 	obj->ptr = ptr;
 	obj->ops = ops;
+	obj->users = 2;  /* extra reference that objhash itself takes */
 
 	if (!objref) {
 		/* use @obj->ptr to index, assign objref (checkpoint) */
@@ -198,6 +205,7 @@ static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void *ptr,
 		obj = ERR_PTR(ret);
 	} else {
 		hlist_add_head(&obj->hash, &ctx->obj_hash->head[i]);
+		hlist_add_head(&obj->next, &ctx->obj_hash->list);
 	}
 
 	return obj;
@@ -230,12 +238,35 @@ static struct ckpt_obj *obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
 		*first = 1;
 	} else {
 		BUG_ON(obj->ops->obj_type != type);
+		obj->users++;
 		*first = 0;
 	}
 	return obj;
 }
 
 /**
+ * ckpt_obj_collect - collect object into objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * [used during checkpoint].
+ * Return: objref if object is new, 0 otherwise, or an error
+ */
+int ckpt_obj_collect(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+	int first;
+
+	obj = obj_lookup_add(ctx, ptr, type, &first);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+	ckpt_debug("%s objref %d first %d\n",
+		   obj->ops->obj_name, obj->objref, first);
+	return (first ? obj->objref : 0);
+}
+
+/**
  * ckpt_obj_lookup - lookup object (by pointer) in objhash
  * @ctx: checkpoint context
  * @ptr: pointer to object
@@ -255,6 +286,21 @@ int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
 	return obj ? obj->objref : 0;
 }
 
+static inline int obj_reverse_leak(struct ckpt_ctx *ctx, struct ckpt_obj *obj)
+{
+	/*
+	 * A "reverse" leak ?  All objects should already be in the
+	 * objhash by now. But an outside task may have created an
+	 * object while we were collecting, which we didn't catch.
+	 */
+	if (obj->ops->ref_users && !(ctx->uflags & CHECKPOINT_SUBTREE)) {
+		ckpt_write_err(ctx, "OP", "leak: reverse added late (%s)",
+			       obj->objref, obj->ptr, obj->ops->obj_name);
+		return -EBUSY;
+	}
+	return 0;
+}
+
 /**
  * ckpt_obj_lookup_add - lookup object and add if not in objhash
  * @ctx: checkpoint context
@@ -275,7 +321,11 @@ int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
 		return PTR_ERR(obj);
 	ckpt_debug("%s objref %d first %d\n",
 		   obj->ops->obj_name, obj->objref, *first);
-	obj->flags |= CKPT_OBJ_CHECKPOINTED;
+
+	if (*first && obj_reverse_leak(ctx, obj))
+		return -EBUSY;
+
+	obj->flags |= CKPT_OBJ_VISITED;
 	return obj->objref;
 }
 
@@ -315,6 +365,9 @@ int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
 	if (IS_ERR(obj))
 		return PTR_ERR(obj);
 
+	if (new && obj_reverse_leak(ctx, obj))
+		return -EBUSY;
+
 	if (!(obj->flags & CKPT_OBJ_CHECKPOINTED)) {
 		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_OBJREF);
 		if (!h)
@@ -329,14 +382,141 @@ int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
 			return ret;
 
 		/* invoke callback to actually dump the state */
-		if (obj->ops->checkpoint)
-			ret = obj->ops->checkpoint(ctx, ptr);
+		BUG_ON(!obj->ops->checkpoint);
 
 		obj->flags |= CKPT_OBJ_CHECKPOINTED;
+		ret = obj->ops->checkpoint(ctx, ptr);
 	}
+
+	obj->flags |= CKPT_OBJ_VISITED;
 	return (ret < 0 ? ret : obj->objref);
 }
 
+/**
+ * ckpt_obj_visit - mark object as visited
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * [used during checkpoint].
+ * Mark the object as visited, or fail if not found
+ */
+int ckpt_obj_visit(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_ptr(ctx, ptr);
+	BUG_ON(obj && obj->ops->obj_type != type);
+
+	if (!obj) {
+		if (!(ctx->uflags & CHECKPOINT_SUBTREE)) {
+			/* if not found report reverse leak (full container) */
+			ckpt_write_err(ctx, "OP", "leak: reverse unknown (%s)",
+				       0, ptr,
+				       ckpt_obj_ops[type].obj_name);
+			return -EBUSY;
+		}
+	} else {
+		ckpt_debug("visit %s objref %d\n",
+			   obj->ops->obj_name, obj->objref);
+		obj->flags |= CKPT_OBJ_VISITED;
+	}
+	return 0;
+}
+
+/* increment the 'users' count of an object */
+static void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_ptr(ctx, ptr);
+	if (obj)
+		obj->users += increment;
+}
+
+/*
+ * "Leak detection" - to guarantee a consistent checkpoint of a full
+ * container we verify that all resources are confined and isolated in
+ * that container:
+ *
+ * c/r code first walks through all tasks and collects all shared
+ * resources into the objhash, while counting the references to them;
+ * then, it compares this count to the object's real reference count,
+ * and if they don't match it means that an object has "leaked" to the
+ * outside.
+ *
+ * Otherwise, it is guaranteed that there are no references outside
+ * (of container). c/r code now proceeds to walk through all tasks,
+ * again, and checkpoints the resources. It ensures that all resources
+ * are already in the objhash, and that all of them are checkpointed.
+ * Otherwise it means that due to a race, an object was created or
+ * destroyed during the first walk but not accounted for.
+ *
+ * For instance, consider an outside task A that shared files_struct
+ * with inside task B. Then, after B's files were collected, A opens
+ * or closes a file, and immediately exits - before the first leak
+ * test is performed, such that the test passes.
+ */
+
+/**
+ * ckpt_obj_contained - test if shared objects are contained in checkpoint
+ * @ctx: checkpoint context
+ *
+ * Loops through all objects in the table and compares the number of
+ * references accumulated during checkpoint, with the reference count
+ * reported by the kernel.
+ *
+ * Return 1 if respective counts match for all objects, 0 otherwise.
+ */
+int ckpt_obj_contained(struct ckpt_ctx *ctx)
+{
+	struct ckpt_obj *obj;
+	struct hlist_node *node;
+
+	/* account for ctx->file reference (if in the table already) */
+	ckpt_obj_users_inc(ctx, ctx->file, 1);
+
+	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
+		if (!obj->ops->ref_users)
+			continue;
+		if (obj->ops->ref_users(obj->ptr) != obj->users) {
+			ckpt_debug("usage leak: %s\n", obj->ops->obj_name);
+			ckpt_write_err(ctx, "OP", "leak: usage (%d != %d) (%s)",
+				       obj->objref, obj->ptr,
+				       obj->ops->ref_users(obj->ptr),
+				       obj->users, obj->ops->obj_name);
+			return 0;
+		}
+	}
+
+	return 1;
+}
+
+/**
+ * ckpt_obj_visited - test that all shared objects were visited
+ * @ctx: checkpoint context
+ *
+ * Return 1 if all objects were visited, 0 otherwise.
+ */
+int ckpt_obj_visited(struct ckpt_ctx *ctx)
+{
+	struct ckpt_obj *obj;
+	struct hlist_node *node;
+
+	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
+		if (!(obj->flags & CKPT_OBJ_VISITED)) {
+			ckpt_debug("reverse leak: %s (%d)\n",
+				   obj->ops->obj_name, obj->objref);
+			ckpt_write_err(ctx, "OP", "leak: not visited (%s)",
+				       obj->objref, obj->ptr,
+				       obj->ops->obj_name);
+			return 0;
+		}
+	}
+
+	return 1;
+}
+
 /**************************************************************************
  * Restart
  */
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 2580b31..5e690d3 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -246,6 +246,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
 /***********************************************************************
  * Restart
  */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 1315884..b698b19 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -96,6 +96,12 @@ extern int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx);
 extern int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h);
 extern int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr,
 			  enum obj_type type);
+extern int ckpt_obj_collect(struct ckpt_ctx *ctx, void *ptr,
+			    enum obj_type type);
+extern int ckpt_obj_contained(struct ckpt_ctx *ctx);
+extern int ckpt_obj_visited(struct ckpt_ctx *ctx);
+extern int ckpt_obj_visit(struct ckpt_ctx *ctx, void *ptr,
+			  enum obj_type type);
 extern int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr,
 			   enum obj_type type);
 extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
@@ -114,6 +120,7 @@ extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
 
 /* task */
 extern int ckpt_activate_next(struct ckpt_ctx *ctx);
+extern int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task(struct ckpt_ctx *ctx);
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 31/80] c/r: detect resource leaks for whole-container checkpoint
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
checkpoint, return an error code if the actual objects' counts are
higher, indicating leaks (references to the objects from a task not
being checkpointed).

The comparison of the objhash user counts to object refcounts as a
basis for checking for leaks comes from Alexey's OpenVZ-based c/r
patchset.
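
As an illustration of that comparison, here is a minimal userspace
sketch in C. It is not code from this patch: the toy_* names and both
structures are hypothetical stand-ins for an objhash entry and a
refcounted kernel object, and a plain integer plays the role of the
object's real reference count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted kernel object. */
struct toy_object {
	int refcount;		/* the "real" reference count */
};

/* Hypothetical stand-in for an objhash entry. */
struct toy_entry {
	struct toy_object *ptr;
	int users;		/* references seen while collecting */
};

/* First encounter: one user, plus the reference the table itself takes. */
static void toy_collect_first(struct toy_entry *e, struct toy_object *o)
{
	assert(e != NULL && o != NULL);
	o->refcount++;		/* the table grabs its own reference */
	e->ptr = o;
	e->users = 2;
}

/* Any later encounter of the same object through another task. */
static void toy_collect_again(struct toy_entry *e)
{
	e->users++;
}

/* Return 1 if all counts match, 0 if some object is also used outside. */
static int toy_contained(const struct toy_entry *tab, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].ptr->refcount != tab[i].users)
			return 0;
	return 1;
}

/* Two inside tasks share one object; @outside_refs are held elsewhere. */
static int toy_demo(int outside_refs)
{
	struct toy_object files = { .refcount = 2 + outside_refs };
	struct toy_entry e;

	toy_collect_first(&e, &files);
	toy_collect_again(&e);
	return toy_contained(&e, 1);
}
```

In this sketch toy_demo(0) reports the object as contained, while
toy_demo(1) reports a leak, because one reference is never accounted
for by the collecting walk.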

"Leak detection" occurs _before_ any real state is saved, as a
pre-step. This prevents races due to sharing with the outside world
where the sharing ceases before the leak test takes place, thus
protecting the checkpoint image from inconsistencies.

Once leak testing concludes, checkpoint will proceed. Because objects
are already in the objhash, checkpoint_obj() cannot distinguish
between the first and subsequent encounters. This is solved with a
flag (CKPT_OBJ_CHECKPOINTED) per object.

Two additional checks take place during checkpoint: one for objects
that were created, and one for objects that were destroyed, while the
leak-detection pre-step took place. (By the time these checks run,
part of the checkpoint image has already been written out to disk, so
they are purely advisory.)
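
The interplay of the two per-object flags can be sketched in userspace
C as follows. This is only an illustration under simplifying
assumptions: the flag_* names are hypothetical, the entries are assumed
to have been added by the pre-step already, and "writing state" is
reduced to bumping a counter.

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_CHECKPOINTED	0x1	/* state was already written once */
#define FLAG_VISITED		0x2	/* seen during the checkpoint walk */

/* Hypothetical stand-in for an objhash entry filled by the pre-step. */
struct flag_entry {
	int flags;
	int dumps;		/* how many times state was written */
};

/* Checkpoint walk: write state on first encounter only, always mark. */
static void flag_checkpoint(struct flag_entry *e)
{
	assert(e != NULL);
	if (!(e->flags & FLAG_CHECKPOINTED)) {
		e->flags |= FLAG_CHECKPOINTED;
		e->dumps++;
	}
	e->flags |= FLAG_VISITED;
}

/* Return 1 if every collected entry was visited, 0 otherwise. */
static int flag_all_visited(const struct flag_entry *tab, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!(tab[i].flags & FLAG_VISITED))
			return 0;
	return 1;
}

/* An entry reached twice during the walk is still written only once. */
static int flag_demo_dumps(void)
{
	struct flag_entry e = { 0, 0 };

	flag_checkpoint(&e);
	flag_checkpoint(&e);
	return e.dumps;
}

/* Two entries were collected; the second may vanish before the walk. */
static int flag_demo_visited(int reach_second)
{
	struct flag_entry tab[2] = { { 0, 0 }, { 0, 0 } };

	flag_checkpoint(&tab[0]);
	if (reach_second)
		flag_checkpoint(&tab[1]);
	return flag_all_visited(tab, 2);
}
```

An entry that was collected but never reached again (for example
because it was destroyed in between) fails the final visited test,
which corresponds to the second of the two checks described above.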

Changelog[v18]:
  - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
  - Replace some EAGAIN with EBUSY
  - Add a few more ckpt_write_err()s
  - Introduce CKPT_OBJ_VISITED
  - ckpt_obj_collect() returns objref for new objects, 0 otherwise
  - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
  - Introduce ckpt_obj_visit() to mark objects as visited
  - Set the CHECKPOINTED flag on objects before calling checkpoint
Changelog[v17]:
  - Leak detection is performed in two-steps
  - Detect reverse-leaks (objects disappearing unexpectedly)
  - Skip reverse-leak detection if ops->ref_users isn't defined

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c    |   41 ++++++++++
 checkpoint/objhash.c       |  188 +++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/process.c       |    5 +
 include/linux/checkpoint.h |    7 ++
 4 files changed, 237 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 93d7860..c21646d 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -473,6 +473,24 @@ static int checkpoint_pids(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int collect_objects(struct ckpt_ctx *ctx)
+{
+	int n, ret = 0;
+
+	for (n = 0; n < ctx->nr_tasks; n++) {
+		ckpt_debug("dumping task #%d\n", n);
+		ret = ckpt_collect_task(ctx, ctx->tasks_arr[n]);
+		if (ret < 0) {
+			ctx->tsk = ctx->tasks_arr[n];
+			ckpt_write_err(ctx, "TE", "collect failed", ret);
+			ctx->tsk = NULL;
+			break;
+		}
+	}
+
+	return ret;
+}
+
 /* count number of tasks in tree (and optionally fill pid's in array) */
 static int tree_count_tasks(struct ckpt_ctx *ctx)
 {
@@ -719,6 +737,21 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		goto out;
 
+	if (!(ctx->uflags & CHECKPOINT_SUBTREE)) {
+		/*
+		 * Verify that all objects are contained (no leaks):
+		 * First collect them all into the objhash while counting
+		 * users, and then compare to the objects' real user counts.
+		 */
+		ret = collect_objects(ctx);
+		if (ret < 0)
+			goto out;
+		if (!ckpt_obj_contained(ctx)) {
+			ret = -EBUSY;
+			goto out;
+		}
+	}
+
 	ret = checkpoint_write_header(ctx);
 	if (ret < 0)
 		goto out;
@@ -728,6 +761,14 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	ret = checkpoint_all_tasks(ctx);
 	if (ret < 0)
 		goto out;
+
+	/* verify that all objects were indeed visited */
+	if (!ckpt_obj_visited(ctx)) {
+		ckpt_write_err(ctx, "E", "leak: unvisited", -EBUSY);
+		ret = -EBUSY;
+		goto out;
+	}
+
 	ret = checkpoint_write_tail(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index e7f5393..dd1f3e5 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -25,27 +25,32 @@ struct ckpt_obj_ops {
 	enum obj_type obj_type;
 	void (*ref_drop)(void *ptr, int lastref);
 	int (*ref_grab)(void *ptr);
+	int (*ref_users)(void *ptr);
 	int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
 	void *(*restore)(struct ckpt_ctx *ctx);
 };
 
 struct ckpt_obj {
+	int users;
 	int objref;
 	int flags;
 	void *ptr;
 	struct ckpt_obj_ops *ops;
 	struct hlist_node hash;
+	struct hlist_node next;
 };
 
 /* object internal flags */
 #define CKPT_OBJ_CHECKPOINTED		0x1   /* object already checkpointed */
+#define CKPT_OBJ_VISITED		0x2   /* object already visited */
 
 struct ckpt_obj_hash {
 	struct hlist_head *head;
+	struct hlist_head list;
 	int next_free_objref;
 };
 
-/* helper grab/drop functions: */
+/* helper grab/drop/users functions */
 
 static void obj_no_drop(void *ptr, int lastref)
 {
@@ -114,6 +119,7 @@ int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx)
 
 	obj_hash->head = head;
 	obj_hash->next_free_objref = 1;
+	INIT_HLIST_HEAD(&obj_hash->list);
 
 	ctx->obj_hash = obj_hash;
 	return 0;
@@ -181,6 +187,7 @@ static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void *ptr,
 
 	obj->ptr = ptr;
 	obj->ops = ops;
+	obj->users = 2;  /* extra reference that objhash itself takes */
 
 	if (!objref) {
 		/* use @obj->ptr to index, assign objref (checkpoint) */
@@ -198,6 +205,7 @@ static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void *ptr,
 		obj = ERR_PTR(ret);
 	} else {
 		hlist_add_head(&obj->hash, &ctx->obj_hash->head[i]);
+		hlist_add_head(&obj->next, &ctx->obj_hash->list);
 	}
 
 	return obj;
@@ -230,12 +238,35 @@ static struct ckpt_obj *obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
 		*first = 1;
 	} else {
 		BUG_ON(obj->ops->obj_type != type);
+		obj->users++;
 		*first = 0;
 	}
 	return obj;
 }
 
 /**
+ * ckpt_obj_collect - collect object into objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * [used during checkpoint].
+ * Return: objref if object is new, 0 otherwise, or an error
+ */
+int ckpt_obj_collect(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+	int first;
+
+	obj = obj_lookup_add(ctx, ptr, type, &first);
+	if (IS_ERR(obj))
+		return PTR_ERR(obj);
+	ckpt_debug("%s objref %d first %d\n",
+		   obj->ops->obj_name, obj->objref, first);
+	return (first ? obj->objref : 0);
+}
+
+/**
  * ckpt_obj_lookup - lookup object (by pointer) in objhash
  * @ctx: checkpoint context
  * @ptr: pointer to object
@@ -255,6 +286,21 @@ int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
 	return obj ? obj->objref : 0;
 }
 
+static inline int obj_reverse_leak(struct ckpt_ctx *ctx, struct ckpt_obj *obj)
+{
+	/*
+	 * A "reverse" leak ?  All objects should already be in the
+	 * objhash by now. But an outside task may have created an
+	 * object while we were collecting, which we didn't catch.
+	 */
+	if (obj->ops->ref_users && !(ctx->uflags & CHECKPOINT_SUBTREE)) {
+		ckpt_write_err(ctx, "OP", "leak: reverse added late (%s)",
+			       obj->objref, obj->ptr, obj->ops->obj_name);
+		return -EBUSY;
+	}
+	return 0;
+}
+
 /**
  * ckpt_obj_lookup_add - lookup object and add if not in objhash
  * @ctx: checkpoint context
@@ -275,7 +321,11 @@ int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
 		return PTR_ERR(obj);
 	ckpt_debug("%s objref %d first %d\n",
 		   obj->ops->obj_name, obj->objref, *first);
-	obj->flags |= CKPT_OBJ_CHECKPOINTED;
+
+	if (*first && obj_reverse_leak(ctx, obj))
+		return -EBUSY;
+
+	obj->flags |= CKPT_OBJ_VISITED;
 	return obj->objref;
 }
 
@@ -315,6 +365,9 @@ int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
 	if (IS_ERR(obj))
 		return PTR_ERR(obj);
 
+	if (new && obj_reverse_leak(ctx, obj))
+		return -EBUSY;
+
 	if (!(obj->flags & CKPT_OBJ_CHECKPOINTED)) {
 		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_OBJREF);
 		if (!h)
@@ -329,14 +382,141 @@ int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
 			return ret;
 
 		/* invoke callback to actually dump the state */
-		if (obj->ops->checkpoint)
-			ret = obj->ops->checkpoint(ctx, ptr);
+		BUG_ON(!obj->ops->checkpoint);
 
 		obj->flags |= CKPT_OBJ_CHECKPOINTED;
+		ret = obj->ops->checkpoint(ctx, ptr);
 	}
+
+	obj->flags |= CKPT_OBJ_VISITED;
 	return (ret < 0 ? ret : obj->objref);
 }
 
+/**
+ * ckpt_obj_visit - mark object as visited
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * [used during checkpoint].
+ * Marks the object as visited, or fails if not found
+ */
+int ckpt_obj_visit(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_ptr(ctx, ptr);
+	BUG_ON(obj && obj->ops->obj_type != type);
+
+	if (!obj) {
+		if (!(ctx->uflags & CHECKPOINT_SUBTREE)) {
+			/* if not found report reverse leak (full container) */
+			ckpt_write_err(ctx, "OP",
+				       "leak: reverse unknown (type %d)",
+				       0, ptr, type);
+			return -EBUSY;
+		}
+	} else {
+		ckpt_debug("visit %s objref %d\n",
+			   obj->ops->obj_name, obj->objref);
+		obj->flags |= CKPT_OBJ_VISITED;
+	}
+	return 0;
+}
+
+/* increment the 'users' count of an object */
+static void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment)
+{
+	struct ckpt_obj *obj;
+
+	obj = obj_find_by_ptr(ctx, ptr);
+	if (obj)
+		obj->users += increment;
+}
+
+/*
+ * "Leak detection" - to guarantee a consistent checkpoint of a full
+ * container we verify that all resources are confined and isolated in
+ * that container:
+ *
+ * c/r code first walks through all tasks and collects all shared
+ * resources into the objhash, while counting the references to them;
+ * then, it compares this count to the object's real reference count,
+ * and if they don't match it means that an object has "leaked" to the
+ * outside.
+ *
+ * Otherwise, it is guaranteed that there are no references outside
+ * (of container). c/r code now proceeds to walk through all tasks,
+ * again, and checkpoints the resources. It ensures that all resources
+ * are already in the objhash, and that all of them are checkpointed.
+ * Otherwise it means that due to a race, an object was created or
+ * destroyed during the first walk but not accounted for.
+ *
+ * For instance, consider an outside task A that shared files_struct
+ * with inside task B. Then, after B's files were collected, A opens
+ * or closes a file, and immediately exits - before the first leak
+ * test is performed, such that the test passes.
+ */
+
+/**
+ * ckpt_obj_contained - test if shared objects are contained in checkpoint
+ * @ctx: checkpoint context
+ *
+ * Loops through all objects in the table and compares the number of
+ * references accumulated during checkpoint, with the reference count
+ * reported by the kernel.
+ *
+ * Return 1 if respective counts match for all objects, 0 otherwise.
+ */
+int ckpt_obj_contained(struct ckpt_ctx *ctx)
+{
+	struct ckpt_obj *obj;
+	struct hlist_node *node;
+
+	/* account for ctx->file reference (if in the table already) */
+	ckpt_obj_users_inc(ctx, ctx->file, 1);
+
+	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
+		if (!obj->ops->ref_users)
+			continue;
+		if (obj->ops->ref_users(obj->ptr) != obj->users) {
+			ckpt_debug("usage leak: %s\n", obj->ops->obj_name);
+			ckpt_write_err(ctx, "OP", "leak: usage (%d != %d) (%s)",
+				       obj->objref, obj->ptr,
+				       obj->ops->ref_users(obj->ptr),
+				       obj->users, obj->ops->obj_name);
+			return 0;
+		}
+	}
+
+	return 1;
+}
+
+/**
+ * ckpt_obj_visited - test that all shared objects were visited
+ * @ctx: checkpoint context
+ *
+ * Return 1 if all objects were visited, 0 otherwise.
+ */
+int ckpt_obj_visited(struct ckpt_ctx *ctx)
+{
+	struct ckpt_obj *obj;
+	struct hlist_node *node;
+
+	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
+		if (!(obj->flags & CKPT_OBJ_VISITED)) {
+			ckpt_debug("reverse leak: %s (%d)\n",
+				   obj->ops->obj_name, obj->objref);
+			ckpt_write_err(ctx, "OP", "leak: not visited (%s)",
+				       obj->objref, obj->ptr,
+				       obj->ops->obj_name);
+			return 0;
+		}
+	}
+
+	return 1;
+}
+
 /**************************************************************************
  * Restart
  */
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 2580b31..5e690d3 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -246,6 +246,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
 /***********************************************************************
  * Restart
  */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 1315884..b698b19 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -96,6 +96,12 @@ extern int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx);
 extern int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h);
 extern int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr,
 			  enum obj_type type);
+extern int ckpt_obj_collect(struct ckpt_ctx *ctx, void *ptr,
+			    enum obj_type type);
+extern int ckpt_obj_contained(struct ckpt_ctx *ctx);
+extern int ckpt_obj_visited(struct ckpt_ctx *ctx);
+extern int ckpt_obj_visit(struct ckpt_ctx *ctx, void *ptr,
+			  enum obj_type type);
 extern int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr,
 			   enum obj_type type);
 extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
@@ -114,6 +120,7 @@ extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
 
 /* task */
 extern int ckpt_activate_next(struct ckpt_ctx *ctx);
+extern int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task(struct ckpt_ctx *ctx);
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 32/80] deferqueue: generic queue to defer work
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (30 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 31/80] c/r: detect resource leaks for whole-container checkpoint Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 33/80] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
                     ` (49 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Add an interface to postpone an action until the end of the entire
checkpoint or restart operation. This is useful when, during the
scan of tasks, an operation cannot be performed in place; deferring
it avoids the need for a second scan.

One use case is restoring an ipc shared memory region that has been
deleted (but is still attached): during restart it needs to be
created, attached and then deleted. However, creation and attachment
are performed in distinct locations, so deletion cannot be performed
on the spot. Instead, this work (delete) is deferred until later.
(This example is in one of the following patches).

This interface allows chronic procrastination in the kernel:

deferqueue_create(void):
    Allocates and returns a new deferqueue.

deferqueue_run(deferqueue):
    Executes all the pending works in the queue. Returns the number
    of works executed, or an error upon the first error reported by
    a deferred work.

deferqueue_add(deferqueue, data, size, func, dtor):
    Enqueue a deferred work. @func is the callback function to
    do the work, which will be called with @data as an argument.
    @size tells the size of data. @dtor is a destructor callback
    that is invoked for deferred works remaining in the queue when
    the queue is destroyed. NOTE: for a given deferred work, @dtor
    is _not_ called if @func was already called (regardless of the
    return value of the latter).

deferqueue_destroy(deferqueue):
    Free the deferqueue and any queued items while invoking the
    @dtor callback for each queued item.
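
The semantics listed above can be sketched in userspace C as follows.
This is not the kernel/deferqueue.c implementation from this patch:
the dq_* names are hypothetical, malloc() stands in for kernel
allocation, there is no locking, and FIFO execution order and the
destructor's signature are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int (*dq_func_t)(void *data);
struct dq_entry {
	struct dq_entry *next;
	dq_func_t func, dtor;
	char data[];		/* private copy of the caller's data */
};

struct dq { struct dq_entry *head; };

static struct dq *dq_create(void)
{
	return calloc(1, sizeof(struct dq));
}

/* Copy @size bytes of @data; @func runs on that copy later (FIFO). */
static int dq_add(struct dq *q, const void *data, size_t size,
		  dq_func_t func, dq_func_t dtor)
{
	struct dq_entry *e, **pp;

	assert(q != NULL && data != NULL && func != NULL);
	e = calloc(1, sizeof(*e) + size);
	if (!e)
		return -1;
	e->func = func;
	e->dtor = dtor;
	memcpy(e->data, data, size);
	for (pp = &q->head; *pp; pp = &(*pp)->next)
		;	/* walk to the end of the list */
	*pp = e;
	return 0;
}

/* Run pending works in order; stop at the first one that fails. */
static int dq_run(struct dq *q)
{
	int ret, nr = 0;

	while (q->head) {
		struct dq_entry *e = q->head;
		q->head = e->next;
		ret = e->func(e->data);
		free(e);	/* @dtor is not called once @func has run */
		if (ret < 0)
			return ret;
		nr++;
	}
	return nr;
}

/* Free the queue, calling @dtor for works that never ran. */
static void dq_destroy(struct dq *q)
{
	while (q->head) {
		struct dq_entry *e = q->head;
		q->head = e->next;
		if (e->dtor)
			e->dtor(e->data);
		free(e);
	}
	free(q);
}

static int dq_sum, dq_dropped;	/* value total; works released by dtor */

static int dq_work(void *data)
{
	int v;
	memcpy(&v, data, sizeof(v));
	dq_sum += v > 0 ? v : 0;
	return v < 0 ? -1 : 0;	/* negative values simulate a failure */
}

static int dq_drop(void *data)
{
	dq_dropped++;
	return 0;
}

/* Queue 1, -1, 5: the second work fails, so the third never runs. */
static int dq_demo(void)
{
	int i, ret = 0, vals[3] = { 1, -1, 5 };
	struct dq *q = dq_create();

	if (!q)
		return -2;
	for (i = 0; i < 3 && ret == 0; i++)
		ret = dq_add(q, &vals[i], sizeof(vals[i]), dq_work, dq_drop);
	if (ret == 0)
		ret = dq_run(q);
	dq_destroy(q);
	return ret;
}
```

In dq_demo() the failing work's error is returned by dq_run(), and the
destructor is invoked only for the one work that was still queued when
the queue was destroyed.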

Why aren't we using the existing kernel workqueue mechanism?  We need
to defer the work until the end of the operation: not earlier, since
we need other things to be in place; not later, to not block waiting
for it. However, the workqueue schedules the work for 'some time
later'. Also, the kernel workqueue may run in any task context, but we
often require that an operation be run in the context of some specific
restarting task (e.g., restoring IPC state of a certain ipc_ns).

Instead, this mechanism is a simple way for the c/r operation as a
whole, and later a task in particular, to defer some action until
later (but not arbitrarily later) _in the restore_ operation.

Changelog[v18]
  - Interface to pass simple pointers as data with deferqueue
Changelog[v17]
  - Fix deferqueue_add() function

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Kconfig         |    5 ++
 include/linux/deferqueue.h |   78 +++++++++++++++++++++++++++++++
 kernel/Makefile            |    1 +
 kernel/deferqueue.c        |  109 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 193 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/deferqueue.h
 create mode 100644 kernel/deferqueue.c

diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
index 21fc86b..4a2c845 100644
--- a/checkpoint/Kconfig
+++ b/checkpoint/Kconfig
@@ -2,10 +2,15 @@
 # implemented the hooks for processor state etc. needed by the
 # core checkpoint/restart code.
 
+config DEFERQUEUE
+	bool
+	default n
+
 config CHECKPOINT
 	bool "Checkpoint/restart (EXPERIMENTAL)"
 	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
 	depends on CGROUP_FREEZER
+	select DEFERQUEUE
 	help
 	  Application checkpoint/restart is the ability to save the
 	  state of a running application so that it can later resume
diff --git a/include/linux/deferqueue.h b/include/linux/deferqueue.h
new file mode 100644
index 0000000..ea3b620
--- /dev/null
+++ b/include/linux/deferqueue.h
@@ -0,0 +1,78 @@
+/*
+ * deferqueue.h --- deferred work queue handling for Linux.
+ */
+
+#ifndef _LINUX_DEFERQUEUE_H
+#define _LINUX_DEFERQUEUE_H
+
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+/*
+ * This interface allows chronic procrastination in the kernel:
+ *
+ * deferqueue_create(void):
+ *     Allocates and returns a new deferqueue.
+ *
+ * deferqueue_run(deferqueue):
+ *     Executes all the pending works in the queue. Returns the number
+ *     of works executed, or an error upon the first error reported by
+ *     a deferred work.
+ *
+ * deferqueue_add(deferqueue, data, size, func, dtor):
+ * 	Enqueue a deferred work. @func is the callback function to
+ *      do the work, which will be called with @data as an argument.
+ *      @size tells the size of data. @dtor is a destructor callback
+ *      that is invoked for deferred works remaining in the queue when
+ *      the queue is destroyed. NOTE: for a given deferred work, @dtor
+ *      is _not_ called if @func was already called (regardless of the
+ *      return value of the latter).
+ *
+ * deferqueue_destroy(deferqueue):
+ *      Free the deferqueue and any queued items while invoking the
+ *      @dtor callback for each queued item.
+ *
+ * The following helpers are useful when @data is a simple pointer:
+ *
+ * deferqueue_add_ptr(deferqueue, ptr, func, dtor):
+ *	Enqueue a deferred work whose data is @ptr.
+ *
+ * deferqueue_data_ptr(data):
+ *	Convert a deferqueue @data to a void * pointer.
+ */
+
+
+typedef int (*deferqueue_func_t)(void *);
+
+struct deferqueue_entry {
+	deferqueue_func_t function;
+	deferqueue_func_t destructor;
+	struct list_head list;
+	char data[0];
+};
+
+struct deferqueue_head {
+	spinlock_t lock;
+	struct list_head list;
+};
+
+struct deferqueue_head *deferqueue_create(void);
+void deferqueue_destroy(struct deferqueue_head *head);
+int deferqueue_add(struct deferqueue_head *head, void *data, int size,
+		   deferqueue_func_t func, deferqueue_func_t dtor);
+int deferqueue_run(struct deferqueue_head *head);
+
+static inline int deferqueue_add_ptr(struct deferqueue_head *head, void *ptr,
+				     deferqueue_func_t func,
+				     deferqueue_func_t dtor)
+{
+	return deferqueue_add(head, &ptr, sizeof(ptr), func, dtor);
+}
+
+static inline void *deferqueue_data_ptr(void *data)
+{
+	return *((void **) data);
+}
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 2093a69..ef229da 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -23,6 +23,7 @@ CFLAGS_REMOVE_cgroup-debug.o = -pg
 CFLAGS_REMOVE_sched_clock.o = -pg
 endif
 
+obj-$(CONFIG_DEFERQUEUE) += deferqueue.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
diff --git a/kernel/deferqueue.c b/kernel/deferqueue.c
new file mode 100644
index 0000000..3fb388b
--- /dev/null
+++ b/kernel/deferqueue.c
@@ -0,0 +1,109 @@
+/*
+ *  Infrastructure to manage deferred work
+ *
+ *  This differs from a workqueue in that the work must be deferred
+ *  until specifically run by the caller.
+ *
+ *  As the only user currently is checkpoint/restart, which has
+ *  very simple usage, the locking is kept simple.  Adding entries
+ *  is protected by the head->lock.  But deferqueue_run() is only
+ *  called once, after all entries have been added.  So it is not
+ *  protected.  Similarly, _destroy is only called once when the
+ *  ckpt_ctx is released, so it is not locked or refcounted.  These
+ *  can of course be added if needed by other users.
+ *
+ *  Why not use a workqueue?  We need to defer work until the end of an
+ *  operation: not earlier, since we need other things to be in place;
+ *  not later, to not block waiting for it. However, the workqueue
+ *  schedules the work for 'some time later'. Also, a workqueue may run
+ *  in any task context, but we often require that an operation
+ *  be run in the context of some specific restarting task (e.g.,
+ *  restoring IPC state of a certain ipc_ns).
+ *
+ *  Instead, this mechanism is a simple way for the c/r operation as a
+ *  whole, and later a task in particular, to defer some action until
+ *  later (but not arbitrarily later) _in the restore_ operation.
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/deferqueue.h>
+
+struct deferqueue_head *deferqueue_create(void)
+{
+	struct deferqueue_head *h = kmalloc(sizeof(*h), GFP_KERNEL);
+	if (h) {
+		spin_lock_init(&h->lock);
+		INIT_LIST_HEAD(&h->list);
+	}
+	return h;
+}
+
+void deferqueue_destroy(struct deferqueue_head *h)
+{
+	if (!list_empty(&h->list)) {
+		struct deferqueue_entry *dq, *n;
+
+		pr_debug("%s: freeing non-empty queue\n", __func__);
+		list_for_each_entry_safe(dq, n, &h->list, list) {
+			dq->destructor(dq->data);
+			list_del(&dq->list);
+			kfree(dq);
+		}
+	}
+	kfree(h);
+}
+
+int deferqueue_add(struct deferqueue_head *head, void *data, int size,
+		   deferqueue_func_t func, deferqueue_func_t dtor)
+{
+	struct deferqueue_entry *dq;
+
+	dq = kmalloc(sizeof(*dq) + size, GFP_KERNEL);
+	if (!dq)
+		return -ENOMEM;
+
+	dq->function = func;
+	dq->destructor = dtor;
+	memcpy(dq->data, data, size);
+
+	pr_debug("%s: adding work %p func %p dtor %p\n",
+		 __func__, dq, func, dtor);
+	spin_lock(&head->lock);
+	list_add_tail(&dq->list, &head->list);
+	spin_unlock(&head->lock);
+	return 0;
+}
+
+/*
+ * deferqueue_run - perform all work in the work queue
+ * @head: deferqueue_head from which to run
+ *
+ * returns: number of works performed, or < 0 on error
+ */
+int deferqueue_run(struct deferqueue_head *head)
+{
+	struct deferqueue_entry *dq, *n;
+	int nr = 0;
+	int ret;
+
+	list_for_each_entry_safe(dq, n, &head->list, list) {
+		pr_debug("doing work %p function %p\n", dq, dq->function);
+		/* don't call destructor - function callback should do it */
+		ret = dq->function(dq->data);
+		if (ret < 0)
+			pr_debug("wq function failed %d\n", ret);
+		list_del(&dq->list);
+		kfree(dq);
+		nr++;
+	}
+
+	return nr;
+}
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 33/80] c/r: introduce new 'file_operations': ->checkpoint, ->collect()
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (31 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 32/80] deferqueue: generic queue to defer work Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 34/80] c/r: dump open file descriptors Oren Laadan
                     ` (48 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.

This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As the ext[234] patches later in this series show, all we have to do
to support a simple case is add a single "generic" f_op entry.

This patch also adds a new 'file_operations' function for 'collecting'
a file for leak detection during full-container checkpoint. This is
useful for files that hold references to other "collectable" objects.
Two examples are pty files, which point to their corresponding tty
objects, and eventpoll files, which refer to the files they monitor.
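
How the c/r core might consult the two new methods can be sketched in
user space. This is an illustration under assumptions, not code from
the series: the reduced struct layouts, the counters in ckpt_ctx, the
-EBADF return for a missing ->checkpoint, and the names
checkpoint_file(), collect_file(), pty_collect() and fops_demo() are
all hypothetical. In this patch generic_file_checkpoint is only a
placeholder defined as NULL, so the function below is a stand-in.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-ins for the kernel types. */
struct ckpt_ctx {
	int nr_saved;		/* files written to the image */
	int nr_collected;	/* extra objects found via ->collect */
};

struct file;

struct file_operations {
	int (*checkpoint)(struct ckpt_ctx *, struct file *);
	int (*collect)(struct ckpt_ctx *, struct file *);
};

struct file {
	const struct file_operations *f_op;
};

/* Stand-in for the "generic" method a simple filesystem would plug in. */
static int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
{
	(void)file;
	ctx->nr_saved++;
	return 0;
}

/* Stand-in for a pty-like file that references another object. */
static int pty_collect(struct ckpt_ctx *ctx, struct file *file)
{
	(void)file;
	ctx->nr_collected++;
	return 0;
}

static const struct file_operations simple_fops = {
	.checkpoint = generic_file_checkpoint,
};

static const struct file_operations pty_fops = {
	.checkpoint = generic_file_checkpoint,
	.collect = pty_collect,
};

static const struct file_operations opaque_fops = {
	.checkpoint = NULL,	/* opted out: cannot be checkpointed */
};

/* Assumed dispatch: a missing ->checkpoint means "unsupported". */
static int checkpoint_file(struct ckpt_ctx *ctx, struct file *file)
{
	if (!file->f_op || !file->f_op->checkpoint)
		return -EBADF;
	return file->f_op->checkpoint(ctx, file);
}

/* Assumed dispatch: a missing ->collect means "nothing else to collect". */
static int collect_file(struct ckpt_ctx *ctx, struct file *file)
{
	if (!file->f_op || !file->f_op->collect)
		return 0;
	return file->f_op->collect(ctx, file);
}

/* Returns 0 when every dispatch above behaves as described. */
int fops_demo(void)
{
	struct ckpt_ctx ctx = { 0, 0 };
	struct file simple = { &simple_fops };
	struct file pty = { &pty_fops };
	struct file opaque = { &opaque_fops };

	if (checkpoint_file(&ctx, &simple) != 0)
		return 1;
	if (checkpoint_file(&ctx, &opaque) != -EBADF)
		return 2;
	if (collect_file(&ctx, &simple) != 0 || collect_file(&ctx, &pty) != 0)
		return 3;
	return (ctx.nr_saved == 1 && ctx.nr_collected == 1) ? 0 : 4;
}
```

Keeping the decision in the f_op table means a file type opts in or
out with a single initializer line, which is the "dirt-simple" property
the commit message asks for.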

Finally, this patch introduces vfs_fcntl() so that it can be called
from restart (see patch adding restart of files).

Changelog[v17]
  - Introduce 'collect' method
  - Forward-declare 'ckpt_ctx' et al., don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 fs/fcntl.c         |   21 +++++++++++++--------
 include/linux/fs.h |    7 +++++++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ae41308..78d3116 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -339,6 +339,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	return err;
 }
 
+int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
+{
+	int err;
+
+	err = security_file_fcntl(filp, cmd, arg);
+	if (err)
+		goto out;
+	err = do_fcntl(fd, cmd, arg, filp);
+ out:
+	return err;
+}
+
 SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 {	
 	struct file *filp;
@@ -348,14 +360,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 	if (!filp)
 		goto out;
 
-	err = security_file_fcntl(filp, cmd, arg);
-	if (err) {
-		fput(filp);
-		return err;
-	}
-
-	err = do_fcntl(fd, cmd, arg, filp);
-
+	err = vfs_fcntl(fd, cmd, arg, filp);
  	fput(filp);
 out:
 	return err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a21f175..02638a7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -388,6 +388,7 @@ struct kstatfs;
 struct vm_area_struct;
 struct vfsmount;
 struct cred;
+struct ckpt_ctx;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1088,6 +1089,8 @@ struct file_lock {
 
 #include <linux/fcntl.h>
 
+extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
+
 extern void send_sigio(struct fown_struct *fown, int fd, int band);
 
 /* fs/sync.c */
@@ -1510,6 +1513,8 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **);
+	int (*checkpoint)(struct ckpt_ctx *, struct file *);
+	int (*collect)(struct ckpt_ctx *, struct file *);
 };
 
 struct inode_operations {
@@ -2311,6 +2316,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#define generic_file_checkpoint NULL
+
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
 extern int vfs_stat(char __user *, struct kstat *);
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 33/80] c/r: introduce new 'file_operations': ->checkpoint, ->collect()
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.

This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As you can see in the ext[234] patches down the road, all that we have
to do to make something simple be supported is add a single "generic"
f_op entry.

Also adds a new 'file_operations' function for 'collecting' a file for
leak-detection during full-container checkpoint. This is useful for
those files that hold references to other "collectable" objects. Two
examples are pty files that point to corresponding tty objects, and
eventpoll files that refer to the files they are monitoring.

Finally, this patch introduces vfs_fcntl() so that it can be called
from restart (see patch adding restart of files).
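
For readers who want to see the intended shape of the hook, here is a
minimal user-space sketch of the dispatch, not the kernel code itself.
The struct layouts are cut down to the two new methods, and every
model_* name is hypothetical, introduced only for illustration. It
assumes the caller refuses files whose f_op has no ->checkpoint, which
is how checkpoint_file() later in this series behaves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the real (opaque) checkpoint context. */
struct ckpt_ctx {
	int nr_written;		/* counts objects "written" by handlers */
};

struct file;

/* Only the two methods added by this patch are modelled. */
struct file_operations {
	int (*checkpoint)(struct ckpt_ctx *, struct file *);
	int (*collect)(struct ckpt_ctx *, struct file *);
};

struct file {
	const struct file_operations *f_op;
};

/* Hypothetical stand-in for a "generic" checkpoint handler. */
static int model_generic_checkpoint(struct ckpt_ctx *ctx, struct file *file)
{
	(void)file;
	ctx->nr_written++;
	return 0;
}

/* A file type opts in by filling in a single f_op entry ... */
static const struct file_operations model_supported_fops = {
	.checkpoint = model_generic_checkpoint,
};

/* ... and stays uncheckpointable by leaving it NULL. */
static const struct file_operations model_unsupported_fops = {
	.checkpoint = NULL,
};

/* Dispatch through f_op; refuse files that lack a handler. */
static int model_checkpoint_file(struct ckpt_ctx *ctx, struct file *file)
{
	assert(ctx && file);
	if (!file->f_op || !file->f_op->checkpoint)
		return -EBADF;
	return file->f_op->checkpoint(ctx, file);
}
```

The point of the design is that support is declared next to the rest
of a file type's operations, so it is less likely to go stale than a
table kept inside the checkpoint code.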

Changelog[v17]
  - Introduce 'collect' method
Changelog[v17]
  - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 fs/fcntl.c         |   21 +++++++++++++--------
 include/linux/fs.h |    7 +++++++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ae41308..78d3116 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -339,6 +339,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	return err;
 }
 
+int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
+{
+	int err;
+
+	err = security_file_fcntl(filp, cmd, arg);
+	if (err)
+		goto out;
+	err = do_fcntl(fd, cmd, arg, filp);
+ out:
+	return err;
+}
+
 SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 {	
 	struct file *filp;
@@ -348,14 +360,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 	if (!filp)
 		goto out;
 
-	err = security_file_fcntl(filp, cmd, arg);
-	if (err) {
-		fput(filp);
-		return err;
-	}
-
-	err = do_fcntl(fd, cmd, arg, filp);
-
+	err = vfs_fcntl(fd, cmd, arg, filp);
  	fput(filp);
 out:
 	return err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a21f175..02638a7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -388,6 +388,7 @@ struct kstatfs;
 struct vm_area_struct;
 struct vfsmount;
 struct cred;
+struct ckpt_ctx;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1088,6 +1089,8 @@ struct file_lock {
 
 #include <linux/fcntl.h>
 
+extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
+
 extern void send_sigio(struct fown_struct *fown, int fd, int band);
 
 /* fs/sync.c */
@@ -1510,6 +1513,8 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **);
+	int (*checkpoint)(struct ckpt_ctx *, struct file *);
+	int (*collect)(struct ckpt_ctx *, struct file *);
 };
 
 struct inode_operations {
@@ -2311,6 +2316,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#define generic_file_checkpoint NULL
+
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
 extern int vfs_stat(char __user *, struct kstat *);
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 34/80] c/r: dump open file descriptors
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (32 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 33/80] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 35/80] c/r: restore " Oren Laadan
                     ` (47 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).

Also provide generic_file_checkpoint() and generic_restore_file(),
which are good for normal files and directories. They do not yet
support unlinked files or directories.
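
To illustrate the objref scheme described above, here is a small
user-space model, not the kernel implementation. It assumes a tiny
linear table in place of the real object hash, and the model_* names
and the 1-based objref numbering are hypothetical. The first time a
pointer is seen it is registered and counted as dumped; later
sightings only return the existing objref.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_OBJS 16

/* Toy replacement for the object hash: a linear table of pointers. */
struct model_objhash {
	const void *ptr[MODEL_MAX_OBJS];
	int nr;			/* number of registered objects */
	int nr_dumped;		/* how many times object state was dumped */
};

/*
 * Return the objref for @ptr (1-based in this model), registering and
 * "dumping" the object only on first sight. Returns -1 if the table
 * is full.
 */
static int model_checkpoint_obj(struct model_objhash *h, const void *ptr)
{
	int i;

	assert(h && ptr);
	for (i = 0; i < h->nr; i++)
		if (h->ptr[i] == ptr)
			return i + 1;	/* already known: reuse objref */
	if (h->nr == MODEL_MAX_OBJS)
		return -1;
	h->ptr[h->nr++] = ptr;
	h->nr_dumped++;			/* state is written exactly once */
	return h->nr;
}
```

With two descriptors that refer to the same file (for example after
dup()), both descriptor records would carry the same objref while the
file state itself appears only once in the image.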

Changelog[v18]:
  - Add a few more ckpt_write_err()s
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Introduce ckpt_collect_file() that also uses file->collect method
  - In collect_file_table() use retval from ckpt_obj_collect() to
    test for first-time-object
Changelog[v17]:
  - Only collect sub-objects of files_struct once
  - Better file error debugging
  - Use (new) d_unlinked()
Changelog[v16]:
  - Fix compile warning in checkpoint_bad()
Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  ckpt_write_files() => checkpoint_fd_table()
  - Rename:  ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   11 +
 checkpoint/files.c               |  417 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   52 +++++
 checkpoint/process.c             |   33 +++-
 checkpoint/sys.c                 |    8 +
 include/linux/checkpoint.h       |   19 ++
 include/linux/checkpoint_hdr.h   |   49 +++++
 include/linux/checkpoint_types.h |    5 +
 include/linux/fs.h               |    4 +
 10 files changed, 599 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/files.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	objhash.o \
 	checkpoint.o \
 	restart.o \
-	process.o
+	process.o \
+	files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c21646d..4cc2a2f 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -18,6 +18,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fs_struct.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -673,6 +674,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
+	struct fs_struct *fs;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -714,6 +716,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
+	/* root vfs (FIX: WILL CHANGE with mnt-ns etc) */
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
+	read_lock(&fs->lock);
+	ctx->fs_mnt = fs->root;
+	path_get(&ctx->fs_mnt);
+	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
+
 	return 0;
 }
 
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..a554cbc
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,417 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * ckpt_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *len);
+	spin_unlock(&dcache_lock);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = ckpt_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		ret = ckpt_write_obj_type(ctx, fname, flen,
+					  CKPT_HDR_FILE_NAME);
+	} else {
+		ret = PTR_ERR(fname);
+		ckpt_write_err(ctx, "TEP", "obtain filename (file)", ret);
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_fname(ctx, &file->f_path, &ctx->fs_mnt);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+	int ret;
+
+	if (!file->f_op || !file->f_op->checkpoint) {
+		ckpt_write_err(ctx, "TEPS", "f_op lacks checkpoint",
+			       -EBADF, file, file->f_op);
+		ckpt_debug("f_op lacks checkpoint handler: %pS\n", file->f_op);
+		return -EBADF;
+	}
+	if (d_unlinked(file->f_dentry)) {
+		ckpt_write_err(ctx, "TEP", "unlinked file", -EBADF, file);
+		ckpt_debug("unlinked files are unsupported\n");
+		return -EBADF;
+	}
+
+	ret = file->f_op->checkpoint(ctx, file);
+	if (ret < 0)
+		ckpt_write_err(ctx, "TEP", "file checkpoint failed", ret, file);
+	return ret;
+}
+
+/**
+ * checkpoint_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls checkpoint_file() to dump the file pointer too.
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	ret = -EBADF;
+	if (!file) {
+		pr_warning("c/r: file descriptor gone?\n");
+		ckpt_write_err(ctx, "TEP", "file gone? (%d)", ret, file, fd);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+out:
+	ckpt_hdr_put(ctx, h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+static int do_checkpoint_file_table(struct ckpt_ctx *ctx,
+				    struct files_struct *files)
+{
+	struct ckpt_hdr_file_table *h;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (!h)
+		return -ENOMEM;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	kfree(fdtable);
+	return ret;
+}
+
+/* checkpoint callback for file table */
+int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_file_table(ctx, (struct files_struct *) ptr);
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int objref;
+
+	files = get_files_struct(t);
+	if (!files)
+		return -EBUSY;
+	objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+	put_files_struct(files);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file)
+{
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+	if (ret <= 0)
+		return ret;
+	/* if first time for this file (ret > 0), invoke ->collect() */
+	if (file->f_op && file->f_op->collect)
+		ret = file->f_op->collect(ctx, file);
+	if (ret < 0)
+		ckpt_write_err(ctx, "TEP", "file collect", ret, file);
+	return ret;
+}
+
+static int collect_file_desc(struct ckpt_ctx *ctx,
+			     struct files_struct *files, int fd)
+{
+	struct fdtable *fdt;
+	struct file *file;
+	int ret;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file)
+		get_file(file);
+	rcu_read_unlock();
+
+	if (!file) {
+		ckpt_write_err(ctx, "TE", "file removed", -EBUSY, file);
+		return -EBUSY;
+	}
+
+	ret = ckpt_collect_file(ctx, file);
+	fput(file);
+
+	return ret;
+}
+
+static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files)
+{
+	int *fdtable;
+	int nfds, n;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this file table (ret > 0), proceed inside */
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+
+	for (n = 0; n < nfds; n++) {
+		ret = collect_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+	kfree(fdtable);
+	return ret;
+}
+
+int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int ret;
+
+	files = get_files_struct(t);
+	if (!files) {
+		ckpt_write_err(ctx, "TE", "files_struct missing", -EBUSY);
+		return -EBUSY;
+	}
+	ret = collect_file_table(ctx, files);
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index dd1f3e5..cefbab6 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,8 @@
 
 #include <linux/kernel.h>
 #include <linux/hash.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_file_table_grab(void *ptr)
+{
+	atomic_inc(&((struct files_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_file_table_drop(void *ptr, int lastref)
+{
+	put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_table_users(void *ptr)
+{
+	return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr, int lastref)
+{
+	fput((struct file *) ptr);
+}
+
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* files_struct object */
+	{
+		.obj_name = "FILE_TABLE",
+		.obj_type = CKPT_OBJ_FILE_TABLE,
+		.ref_drop = obj_file_table_drop,
+		.ref_grab = obj_file_table_grab,
+		.ref_users = obj_file_table_users,
+		.checkpoint = checkpoint_file_table,
+	},
+	/* file object */
+	{
+		.obj_name = "FILE",
+		.obj_type = CKPT_OBJ_FILE,
+		.ref_drop = obj_file_drop,
+		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
+		.checkpoint = checkpoint_file,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 5e690d3..042dc45 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int files_objref;
+	int ret;
+
+	files_objref = checkpoint_obj_file_table(ctx, t);
+	ckpt_debug("files: objref %d\n", files_objref);
+	if (files_objref < 0) {
+		ckpt_write_err(ctx, "TE", "files_struct", files_objref);
+		return files_objref;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+	h->files_objref = files_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 /* dump the task_struct of a given task */
 int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -231,6 +254,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (t->exit_state)
 		return 0;
 
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = checkpoint_thread(ctx, t);
 	ckpt_debug("thread %d\n", ret);
 	if (ret < 0)
@@ -248,7 +275,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
-	return 0;
+	int ret;
+
+	ret = ckpt_collect_file_table(ctx, t);
+
+	return ret;
 }
 
 /***********************************************************************
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index d16d48f..1373ff9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -191,10 +191,14 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	BUG_ON(atomic_read(&ctx->refcount));
 
+	if (ctx->files_deferq)
+		deferqueue_destroy(ctx->files_deferq);
+
 	if (ctx->file)
 		fput(ctx->file);
 
 	ckpt_obj_hash_free(ctx);
+	path_put(&ctx->fs_mnt);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -237,6 +241,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
+	ctx->files_deferq = deferqueue_create();
+	if (!ctx->files_deferq)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index b698b19..6fa5035 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -70,6 +70,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
@@ -137,6 +140,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -147,6 +165,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
+#define CKPT_DFILE	0x10		/* files and filesystem */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 7a4015b..1124375 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -54,12 +54,18 @@ enum {
 
 	CKPT_HDR_TREE = 101,
 	CKPT_HDR_TASK,
+	CKPT_HDR_TASK_OBJS,
 	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
 
 	/* 201-299: reserved for arch-dependent */
 
+	CKPT_HDR_FILE_TABLE = 301,
+	CKPT_HDR_FILE_DESC,
+	CKPT_HDR_FILE_NAME,
+	CKPT_HDR_FILE,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -80,6 +86,8 @@ struct ckpt_hdr_objref {
 /* shared objects types */
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
+	CKPT_OBJ_FILE_TABLE,
+	CKPT_OBJ_FILE,
 	CKPT_OBJ_MAX
 };
 
@@ -157,6 +165,12 @@ struct ckpt_hdr_task {
 	__u64 robust_futex_list; /* a __user ptr */
 } __attribute__((aligned(8)));
 
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+} __attribute__((aligned(8)));
+
 /* restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
@@ -178,4 +192,39 @@ enum restart_block_type {
 	CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+	CKPT_FILE_GENERIC,
+	CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index f11fd07..795742f 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -14,6 +14,8 @@
 
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/list.h>
+#include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
 #include <linux/wait.h>
@@ -39,6 +41,9 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *files_deferq;	/* deferred file-table work */
+
+	struct path fs_mnt;     /* container root (FIXME) */
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 02638a7..5ec844f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2316,7 +2316,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
 #define generic_file_checkpoint NULL
+#endif
 
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
-- 
1.6.0.4


* [PATCH v18 34/80] c/r: dump open file descriptors
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, they are assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).
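
To illustrate the objref idea for readers new to the series, here is a
minimal userspace sketch, not code from this patch. It assumes a tiny
linear table instead of the real object hash; struct obj_table,
obj_lookup_add() and OBJ_MAX are hypothetical names. The first time a
pointer is seen it gets a new objref and is counted as dumped; later
sightings return the same objref without dumping again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the object hash: a linear table. */
#define OBJ_MAX 64

struct obj_table {
	const void *ptr[OBJ_MAX];	/* registered object pointers */
	int nobjs;			/* number of entries in use */
	int ndumped;			/* objects whose state was written */
};

/*
 * Return the objref for ptr, registering (and "dumping") it on first
 * sight. objrefs start at 1; returns -1 when the table is full.
 */
static int obj_lookup_add(struct obj_table *t, const void *ptr)
{
	int i;

	for (i = 0; i < t->nobjs; i++)
		if (t->ptr[i] == ptr)
			return i + 1;	/* seen before: reuse its objref */

	if (t->nobjs == OBJ_MAX)
		return -1;

	t->ptr[t->nobjs++] = ptr;
	t->ndumped++;			/* first time: state written once */
	assert(t->ndumped == t->nobjs);
	return t->nobjs;
}
```

Two descriptors that share one 'struct file' (e.g. after dup(2)) would
both resolve to the same objref here, so the file state appears in the
image only once.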

Also provide generic_file_checkpoint() and generic_restore_file(),
which are good for normal files and directories. Unlinked files and
directories are not supported yet.
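
As a rough userspace analogy (not code from this series), part of the
per-file state saved by checkpoint_file_common(), namely the flags and
the position, plus the close-on-exec bit stored per descriptor, can be
read with fcntl(2) and lseek(2). struct fd_snapshot and snapshot_fd()
are made-up names for illustration; f_mode and f_version have no direct
userspace equivalent and are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record, loosely mirroring ckpt_hdr_file{,_desc}. */
struct fd_snapshot {
	int32_t  fd;		/* descriptor number */
	uint32_t close_on_exec;	/* 1 if FD_CLOEXEC is set */
	uint32_t flags;		/* file status flags and access mode */
	uint64_t pos;		/* current file offset */
};

/* Fill *s from an open, seekable descriptor; returns 0, or -1 on error. */
static int snapshot_fd(int fd, struct fd_snapshot *s)
{
	int fdflags, flflags;
	off_t pos;

	assert(s != NULL);

	fdflags = fcntl(fd, F_GETFD);		/* descriptor flags */
	flflags = fcntl(fd, F_GETFL);		/* open-file flags */
	pos = lseek(fd, 0, SEEK_CUR);		/* query offset, no move */
	if (fdflags < 0 || flflags < 0 || pos == (off_t)-1)
		return -1;

	s->fd = fd;
	s->close_on_exec = !!(fdflags & FD_CLOEXEC);
	s->flags = (uint32_t)flflags;
	s->pos = (uint64_t)pos;
	return 0;
}
```

Note the split: close-on-exec belongs to the descriptor, while flags and
position belong to the shared open file, which matches the separate
ckpt_hdr_file_desc and ckpt_hdr_file records in the patch.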

Changelog[v18]:
  - Add a few more ckpt_write_err()s
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Introduce ckpt_collect_file() that also uses file->collect method
  - In collect_file_table() use retval from ckpt_obj_collect() to
    test for first-time-object
Changelog[v17]:
  - Only collect sub-objects of files_struct once
  - Better file error debugging
  - Use (new) d_unlinked()
Changelog[v16]:
  - Fix compile warning in checkpoint_bad()
Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  ckpt_write_files() => checkpoint_fd_table()
  - Rename:  ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if it hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
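
For reviewers skimming the diff: scan_fds() below grows its array by
doubling and resumes the scan at the slot where it stopped, which relies
on the tasks sharing the table being frozen. A userspace sketch of that
growth pattern follows. It is only an illustration under assumptions:
scan_open_slots() and INITIAL_GUESS are hypothetical names, open[] stands
in for the fd table, and realloc() stands in for krealloc(). The sketch
also frees the old buffer when realloc() fails.

```c
#include <assert.h>
#include <stdlib.h>

#define INITIAL_GUESS 4		/* deliberately small; the patch uses 256 */

/*
 * Collect the indices of non-zero slots in open[0..max-1].
 * On success returns the count and stores a malloc'ed array in *out
 * (the caller frees it); returns -1 if memory runs out.
 */
static int scan_open_slots(const int *open, int max, int **out)
{
	int *fds = NULL, *tmp;
	int i = 0, n = 0;
	int tot = INITIAL_GUESS;

	assert(out != NULL);

	for (;;) {
		tmp = realloc(fds, tot * sizeof(*fds));
		if (!tmp) {
			free(fds);
			return -1;
		}
		fds = tmp;

		/* resume at slot i; fds[0..n-1] are still valid */
		for (; i < max && n < tot; i++)
			if (open[i])
				fds[n++] = i;
		if (i == max)
			break;
		tot *= 2;	/* array filled before the scan finished */
	}

	*out = fds;
	return n;
}
```

Resuming at i instead of restarting from 0 is correct only while nothing
else can change the table between passes; without that guarantee the
collected indices would have to be revalidated, as the scan_fds() comment
asks callers to do with fcheck_files().
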
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   11 +
 checkpoint/files.c               |  417 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   52 +++++
 checkpoint/process.c             |   33 +++-
 checkpoint/sys.c                 |    8 +
 include/linux/checkpoint.h       |   19 ++
 include/linux/checkpoint_hdr.h   |   49 +++++
 include/linux/checkpoint_types.h |    5 +
 include/linux/fs.h               |    4 +
 10 files changed, 599 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/files.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	objhash.o \
 	checkpoint.o \
 	restart.o \
-	process.o
+	process.o \
+	files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c21646d..4cc2a2f 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -18,6 +18,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fs_struct.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -673,6 +674,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
+	struct fs_struct *fs;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -714,6 +716,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
+	/* root vfs (FIX: WILL CHANGE with mnt-ns etc) */
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
+	read_lock(&fs->lock);
+	ctx->fs_mnt = fs->root;
+	path_get(&ctx->fs_mnt);
+	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
+
 	return 0;
 }
 
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..a554cbc
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,417 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * ckpt_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *len);
+	spin_unlock(&dcache_lock);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = ckpt_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		ret = ckpt_write_obj_type(ctx, fname, flen,
+					  CKPT_HDR_FILE_NAME);
+	} else {
+		ret = PTR_ERR(fname);
+		ckpt_write_err(ctx, "TEP", "obtain filename (file)", ret);
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_fname(ctx, &file->f_path, &ctx->fs_mnt);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+	int ret;
+
+	if (!file->f_op || !file->f_op->checkpoint) {
+		ckpt_write_err(ctx, "TEPS", "f_op lacks checkpoint",
+			       -EBADF, file, file->f_op);
+		ckpt_debug("f_op lacks checkpoint handler: %pS\n", file->f_op);
+		return -EBADF;
+	}
+	if (d_unlinked(file->f_dentry)) {
+		ckpt_write_err(ctx, "TEP", "unlinked file", -EBADF, file);
+		ckpt_debug("unlinked files are unsupported\n");
+		return -EBADF;
+	}
+
+	ret = file->f_op->checkpoint(ctx, file);
+	if (ret < 0)
+		ckpt_write_err(ctx, "TEP", "file checkpoint failed", ret, file);
+	return ret;
+}
+
+/**
+ * checkpoint_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls checkpoint_file() to dump the file pointer too.
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	ret = -EBADF;
+	if (!file) {
+		pr_warning("c/r: file descriptor gone?\n");
+		ckpt_write_err(ctx, "TEP", "file gone? (%d)", ret, file, fd);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+out:
+	ckpt_hdr_put(ctx, h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+static int do_checkpoint_file_table(struct ckpt_ctx *ctx,
+				    struct files_struct *files)
+{
+	struct ckpt_hdr_file_table *h;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (!h)
+		return -ENOMEM;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	kfree(fdtable);
+	return ret;
+}
+
+/* checkpoint callback for file table */
+int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_file_table(ctx, (struct files_struct *) ptr);
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int objref;
+
+	files = get_files_struct(t);
+	if (!files)
+		return -EBUSY;
+	objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+	put_files_struct(files);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file)
+{
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+	if (ret <= 0)
+		return ret;
+	/* if first time for this file (ret > 0), invoke ->collect() */
+	if (file->f_op->collect)
+		ret = file->f_op->collect(ctx, file);
+	if (ret < 0)
+		ckpt_write_err(ctx, "TEP", "file collect", ret, file);
+	return ret;
+}
+
+static int collect_file_desc(struct ckpt_ctx *ctx,
+			     struct files_struct *files, int fd)
+{
+	struct fdtable *fdt;
+	struct file *file;
+	int ret;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file)
+		get_file(file);
+	rcu_read_unlock();
+
+	if (!file) {
+		ckpt_write_err(ctx, "TE", "file removed", -EBUSY, file);
+		return -EBUSY;
+	}
+
+	ret = ckpt_collect_file(ctx, file);
+	fput(file);
+
+	return ret;
+}
+
+static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files)
+{
+	int *fdtable;
+	int nfds, n;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this file table (ret > 0), proceed inside */
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+
+	for (n = 0; n < nfds; n++) {
+		ret = collect_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+	kfree(fdtable);
+	return ret;
+}
+
+int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int ret;
+
+	files = get_files_struct(t);
+	if (!files) {
+		ckpt_write_err(ctx, "TE", "files_struct missing", -EBUSY);
+		return -EBUSY;
+	}
+	ret = collect_file_table(ctx, files);
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index dd1f3e5..cefbab6 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,8 @@
 
 #include <linux/kernel.h>
 #include <linux/hash.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_file_table_grab(void *ptr)
+{
+	atomic_inc(&((struct files_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_file_table_drop(void *ptr, int lastref)
+{
+	put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_table_users(void *ptr)
+{
+	return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr, int lastref)
+{
+	fput((struct file *) ptr);
+}
+
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* files_struct object */
+	{
+		.obj_name = "FILE_TABLE",
+		.obj_type = CKPT_OBJ_FILE_TABLE,
+		.ref_drop = obj_file_table_drop,
+		.ref_grab = obj_file_table_grab,
+		.ref_users = obj_file_table_users,
+		.checkpoint = checkpoint_file_table,
+	},
+	/* file object */
+	{
+		.obj_name = "FILE",
+		.obj_type = CKPT_OBJ_FILE,
+		.ref_drop = obj_file_drop,
+		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
+		.checkpoint = checkpoint_file,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 5e690d3..042dc45 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int files_objref;
+	int ret;
+
+	files_objref = checkpoint_obj_file_table(ctx, t);
+	ckpt_debug("files: objref %d\n", files_objref);
+	if (files_objref < 0) {
+		ckpt_write_err(ctx, "TE", "files_struct", files_objref);
+		return files_objref;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+	h->files_objref = files_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 /* dump the task_struct of a given task */
 int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -231,6 +254,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (t->exit_state)
 		return 0;
 
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = checkpoint_thread(ctx, t);
 	ckpt_debug("thread %d\n", ret);
 	if (ret < 0)
@@ -248,7 +275,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
-	return 0;
+	int ret;
+
+	ret = ckpt_collect_file_table(ctx, t);
+
+	return ret;
 }
 
 /***********************************************************************
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index d16d48f..1373ff9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -191,10 +191,14 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	BUG_ON(atomic_read(&ctx->refcount));
 
+	if (ctx->files_deferq)
+		deferqueue_destroy(ctx->files_deferq);
+
 	if (ctx->file)
 		fput(ctx->file);
 
 	ckpt_obj_hash_free(ctx);
+	path_put(&ctx->fs_mnt);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -237,6 +241,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
+	ctx->files_deferq = deferqueue_create();
+	if (!ctx->files_deferq)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index b698b19..6fa5035 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -70,6 +70,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
@@ -137,6 +140,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -147,6 +165,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
+#define CKPT_DFILE	0x10		/* files and filesystem */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 7a4015b..1124375 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -54,12 +54,18 @@ enum {
 
 	CKPT_HDR_TREE = 101,
 	CKPT_HDR_TASK,
+	CKPT_HDR_TASK_OBJS,
 	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
 
 	/* 201-299: reserved for arch-dependent */
 
+	CKPT_HDR_FILE_TABLE = 301,
+	CKPT_HDR_FILE_DESC,
+	CKPT_HDR_FILE_NAME,
+	CKPT_HDR_FILE,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -80,6 +86,8 @@ struct ckpt_hdr_objref {
 /* shared objects types */
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
+	CKPT_OBJ_FILE_TABLE,
+	CKPT_OBJ_FILE,
 	CKPT_OBJ_MAX
 };
 
@@ -157,6 +165,12 @@ struct ckpt_hdr_task {
 	__u64 robust_futex_list; /* a __user ptr */
 } __attribute__((aligned(8)));
 
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+} __attribute__((aligned(8)));
+
 /* restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
@@ -178,4 +192,39 @@ enum restart_block_type {
 	CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+	CKPT_FILE_GENERIC,
+	CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index f11fd07..795742f 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -14,6 +14,8 @@
 
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/list.h>
+#include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
 #include <linux/wait.h>
@@ -39,6 +41,9 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *files_deferq;	/* deferred file-table work */
+
+	struct path fs_mnt;     /* container root (FIXME) */
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 02638a7..5ec844f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2316,7 +2316,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
 #define generic_file_checkpoint NULL
+#endif
 
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
-- 
1.6.0.4



* [PATCH v18 34/80] c/r: dump open file descriptors
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, they are assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).

Also provide generic_file_checkpoint() and generic_restore_file(),
which are good for normal files and directories. Unlinked files and
directories are not supported yet.

Changelog[v18]:
  - Add a few more ckpt_write_err()s
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Introduce ckpt_collect_file() that also uses file->collect method
  - In collect_file_table() use retval from ckpt_obj_collect() to
    test for first-time-object
Changelog[v17]:
  - Only collect sub-objects of files_struct once
  - Better file error debugging
  - Use (new) d_unlinked()
Changelog[v16]:
  - Fix compile warning in checkpoint_bad()
Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  ckpt_write_files() => checkpoint_fd_table()
  - Rename:  ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if it hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   11 +
 checkpoint/files.c               |  417 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   52 +++++
 checkpoint/process.c             |   33 +++-
 checkpoint/sys.c                 |    8 +
 include/linux/checkpoint.h       |   19 ++
 include/linux/checkpoint_hdr.h   |   49 +++++
 include/linux/checkpoint_types.h |    5 +
 include/linux/fs.h               |    4 +
 10 files changed, 599 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/files.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	objhash.o \
 	checkpoint.o \
 	restart.o \
-	process.o
+	process.o \
+	files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c21646d..4cc2a2f 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -18,6 +18,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fs_struct.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -673,6 +674,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
+	struct fs_struct *fs;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -714,6 +716,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
+	/* root vfs (FIX: WILL CHANGE with mnt-ns etc) */
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
+	read_lock(&fs->lock);
+	ctx->fs_mnt = fs->root;
+	path_get(&ctx->fs_mnt);
+	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
+
 	return 0;
 }
 
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..a554cbc
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,417 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * ckpt_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *len);
+	spin_unlock(&dcache_lock);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = ckpt_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		ret = ckpt_write_obj_type(ctx, fname, flen,
+					  CKPT_HDR_FILE_NAME);
+	} else {
+		ret = PTR_ERR(fname);
+		ckpt_write_err(ctx, "TEP", "obtain filename (file)", ret);
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_fname(ctx, &file->f_path, &ctx->fs_mnt);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+	int ret;
+
+	if (!file->f_op || !file->f_op->checkpoint) {
+		ckpt_write_err(ctx, "TEPS", "f_op lacks checkpoint",
+			       -EBADF, file, file->f_op);
+		ckpt_debug("f_op lacks checkpoint handler: %pS\n", file->f_op);
+		return -EBADF;
+	}
+	if (d_unlinked(file->f_dentry)) {
+		ckpt_write_err(ctx, "TEP", "unlinked file", -EBADF, file);
+		ckpt_debug("unlinked files are unsupported\n");
+		return -EBADF;
+	}
+
+	ret = file->f_op->checkpoint(ctx, file);
+	if (ret < 0)
+		ckpt_write_err(ctx, "TEP", "file checkpoint failed", ret, file);
+	return ret;
+}
+
+/**
+ * checkpoint_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls checkpoint_file() to dump the file pointer too.
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	ret = -EBADF;
+	if (!file) {
+		pr_warning("c/r: file descriptor gone?\n");
+		ckpt_write_err(ctx, "TEP", "file gone? (%d)", ret, file, fd);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+out:
+	ckpt_hdr_put(ctx, h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+static int do_checkpoint_file_table(struct ckpt_ctx *ctx,
+				    struct files_struct *files)
+{
+	struct ckpt_hdr_file_table *h;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (!h)
+		return -ENOMEM;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	kfree(fdtable);
+	return ret;
+}
+
+/* checkpoint callback for file table */
+int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_file_table(ctx, (struct files_struct *) ptr);
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int objref;
+
+	files = get_files_struct(t);
+	if (!files)
+		return -EBUSY;
+	objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+	put_files_struct(files);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file)
+{
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+	if (ret <= 0)
+		return ret;
+	/* if first time for this file (ret > 0), invoke ->collect() */
+	if (file->f_op->collect)
+		ret = file->f_op->collect(ctx, file);
+	if (ret < 0)
+		ckpt_write_err(ctx, "TEP", "file collect", ret, file);
+	return ret;
+}
+
+static int collect_file_desc(struct ckpt_ctx *ctx,
+			     struct files_struct *files, int fd)
+{
+	struct fdtable *fdt;
+	struct file *file;
+	int ret;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file)
+		get_file(file);
+	rcu_read_unlock();
+
+	if (!file) {
+		ckpt_write_err(ctx, "TE", "file removed", -EBUSY, file);
+		return -EBUSY;
+	}
+
+	ret = ckpt_collect_file(ctx, file);
+	fput(file);
+
+	return ret;
+}
+
+static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files)
+{
+	int *fdtable;
+	int nfds, n;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this file table (ret > 0), proceed inside */
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+
+	for (n = 0; n < nfds; n++) {
+		ret = collect_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+	kfree(fdtable);
+	return ret;
+}
+
+int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int ret;
+
+	files = get_files_struct(t);
+	if (!files) {
+		ckpt_write_err(ctx, "TE", "files_struct missing", -EBUSY);
+		return -EBUSY;
+	}
+	ret = collect_file_table(ctx, files);
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index dd1f3e5..cefbab6 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,8 @@
 
 #include <linux/kernel.h>
 #include <linux/hash.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_file_table_grab(void *ptr)
+{
+	atomic_inc(&((struct files_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_file_table_drop(void *ptr, int lastref)
+{
+	put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_table_users(void *ptr)
+{
+	return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr, int lastref)
+{
+	fput((struct file *) ptr);
+}
+
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* files_struct object */
+	{
+		.obj_name = "FILE_TABLE",
+		.obj_type = CKPT_OBJ_FILE_TABLE,
+		.ref_drop = obj_file_table_drop,
+		.ref_grab = obj_file_table_grab,
+		.ref_users = obj_file_table_users,
+		.checkpoint = checkpoint_file_table,
+	},
+	/* file object */
+	{
+		.obj_name = "FILE",
+		.obj_type = CKPT_OBJ_FILE,
+		.ref_drop = obj_file_drop,
+		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
+		.checkpoint = checkpoint_file,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 5e690d3..042dc45 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int files_objref;
+	int ret;
+
+	files_objref = checkpoint_obj_file_table(ctx, t);
+	ckpt_debug("files: objref %d\n", files_objref);
+	if (files_objref < 0) {
+		ckpt_write_err(ctx, "TE", "files_struct", files_objref);
+		return files_objref;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+	h->files_objref = files_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 /* dump the task_struct of a given task */
 int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -231,6 +254,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (t->exit_state)
 		return 0;
 
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = checkpoint_thread(ctx, t);
 	ckpt_debug("thread %d\n", ret);
 	if (ret < 0)
@@ -248,7 +275,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
-	return 0;
+	int ret;
+
+	ret = ckpt_collect_file_table(ctx, t);
+
+	return ret;
 }
 
 /***********************************************************************
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index d16d48f..1373ff9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -191,10 +191,14 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	BUG_ON(atomic_read(&ctx->refcount));
 
+	if (ctx->files_deferq)
+		deferqueue_destroy(ctx->files_deferq);
+
 	if (ctx->file)
 		fput(ctx->file);
 
 	ckpt_obj_hash_free(ctx);
+	path_put(&ctx->fs_mnt);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -237,6 +241,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
+	ctx->files_deferq = deferqueue_create();
+	if (!ctx->files_deferq)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index b698b19..6fa5035 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -70,6 +70,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
@@ -137,6 +140,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -147,6 +165,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
+#define CKPT_DFILE	0x10		/* files and filesystem */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 7a4015b..1124375 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -54,12 +54,18 @@ enum {
 
 	CKPT_HDR_TREE = 101,
 	CKPT_HDR_TASK,
+	CKPT_HDR_TASK_OBJS,
 	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
 
 	/* 201-299: reserved for arch-dependent */
 
+	CKPT_HDR_FILE_TABLE = 301,
+	CKPT_HDR_FILE_DESC,
+	CKPT_HDR_FILE_NAME,
+	CKPT_HDR_FILE,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -80,6 +86,8 @@ struct ckpt_hdr_objref {
 /* shared objects types */
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
+	CKPT_OBJ_FILE_TABLE,
+	CKPT_OBJ_FILE,
 	CKPT_OBJ_MAX
 };
 
@@ -157,6 +165,12 @@ struct ckpt_hdr_task {
 	__u64 robust_futex_list; /* a __user ptr */
 } __attribute__((aligned(8)));
 
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+} __attribute__((aligned(8)));
+
 /* restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
@@ -178,4 +192,39 @@ enum restart_block_type {
 	CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+	CKPT_FILE_GENERIC,
+	CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index f11fd07..795742f 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -14,6 +14,8 @@
 
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/list.h>
+#include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
 #include <linux/wait.h>
@@ -39,6 +41,9 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *files_deferq;	/* deferred file-table work */
+
+	struct path fs_mnt;     /* container root (FIXME) */
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 02638a7..5ec844f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2316,7 +2316,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
 #define generic_file_checkpoint NULL
+#endif
 
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 35/80] c/r: restore open file descriptors
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (33 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 34/80] c/r: dump open file descriptors Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 36/80] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
                     ` (46 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

For each fd, read 'struct ckpt_hdr_file_desc' and look up its objref in
the hash table. If it is not found (first occurrence), read in
'struct ckpt_hdr_file', create a new file and register it in the hash.
Otherwise, attach the file pointer from the hash as an fd.
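
The lookup described above can be sketched in user space as follows.
This is only an illustration of the "restore on first occurrence, reuse
afterwards" idea; it does not mirror ckpt_obj_fetch() or the kernel
objhash. All sketch_* names, the fixed-size table and the toy restore
callback are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for the objhash: objref -> restored object. */
#define SKETCH_MAX_OBJS 64

struct sketch_obj_table {
	void *objs[SKETCH_MAX_OBJS];	/* index = objref, NULL = not seen yet */
	int restored;			/* number of times the callback ran */
};

/* Callback that builds the object the first time its objref is seen. */
typedef void *(*sketch_restore_fn)(int objref);

/*
 * Return the object for @objref. On first occurrence, call @restore and
 * register the result; afterwards, return the registered pointer.
 * Returns NULL for an invalid objref or if @restore fails.
 */
static void *sketch_obj_fetch(struct sketch_obj_table *t, int objref,
			      sketch_restore_fn restore)
{
	if (objref <= 0 || objref >= SKETCH_MAX_OBJS)
		return NULL;
	if (!t->objs[objref]) {
		void *obj = restore(objref);

		if (!obj)
			return NULL;
		t->objs[objref] = obj;
		t->restored++;
	}
	return t->objs[objref];
}

/* Toy restore callback: hands out one static slot per objref. */
static void *sketch_restore_file(int objref)
{
	static int slots[SKETCH_MAX_OBJS];

	slots[objref] = objref;
	return &slots[objref];
}
```

Two descriptors that carry the same objref end up sharing one object,
which is how a file shared across fds (or tasks) is restored only once.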

Changelog[v18]:
  - Invoke set_close_on_exec() unconditionally on restart
Changelog[v17]:
  - Validate f_mode after restore against saved f_mode
  - Fail if f_flags have O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - Introduce a per file-type restore() callback
  - Revert change to pr_debug(), back to ckpt_debug()
  - Rename:  restore_files() => restore_fd_table()
  - Rename:  ckpt_read_fd_data() => restore_file()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'hh->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)
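
The two v17 validation items (rejecting dangerous f_flags and checking
f_mode against the saved value) can be tried out in isolation with the
user-space sketch below. It follows the logic of restore_open_fname()
and restore_file_common() in the diff, but the sketch_* names and the
SKETCH_FMODE_* constants are stand-ins introduced for the example, not
kernel definitions.

```c
#include <errno.h>
#include <fcntl.h>

/* Stand-ins for the kernel's FMODE_READ / FMODE_WRITE bits. */
#define SKETCH_FMODE_READ	0x1u
#define SKETCH_FMODE_WRITE	0x2u

/* Refuse flags that would let a crafted image create or truncate files. */
static int sketch_check_open_flags(int flags)
{
	if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
		return -EINVAL;
	return 0;
}

/*
 * Compare the mode of the freshly opened file (@new_mode) with the mode
 * recorded at checkpoint (@saved_mode). A file that was downgraded from
 * writable to read-only before the checkpoint is tolerated; any other
 * mismatch is an error.
 */
static int sketch_validate_mode(unsigned int new_mode, unsigned int saved_mode)
{
	if ((new_mode & SKETCH_FMODE_WRITE) &&
	    !(saved_mode & SKETCH_FMODE_WRITE)) {
		new_mode &= ~SKETCH_FMODE_WRITE;
		if (!(new_mode & SKETCH_FMODE_READ) &&
		    (saved_mode & SKETCH_FMODE_READ))
			new_mode |= SKETCH_FMODE_READ;
	}
	/* at this point the two modes must agree bit for bit */
	return (new_mode ^ saved_mode) ? -EINVAL : 0;
}
```

For instance, a file reopened read-write whose saved mode is read-only
passes the check, while a file reopened read-only whose saved mode is
read-write is rejected.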

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/files.c         |  313 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c       |    2 +
 checkpoint/process.c       |   20 +++
 include/linux/checkpoint.h |    7 +
 4 files changed, 342 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index a554cbc..3cdfdb3 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -16,6 +16,8 @@
 #include <linux/sched.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -415,3 +417,314 @@ int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	return ret;
 }
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * restore_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ */
+struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
+{
+	struct ckpt_hdr *h;
+	struct file *file;
+	char *fname;
+
+	/* prevent bad input from doing bad things */
+	if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
+		return ERR_PTR(-EINVAL);
+
+	h = ckpt_read_buf_type(ctx, PATH_MAX, CKPT_HDR_FILE_NAME);
+	if (IS_ERR(h))
+		return (struct file *) h;
+	fname = (char *) (h + 1);
+	ckpt_debug("fname '%s' flags %#x\n", fname, flags);
+
+	file = filp_open(fname, flags, 0);
+	ckpt_hdr_put(ctx, h);
+
+	return file;
+}
+
+static int close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		get_file(file);
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CKPT_SETFL_MASK  \
+	(O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
+
+int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			struct ckpt_hdr_file *h)
+{
+	fmode_t new_mode = file->f_mode;
+	fmode_t saved_mode = (__force fmode_t) h->f_mode;
+	int ret;
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
+	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Normally f_mode is set by open, and modified only via
+	 * fcntl(), so its value now should match that at checkpoint.
+	 * However, a file may be downgraded from (read-)write to
+	 * read-only, e.g:
+	 *  - mark_files_ro() unsets FMODE_WRITE
+	 *  - nfs4_file_downgrade() too, and also sets FMODE_READ
+	 * Validate the new f_mode against saved f_mode, allowing:
+	 *  - new with FMODE_WRITE, saved without FMODE_WRITE
+	 *  - new without FMODE_READ, saved with FMODE_READ
+	 */
+	if ((new_mode & FMODE_WRITE) && !(saved_mode & FMODE_WRITE)) {
+		new_mode &= ~FMODE_WRITE;
+		if (!(new_mode & FMODE_READ) && (saved_mode & FMODE_READ))
+			new_mode |= FMODE_READ;
+	}
+	/* finally, at this point new mode should match saved mode */
+	if (new_mode ^ saved_mode)
+		return -EINVAL;
+
+	if (file->f_mode & FMODE_LSEEK)
+		ret = vfs_llseek(file, h->f_pos, SEEK_SET);
+
+	return ret;
+}
+
+static struct file *generic_file_restore(struct ckpt_ctx *ctx,
+					 struct ckpt_hdr_file *ptr)
+{
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC)
+		return ERR_PTR(-EINVAL);
+
+	file = restore_open_fname(ctx, ptr->f_flags);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+	return file;
+}
+
+struct restore_file_ops {
+	char *file_name;
+	enum file_type file_type;
+	struct file * (*restore) (struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_file *ptr);
+};
+
+static struct restore_file_ops restore_file_ops[] = {
+	/* ignored file */
+	{
+		.file_name = "IGNORE",
+		.file_type = CKPT_FILE_IGNORE,
+		.restore = NULL,
+	},
+	/* regular file/directory */
+	{
+		.file_name = "GENERIC",
+		.file_type = CKPT_FILE_GENERIC,
+		.restore = generic_file_restore,
+	},
+};
+
+static struct file *do_restore_file(struct ckpt_ctx *ctx)
+{
+	struct restore_file_ops *ops;
+	struct ckpt_hdr_file *h;
+	struct file *file = ERR_PTR(-EINVAL);
+
+	/*
+	 * All 'struct ckpt_hdr_file_...' begin with ckpt_hdr_file,
+	 * but the actual object depends on the file type. The length
+	 * should never be more than a page.
+	 */
+	h = ckpt_read_buf_type(ctx, PAGE_SIZE, CKPT_HDR_FILE);
+	if (IS_ERR(h))
+		return (struct file *) h;
+	ckpt_debug("flags %#x mode %#x type %d\n",
+		 h->f_flags, h->f_mode, h->f_type);
+
+	if (h->f_type >= CKPT_FILE_MAX)
+		goto out;
+
+	ops = &restore_file_ops[h->f_type];
+	BUG_ON(ops->file_type != h->f_type);
+
+	if (ops->restore)
+		file = ops->restore(ctx, h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return file;
+}
+
+/* restore callback for file pointer */
+void *restore_file(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_file(ctx);
+}
+
+/**
+ * restore_file_desc - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls restore_file to restore the file too.
+ */
+static int restore_file_desc(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file;
+	int newfd, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+	ckpt_debug("ref %d fd %d c.o.e %d\n",
+		 h->fd_objref, h->fd_descriptor, h->fd_close_on_exec);
+
+	ret = -EINVAL;
+	if (h->fd_objref <= 0 || h->fd_descriptor < 0)
+		goto out;
+
+	file = ckpt_obj_fetch(ctx, h->fd_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	newfd = attach_file(file);
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	ckpt_debug("newfd got %d wanted %d\n", newfd, h->fd_descriptor);
+
+	/* reposition if newfd isn't desired fd */
+	if (newfd != h->fd_descriptor) {
+		ret = sys_dup2(newfd, h->fd_descriptor);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	set_close_on_exec(h->fd_descriptor, h->fd_close_on_exec);
+	ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* restore callback for file table */
+static struct files_struct *do_restore_file_table(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_table *h;
+	struct files_struct *files;
+	int i, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (IS_ERR(h))
+		return (struct files_struct *) h;
+
+	ckpt_debug("nfds %d\n", h->fdt_nfds);
+
+	ret = -EMFILE;
+	if (h->fdt_nfds < 0 || h->fdt_nfds > sysctl_nr_open)
+		goto out;
+
+	/*
+	 * We assume that restarting tasks, as created in user-space,
+	 * have distinct files_struct objects each. If not, we need to
+	 * call dup_fd() to make sure we don't overwrite an already
+	 * restored one.
+	 */
+
+	/* point of no return -- close all file descriptors */
+	ret = close_all_fds(current->files);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < h->fdt_nfds; i++) {
+		ret = restore_file_desc(ctx);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	if (!ret) {
+		files = current->files;
+		atomic_inc(&files->count);
+	} else {
+		files = ERR_PTR(ret);
+	}
+	return files;
+}
+
+void *restore_file_table(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_file_table(ctx);
+}
+
+int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref)
+{
+	struct files_struct *files;
+
+	files = ckpt_obj_fetch(ctx, files_objref, CKPT_OBJ_FILE_TABLE);
+	if (IS_ERR(files))
+		return PTR_ERR(files);
+
+	if (files != current->files) {
+		task_lock(current);
+		put_files_struct(current->files);
+		current->files = files;
+		task_unlock(current);
+		atomic_inc(&files->count);
+	}
+
+	return 0;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index cefbab6..b7c8fdb 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -112,6 +112,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_grab = obj_file_table_grab,
 		.ref_users = obj_file_table_users,
 		.checkpoint = checkpoint_file_table,
+		.restore = restore_file_table,
 	},
 	/* file object */
 	{
@@ -121,6 +122,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_grab = obj_file_grab,
 		.ref_users = obj_file_users,
 		.checkpoint = checkpoint_file,
+		.restore = restore_file,
 	},
 };
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 042dc45..6ad9c01 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -349,6 +349,22 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_objs(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task_objs *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = restore_obj_file_table(ctx, h->files_objref);
+	ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 int restore_restart_block(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_restart_block *h;
@@ -468,6 +484,10 @@ int restore_task(struct ckpt_ctx *ctx)
 	if (ret)
 		goto out;
 
+	ret = restore_task_objs(ctx);
+	ckpt_debug("objs %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = restore_thread(ctx);
 	ckpt_debug("thread %d\n", ret);
 	if (ret < 0)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 6fa5035..026d058 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -144,16 +144,23 @@ extern int restore_restart_block(struct ckpt_ctx *ctx);
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
 				     struct task_struct *t);
+extern int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref);
 extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_file_table(struct ckpt_ctx *ctx);
 
 /* files */
 extern int checkpoint_fname(struct ckpt_ctx *ctx,
 			    struct path *path, struct path *root);
+extern struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags);
+
 extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
 extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_file(struct ckpt_ctx *ctx);
 
 extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 				  struct ckpt_hdr_file *h);
+extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			       struct ckpt_hdr_file *h);
 
 static inline int ckpt_validate_errno(int errno)
 {
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 35/80] c/r: restore open file descriptors
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

For each fd, read 'struct ckpt_hdr_file_desc' and look up its objref in
the hash table. If it is not found (first occurrence), read in
'struct ckpt_hdr_file', create a new file and register it in the hash.
Otherwise, attach the file pointer from the hash as an fd.
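
Attaching a file at the descriptor number recorded in the image has a
direct user-space analogue: obtain any free fd, move it to the wanted
number with dup2(), close the temporary one, then apply the saved
close-on-exec bit. The sketch below uses only POSIX calls; the name
sketch_place_fd() is hypothetical, and it illustrates the idea rather
than the kernel code path in restore_file_desc().

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Move an already-open descriptor @fd to the number @wanted and apply
 * the saved close-on-exec bit. Returns @wanted on success, -1 on error.
 */
static int sketch_place_fd(int fd, int wanted, int close_on_exec)
{
	int flags;

	/* reposition only if we did not already get the desired number */
	if (fd != wanted) {
		if (dup2(fd, wanted) < 0)
			return -1;
		close(fd);
	}

	/* set or clear FD_CLOEXEC unconditionally, as the v18 change does */
	flags = fcntl(wanted, F_GETFD);
	if (flags < 0)
		return -1;
	if (close_on_exec)
		flags |= FD_CLOEXEC;
	else
		flags &= ~FD_CLOEXEC;
	if (fcntl(wanted, F_SETFD, flags) < 0)
		return -1;

	return wanted;
}
```

Applying the close-on-exec bit in both directions matters because
dup2() clears FD_CLOEXEC on the new descriptor, so a saved value of 1
must be reapplied explicitly.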

Changelog[v18]:
  - Invoke set_close_on_exec() unconditionally on restart
Changelog[v17]:
  - Validate f_mode after restore against saved f_mode
  - Fail if f_flags have O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - Introduce a per file-type restore() callback
  - Revert change to pr_debug(), back to ckpt_debug()
  - Rename:  restore_files() => restore_fd_table()
  - Rename:  ckpt_read_fd_data() => restore_file()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'hh->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)
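
The two v17 input checks listed in the changelog (reject dangerous open
flags, then validate the restored f_mode against the saved one) can be
sketched in user-space C as follows. This is a hedged illustration
only: the demo_* names and the DEMO_MODE_* bits are hypothetical
stand-ins for the kernel's FMODE_READ/FMODE_WRITE, and the real checks
live in restore_open_fname() and restore_file_common().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* Hypothetical stand-ins for the kernel's FMODE_READ / FMODE_WRITE bits. */
#define DEMO_MODE_READ  0x1u
#define DEMO_MODE_WRITE 0x2u

/* Refuse flags that would let a crafted image create or truncate files. */
static int demo_check_open_flags(int flags)
{
	if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
		return -EINVAL;
	return 0;
}

/*
 * Compare the mode obtained by re-opening (@new_mode) with the mode
 * recorded at checkpoint (@saved_mode), tolerating a file that had
 * been downgraded from (read-)write to read-only before checkpoint.
 */
static int demo_validate_mode(unsigned int new_mode, unsigned int saved_mode)
{
	if ((new_mode & DEMO_MODE_WRITE) && !(saved_mode & DEMO_MODE_WRITE)) {
		new_mode &= ~DEMO_MODE_WRITE;
		if (!(new_mode & DEMO_MODE_READ) && (saved_mode & DEMO_MODE_READ))
			new_mode |= DEMO_MODE_READ;
	}
	return (new_mode ^ saved_mode) ? -EINVAL : 0;
}

/* Small self-check showing accepted and rejected combinations. */
static void demo_selfcheck(void)
{
	const unsigned int R = DEMO_MODE_READ, W = DEMO_MODE_WRITE;

	assert(demo_validate_mode(R | W, R | W) == 0);   /* unchanged */
	assert(demo_validate_mode(R | W, R) == 0);       /* downgraded to read-only */
	assert(demo_validate_mode(W, R) == 0);           /* write-only reopened, saved read-only */
	assert(demo_validate_mode(R, R | W) == -EINVAL); /* upgrade is never accepted */
	assert(demo_check_open_flags(O_RDWR | O_TRUNC) == -EINVAL);
}
```

The check is deliberately one-directional: a reopened file may hold
more write permission than the saved mode records, but never less
than the saved mode requires.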

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/files.c         |  313 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c       |    2 +
 checkpoint/process.c       |   20 +++
 include/linux/checkpoint.h |    7 +
 4 files changed, 342 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index a554cbc..3cdfdb3 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -16,6 +16,8 @@
 #include <linux/sched.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -415,3 +417,314 @@ int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	return ret;
 }
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * restore_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ */
+struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
+{
+	struct ckpt_hdr *h;
+	struct file *file;
+	char *fname;
+
+	/* prevent bad input from doing bad things */
+	if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
+		return ERR_PTR(-EINVAL);
+
+	h = ckpt_read_buf_type(ctx, PATH_MAX, CKPT_HDR_FILE_NAME);
+	if (IS_ERR(h))
+		return (struct file *) h;
+	fname = (char *) (h + 1);
+	ckpt_debug("fname '%s' flags %#x\n", fname, flags);
+
+	file = filp_open(fname, flags, 0);
+	ckpt_hdr_put(ctx, h);
+
+	return file;
+}
+
+static int close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		get_file(file);
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CKPT_SETFL_MASK  \
+	(O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
+
+int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			struct ckpt_hdr_file *h)
+{
+	fmode_t new_mode = file->f_mode;
+	fmode_t saved_mode = (__force fmode_t) h->f_mode;
+	int ret;
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
+	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Normally f_mode is set by open, and modified only via
+	 * fcntl(), so its value now should match that at checkpoint.
+	 * However, a file may be downgraded from (read-)write to
+	 * read-only, e.g:
+	 *  - mark_files_ro() unsets FMODE_WRITE
+	 *  - nfs4_file_downgrade() too, and also sets FMODE_READ
+	 * Validate the new f_mode against saved f_mode, allowing:
+	 *  - new with FMODE_WRITE, saved without FMODE_WRITE
+	 *  - new without FMODE_READ, saved with FMODE_READ
+	 */
+	if ((new_mode & FMODE_WRITE) && !(saved_mode & FMODE_WRITE)) {
+		new_mode &= ~FMODE_WRITE;
+		if (!(new_mode & FMODE_READ) && (saved_mode & FMODE_READ))
+			new_mode |= FMODE_READ;
+	}
+	/* finally, at this point new mode should match saved mode */
+	if (new_mode ^ saved_mode)
+		return -EINVAL;
+
+	if (file->f_mode & FMODE_LSEEK)
+		ret = vfs_llseek(file, h->f_pos, SEEK_SET);
+
+	return ret;
+}
+
+static struct file *generic_file_restore(struct ckpt_ctx *ctx,
+					 struct ckpt_hdr_file *ptr)
+{
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC)
+		return ERR_PTR(-EINVAL);
+
+	file = restore_open_fname(ctx, ptr->f_flags);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+	return file;
+}
+
+struct restore_file_ops {
+	char *file_name;
+	enum file_type file_type;
+	struct file * (*restore) (struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_file *ptr);
+};
+
+static struct restore_file_ops restore_file_ops[] = {
+	/* ignored file */
+	{
+		.file_name = "IGNORE",
+		.file_type = CKPT_FILE_IGNORE,
+		.restore = NULL,
+	},
+	/* regular file/directory */
+	{
+		.file_name = "GENERIC",
+		.file_type = CKPT_FILE_GENERIC,
+		.restore = generic_file_restore,
+	},
+};
+
+static struct file *do_restore_file(struct ckpt_ctx *ctx)
+{
+	struct restore_file_ops *ops;
+	struct ckpt_hdr_file *h;
+	struct file *file = ERR_PTR(-EINVAL);
+
+	/*
+	 * All 'struct ckpt_hdr_file_...' begin with ckpt_hdr_file,
+	 * but the actual object depends on the file type. The length
+	 * should never be more than a page.
+	 */
+	h = ckpt_read_buf_type(ctx, PAGE_SIZE, CKPT_HDR_FILE);
+	if (IS_ERR(h))
+		return (struct file *) h;
+	ckpt_debug("flags %#x mode %#x type %d\n",
+		 h->f_flags, h->f_mode, h->f_type);
+
+	if (h->f_type >= CKPT_FILE_MAX)
+		goto out;
+
+	ops = &restore_file_ops[h->f_type];
+	BUG_ON(ops->file_type != h->f_type);
+
+	if (ops->restore)
+		file = ops->restore(ctx, h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return file;
+}
+
+/* restore callback for file pointer */
+void *restore_file(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_file(ctx);
+}
+
+/**
+ * restore_file_desc - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls restore_file to restore the file too.
+ */
+static int restore_file_desc(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file;
+	int newfd, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+	ckpt_debug("ref %d fd %d c.o.e %d\n",
+		 h->fd_objref, h->fd_descriptor, h->fd_close_on_exec);
+
+	ret = -EINVAL;
+	if (h->fd_objref <= 0 || h->fd_descriptor < 0)
+		goto out;
+
+	file = ckpt_obj_fetch(ctx, h->fd_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	newfd = attach_file(file);
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	ckpt_debug("newfd got %d wanted %d\n", newfd, h->fd_descriptor);
+
+	/* reposition if newfd isn't desired fd */
+	if (newfd != h->fd_descriptor) {
+		ret = sys_dup2(newfd, h->fd_descriptor);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	set_close_on_exec(h->fd_descriptor, h->fd_close_on_exec);
+	ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* restore callback for file table */
+static struct files_struct *do_restore_file_table(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_table *h;
+	struct files_struct *files;
+	int i, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (IS_ERR(h))
+		return (struct files_struct *) h;
+
+	ckpt_debug("nfds %d\n", h->fdt_nfds);
+
+	ret = -EMFILE;
+	if (h->fdt_nfds < 0 || h->fdt_nfds > sysctl_nr_open)
+		goto out;
+
+	/*
+	 * We assume that restarting tasks, as created in user-space,
+	 * have distinct files_struct objects each. If not, we need to
+	 * call dup_fd() to make sure we don't overwrite an already
+	 * restored one.
+	 */
+
+	/* point of no return -- close all file descriptors */
+	ret = close_all_fds(current->files);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < h->fdt_nfds; i++) {
+		ret = restore_file_desc(ctx);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	if (!ret) {
+		files = current->files;
+		atomic_inc(&files->count);
+	} else {
+		files = ERR_PTR(ret);
+	}
+	return files;
+}
+
+void *restore_file_table(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_file_table(ctx);
+}
+
+int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref)
+{
+	struct files_struct *files;
+
+	files = ckpt_obj_fetch(ctx, files_objref, CKPT_OBJ_FILE_TABLE);
+	if (IS_ERR(files))
+		return PTR_ERR(files);
+
+	if (files != current->files) {
+		task_lock(current);
+		put_files_struct(current->files);
+		current->files = files;
+		task_unlock(current);
+		atomic_inc(&files->count);
+	}
+
+	return 0;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index cefbab6..b7c8fdb 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -112,6 +112,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_grab = obj_file_table_grab,
 		.ref_users = obj_file_table_users,
 		.checkpoint = checkpoint_file_table,
+		.restore = restore_file_table,
 	},
 	/* file object */
 	{
@@ -121,6 +122,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_grab = obj_file_grab,
 		.ref_users = obj_file_users,
 		.checkpoint = checkpoint_file,
+		.restore = restore_file,
 	},
 };
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 042dc45..6ad9c01 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -349,6 +349,22 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_objs(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task_objs *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = restore_obj_file_table(ctx, h->files_objref);
+	ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 int restore_restart_block(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_restart_block *h;
@@ -468,6 +484,10 @@ int restore_task(struct ckpt_ctx *ctx)
 	if (ret)
 		goto out;
 
+	ret = restore_task_objs(ctx);
+	ckpt_debug("objs %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = restore_thread(ctx);
 	ckpt_debug("thread %d\n", ret);
 	if (ret < 0)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 6fa5035..026d058 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -144,16 +144,23 @@ extern int restore_restart_block(struct ckpt_ctx *ctx);
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
 				     struct task_struct *t);
+extern int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref);
 extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_file_table(struct ckpt_ctx *ctx);
 
 /* files */
 extern int checkpoint_fname(struct ckpt_ctx *ctx,
 			    struct path *path, struct path *root);
+extern struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags);
+
 extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
 extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_file(struct ckpt_ctx *ctx);
 
 extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 				  struct ckpt_hdr_file *h);
+extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			       struct ckpt_hdr_file *h);
 
 static inline int ckpt_validate_errno(int errno)
 {
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 36/80] c/r: introduce method '->checkpoint()' in struct vm_operations_struct
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (34 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 35/80] c/r: restore " Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 37/80] c/r: dump memory address space (private memory) Oren Laadan
                     ` (45 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Changelog[v17]:
  - Forward-declare 'ckpt_ctx' et al.; don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/mm.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a72cc7..d5ace89 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,6 +19,7 @@ struct file_ra_state;
 struct user_struct;
 struct writeback_control;
 struct rlimit;
+struct ckpt_ctx;
 
 #ifndef CONFIG_DISCONTIGMEM          /* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -218,6 +219,9 @@ struct vm_operations_struct {
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct ckpt_ctx *ctx, struct vm_area_struct *vma);
+#endif
 };
 
 struct mmu_gather;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 36/80] c/r: introduce method '->checkpoint()' in struct vm_operations_struct
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Changelog[v17]:
  - Forward-declare 'ckpt_ctx' et al.; don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/mm.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a72cc7..d5ace89 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,6 +19,7 @@ struct file_ra_state;
 struct user_struct;
 struct writeback_control;
 struct rlimit;
+struct ckpt_ctx;
 
 #ifndef CONFIG_DISCONTIGMEM          /* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -218,6 +219,9 @@ struct vm_operations_struct {
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct ckpt_ctx *ctx, struct vm_area_struct *vma);
+#endif
 };
 
 struct mmu_gather;
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 37/80] c/r: dump memory address space (private memory)
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (35 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 36/80] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 38/80] c/r: restore " Oren Laadan
                     ` (44 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

For each vma, there is a 'struct ckpt_vma'. Then come the actual
contents, in one or more chunks: each chunk begins with a header that
specifies how many pages it holds, then the virtual addresses of all
the dumped pages in that chunk, followed by the actual contents of all
dumped pages. A header with a page count of zero marks the end of the
contents. Then comes the next vma, and so on.
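
The chunk layout described above can be sketched as a small user-space
writer and reader over a memory buffer. This is an illustration under
assumptions, not the on-disk format of this patch: the demo_* names,
the 32-bit count field, the 64-bit address type and the tiny
DEMO_PAGE_SIZE are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16 /* tiny "page" so the example stays small */

/* Append one chunk: page count, then all addresses, then all page contents. */
static size_t demo_put_chunk(uint8_t *buf, size_t off, uint32_t nr,
			     const uint64_t *vaddrs, const uint8_t *pages)
{
	assert(nr > 0);	/* a zero count is reserved for the end marker */
	memcpy(buf + off, &nr, sizeof(nr));
	off += sizeof(nr);
	memcpy(buf + off, vaddrs, nr * sizeof(*vaddrs));
	off += nr * sizeof(*vaddrs);
	memcpy(buf + off, pages, (size_t)nr * DEMO_PAGE_SIZE);
	off += (size_t)nr * DEMO_PAGE_SIZE;
	return off;
}

/* Append the terminating header, whose page count is zero. */
static size_t demo_put_end(uint8_t *buf, size_t off)
{
	uint32_t nr = 0;

	memcpy(buf + off, &nr, sizeof(nr));
	return off + sizeof(nr);
}

/* Walk chunks until the zero-count header; return total pages, or -1 if truncated. */
static long demo_count_pages(const uint8_t *buf, size_t len)
{
	size_t off = 0;
	long total = 0;

	for (;;) {
		uint32_t nr;
		size_t body;

		if (len - off < sizeof(nr))
			return -1;
		memcpy(&nr, buf + off, sizeof(nr));
		off += sizeof(nr);
		if (nr == 0)
			return total;	/* end-of-contents marker */
		body = (size_t)nr * (sizeof(uint64_t) + DEMO_PAGE_SIZE);
		if (len - off < body)
			return -1;
		off += body;
		total += nr;
	}
}
```

Grouping all addresses of a chunk ahead of its page contents lets a
reader learn where every page belongs before it touches the data, and
the zero-count header means the writer never has to know the total
number of pages in advance.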

To checkpoint a vma, call the ops->checkpoint() method of that vma.
Normally the per-vma function will invoke generic_vma_checkpoint()
which first writes the vma description, followed by the specific
logic to dump the contents of the pages.
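
The dispatch described above can be sketched in user-space C as a
method table with a shared helper. All demo_* names are hypothetical,
and returning -ENOSYS for a vma without a callback is an assumption
made for this sketch, not something this changelog states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_ctx { int headers_written; int pages_dumped; };
struct demo_vma;

struct demo_vm_ops {
	int (*checkpoint)(struct demo_ctx *ctx, struct demo_vma *vma);
};

struct demo_vma {
	const struct demo_vm_ops *ops;
	int nr_pages;
};

/* Shared first step: record the vma description. */
static int demo_generic_vma_checkpoint(struct demo_ctx *ctx, struct demo_vma *vma)
{
	(void)vma;
	ctx->headers_written++;
	return 0;
}

/* A per-type callback: description first, then type-specific contents. */
static int demo_private_checkpoint(struct demo_ctx *ctx, struct demo_vma *vma)
{
	int ret = demo_generic_vma_checkpoint(ctx, vma);

	if (ret < 0)
		return ret;
	ctx->pages_dumped += vma->nr_pages;
	return 0;
}

static const struct demo_vm_ops demo_private_ops = {
	.checkpoint = demo_private_checkpoint,
};

/* Dispatch through the method table; assumption: no callback means unsupported. */
static int demo_checkpoint_vma(struct demo_ctx *ctx, struct demo_vma *vma)
{
	assert(ctx && vma);
	if (!vma->ops || !vma->ops->checkpoint)
		return -ENOSYS;
	return vma->ops->checkpoint(ctx, vma);
}
```

Keeping the description step in one helper means every mapping type
emits the same header, while the callback alone decides how (or
whether) page contents follow.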

Currently, for private mapped memory, we save the pathname of the file
that is mapped (restart uses it to re-open the file and then map it).
A later patch changes that to reference a file object.

Changelog[v18]:
  - Tighten checks on supported vma to checkpoint or restart
  - Add a few more ckpt_write_err()s
  - [Serge Hallyn] Export filemap_checkpoint() (used later for ext4)
  - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
  - In collect_mm() use retval from ckpt_obj_collect() to test for
    first-time-object
Changelog[v17]:
  - Only collect sub-objects of mm_struct once
  - Save mm->{flags,def_flags,saved_auxv}
Changelog[v16]:
  - Precede vaddrs/pages with a buffer header
  - Checkpoint mm->exe_file
  - Handle shared task->mm
Changelog[v14]:
  - Modify the ops->checkpoint method to be much more powerful
  - Improve support for VDSO (with special_mapping checkpoint callback)
  - Save new field 'vdso' in mm_context
  - Revert change to pr_debug(), back to ckpt_debug()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'
Changelog[v13]:
  - pgprot_t is an abstract type; use the proper accessor (fix for
    64-bit powerpc) (Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>)
Changelog[v12]:
  - Hide pgarr management inside ckpt_private_vma_fill_pgarr()
  - Fix management of pgarr chain reset and alloc/expand: keep empty
    pgarr in a pool chain
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory
Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in ckpt_fill_name()
Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem
    namespace boundary). For now ckpt_fill_fname() fails the checkpoint.
Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)
Changelog[v5]:
  - Improve memory dump code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
Changelog[v4]:
  - Use standard list_... for ckpt_pgarr

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/x86/include/asm/checkpoint_hdr.h |    8 +
 arch/x86/mm/checkpoint.c              |   31 ++
 checkpoint/Makefile                   |    3 +-
 checkpoint/checkpoint.c               |    3 +
 checkpoint/memory.c                   |  697 +++++++++++++++++++++++++++++++++
 checkpoint/objhash.c                  |   25 ++
 checkpoint/process.c                  |   12 +
 checkpoint/sys.c                      |    3 +
 include/linux/checkpoint.h            |   26 ++
 include/linux/checkpoint_hdr.h        |   52 +++
 include/linux/checkpoint_types.h      |    5 +
 include/linux/mm.h                    |    5 +
 mm/filemap.c                          |   25 ++
 mm/mmap.c                             |   28 ++
 14 files changed, 922 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/memory.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index f4d1e14..0e756b0 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -45,6 +45,7 @@
 /* arch dependent header types */
 enum {
 	CKPT_HDR_CPU_FPU = 201,
+	CKPT_HDR_MM_CONTEXT_LDT,
 };
 
 struct ckpt_hdr_header_arch {
@@ -118,4 +119,11 @@ struct ckpt_hdr_cpu {
 #define CKPT_X86_SEG_TLS	0x4000	/* 0100 0000 0000 00xx */
 #define CKPT_X86_SEG_LDT	0x8000	/* 100x xxxx xxxx xxxx */
 
+struct ckpt_hdr_mm_context {
+	struct ckpt_hdr h;
+	__u64 vdso;
+	__u32 ldt_entry_size;
+	__u32 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 023039f..83ed047 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -324,6 +324,37 @@ int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (!h)
+		return -ENOMEM;
+
+	mutex_lock(&mm->context.lock);
+
+	h->vdso = (unsigned long) mm->context.vdso;
+	h->ldt_entry_size = LDT_ENTRY_SIZE;
+	h->nldt = mm->context.size;
+
+	ckpt_debug("nldt %d vdso %#llx\n", h->nldt, h->vdso);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj_type(ctx, mm->context.ldt,
+				  mm->context.size * LDT_ENTRY_SIZE,
+				  CKPT_HDR_MM_CONTEXT_LDT);
+ out:
+	mutex_unlock(&mm->context.lock);
+	return ret;
+}
+
 /**************************************************************************
  * Restart
  */
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 1d0c058..f56a7d6 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -8,4 +8,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	checkpoint.o \
 	restart.o \
 	process.o \
-	files.o
+	files.o \
+	memory.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 4cc2a2f..f907485 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -280,10 +280,13 @@ int ckpt_write_err(struct ckpt_ctx *ctx, char *pre, char *fmt, ...)
 static void fill_kernel_const(struct ckpt_const *h)
 {
 	struct task_struct *tsk;
+	struct mm_struct *mm;
 	struct new_utsname *uts;
 
 	/* task */
 	h->task_comm_len = sizeof(tsk->comm);
+	/* mm */
+	h->mm_saved_auxv_len = sizeof(mm->saved_auxv);
 	/* uts */
 	h->uts_release_len = sizeof(uts->release);
 	h->uts_version_len = sizeof(uts->version);
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
new file mode 100644
index 0000000..d19d627
--- /dev/null
+++ b/checkpoint/memory.c
@@ -0,0 +1,697 @@
+/*
+ *  Checkpoint/restart memory contents
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DMEM
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/proc_fs.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * page-array chains: each ckpt_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct ckpt_pgarr {
+	unsigned long *vaddrs;
+	struct page **pages;
+	unsigned int nr_used;
+	struct list_head list;
+};
+
+#define CKPT_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
+#define CKPT_PGARR_BATCH  (16 * CKPT_PGARR_TOTAL)
+
+static inline int pgarr_is_full(struct ckpt_pgarr *pgarr)
+{
+	return (pgarr->nr_used == CKPT_PGARR_TOTAL);
+}
+
+static inline int pgarr_nr_free(struct ckpt_pgarr *pgarr)
+{
+	return CKPT_PGARR_TOTAL - pgarr->nr_used;
+}
+
+/*
+ * utilities to alloc, free, and handle 'struct ckpt_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has two members for page-arrays:
+ *   ctx->pgarr_list: list head of populated page-array chain
+ *   ctx->pgarr_pool: list head of empty page-array pool chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * Before the next chunk of pages, the chain is reset (by dropping the
+ * references to all pages) but not freed; empty descriptors go to the pool.
+ *
+ * The head of the chain page-array ("current") advances as necessary. When
+ * it gets full, a new page-array descriptor is pushed in front of it. The
+ * new descriptor is taken from first empty descriptor (if one exists, for
+ * instance, after a chain reset), or allocated on-demand.
+ *
+ * When dumping the data, the chain is traversed in reverse order.
+ */
+
+/* return first page-array in the chain */
+static inline struct ckpt_pgarr *pgarr_first(struct ckpt_ctx *ctx)
+{
+	if (list_empty(&ctx->pgarr_list))
+		return NULL;
+	return list_first_entry(&ctx->pgarr_list, struct ckpt_pgarr, list);
+}
+
+/* return (and detach) first empty page-array in the pool, if exists */
+static inline struct ckpt_pgarr *pgarr_from_pool(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	if (list_empty(&ctx->pgarr_pool))
+		return NULL;
+	pgarr = list_first_entry(&ctx->pgarr_pool, struct ckpt_pgarr, list);
+	list_del(&pgarr->list);
+	return pgarr;
+}
+
+/* release pages referenced by a page-array */
+static void pgarr_release_pages(struct ckpt_pgarr *pgarr)
+{
+	ckpt_debug("total pages %d\n", pgarr->nr_used);
+	/*
+	 * both checkpoint and restart use 'nr_used', however we only
+	 * collect pages during checkpoint; in restart we simply return
+	 * because pgarr->pages remains NULL.
+	 */
+	if (pgarr->pages) {
+		struct page **pages = pgarr->pages;
+		int nr = pgarr->nr_used;
+
+		while (nr--)
+			page_cache_release(pages[nr]);
+	}
+
+	pgarr->nr_used = 0;
+}
+
+/* free a single page-array object */
+static void pgarr_free_one(struct ckpt_pgarr *pgarr)
+{
+	pgarr_release_pages(pgarr);
+	kfree(pgarr->pages);
+	kfree(pgarr->vaddrs);
+	kfree(pgarr);
+}
+
+/* free the chains of page-arrays (populated and empty pool) */
+void ckpt_pgarr_free(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr, *tmp;
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+		list_del(&pgarr->list);
+		pgarr_free_one(pgarr);
+	}
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_pool, list) {
+		list_del(&pgarr->list);
+		pgarr_free_one(pgarr);
+	}
+}
+
+/* allocate a single page-array object */
+static struct ckpt_pgarr *pgarr_alloc_one(unsigned long flags)
+{
+	struct ckpt_pgarr *pgarr;
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+	pgarr->vaddrs = kmalloc(CKPT_PGARR_TOTAL * sizeof(unsigned long),
+				GFP_KERNEL);
+	if (!pgarr->vaddrs)
+		goto nomem;
+
+	/* pgarr->pages is needed only for checkpoint */
+	if (flags & CKPT_CTX_CHECKPOINT) {
+		pgarr->pages = kmalloc(CKPT_PGARR_TOTAL *
+				       sizeof(struct page *), GFP_KERNEL);
+		if (!pgarr->pages)
+			goto nomem;
+	}
+
+	return pgarr;
+ nomem:
+	pgarr_free_one(pgarr);
+	return NULL;
+}
+
+/* pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the head of the chain if it still has room. Otherwise takes
+ * an empty page-array from the pool, or allocates one if the pool is
+ * empty, and pushes it onto the front of the chain (NULL on failure).
+ */
+static struct ckpt_pgarr *pgarr_current(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	pgarr = pgarr_first(ctx);
+	if (pgarr && !pgarr_is_full(pgarr))
+		return pgarr;
+
+	pgarr = pgarr_from_pool(ctx);
+	if (!pgarr)
+		pgarr = pgarr_alloc_one(ctx->kflags);
+	if (!pgarr)
+		return NULL;
+
+	list_add(&pgarr->list, &ctx->pgarr_list);
+	return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+static void pgarr_reset_all(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list)
+		pgarr_release_pages(pgarr);
+	list_splice_init(&ctx->pgarr_list, &ctx->pgarr_pool);
+}
+
+/**************************************************************************
+ * Checkpoint
+ *
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * consider_private_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that corresponds to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ */
+static struct page *consider_private_page(struct vm_area_struct *vma,
+					  unsigned long addr)
+{
+	struct page *page;
+
+	/*
+	 * simplified version of get_user_pages(): already have vma,
+	 * only need FOLL_ANON, and (for now) ignore fault stats.
+	 *
+	 * follow_page() will return NULL if the page is not present
+	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+	 * the actual page pointer otherwise.
+	 *
+	 * FIXME: consolidate with get_user_pages()
+	 */
+
+	cond_resched();
+	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+		int ret;
+
+		/* the page is swapped out - bring it in (optimize ?) */
+		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+		if (ret & VM_FAULT_ERROR) {
+			if (ret & VM_FAULT_OOM)
+				return ERR_PTR(-ENOMEM);
+			else if (ret & VM_FAULT_SIGBUS)
+				return ERR_PTR(-EFAULT);
+			else
+				BUG();
+			break;
+		}
+		cond_resched();
+	}
+
+	if (IS_ERR(page))
+		return page;
+
+	/*
+	 * Only care about dirty pages: either anonymous non-zero pages,
+	 * or file-backed COW (copy-on-write) pages that were modified.
+	 * A clean COW page is not interesting because its contents are
+	 * identical to the backing file; ignore such pages.
+	 * A file-backed broken COW is identified by its page_mapping()
+	 * being unset (NULL) because the page will no longer be mapped
+	 * to the original file after having been modified.
+	 */
+	if (page == ZERO_PAGE(0)) {
+		/* this is the zero page: ignore */
+		page_cache_release(page);
+		page = NULL;
+	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
+		/* file backed clean cow: ignore */
+		page_cache_release(page);
+		page = NULL;
+	}
+
+	return page;
+}
+
+/**
+ * vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected
+ */
+static int vma_fill_pgarr(struct ckpt_ctx *ctx,
+			  struct vm_area_struct *vma,
+			  unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	struct ckpt_pgarr *pgarr;
+	int nr_used;
+	int cnt = 0;
+
+	/* this function is only for private memory (anon or file-mapped) */
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	do {
+		pgarr = pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+
+		nr_used = pgarr->nr_used;
+
+		while (addr < end) {
+			struct page *page;
+
+			page = consider_private_page(vma, addr);
+			if (IS_ERR(page))
+				return PTR_ERR(page);
+
+			if (page) {
+				_ckpt_debug(CKPT_DPAGE,
+					    "got page %#lx\n", addr);
+				pgarr->pages[pgarr->nr_used] = page;
+				pgarr->vaddrs[pgarr->nr_used] = addr;
+				pgarr->nr_used++;
+			}
+
+			addr += PAGE_SIZE;
+
+			if (pgarr_is_full(pgarr))
+				break;
+		}
+
+		cnt += pgarr->nr_used - nr_used;
+
+	} while ((cnt < CKPT_PGARR_BATCH) && (addr < end));
+
+	*start = addr;
+	return cnt;
+}
+
+/* dump contents of a page: use kmap_atomic() to avoid TLB flush */
+static int checkpoint_dump_page(struct ckpt_ctx *ctx,
+				struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return ckpt_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
+{
+	struct ckpt_pgarr *pgarr;
+	void *buf;
+	int i, ret = 0;
+
+	if (!total)
+		return 0;
+
+	i = total * (sizeof(unsigned long) + PAGE_SIZE);
+	ret = ckpt_write_obj_type(ctx, NULL, i, CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		ret = ckpt_kwrite(ctx, pgarr->vaddrs,
+				  pgarr->nr_used * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = (void *) __get_free_page(GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		for (i = 0; i < pgarr->nr_used; i++) {
+			ret = checkpoint_dump_page(ctx, pgarr->pages[i], buf);
+			if (ret < 0)
+				goto out;
+		}
+	}
+ out:
+	free_page((unsigned long) buf);
+	return ret;
+}
+
+/**
+ * checkpoint_memory_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect the list of pages that need to be dumped, and the corresponding
+ * virtual addresses, into the ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma)
+{
+	struct ckpt_hdr_pgarr *h;
+	unsigned long addr, end;
+	int cnt, ret;
+
+	addr = vma->vm_start;
+	end = vma->vm_end;
+
+	/*
+	 * Work iteratively, collecting and dumping at most CKPT_PGARR_BATCH
+	 * pages in each round. Each iteration is divided into two steps:
+	 *
+	 * (1) scan: scan through the PTEs of the vma to collect the pages
+	 * to dump (later we'll also make them COW), while keeping a list
+	 * of pages and their corresponding addresses on ctx->pgarr_list.
+	 *
+	 * (2) dump: write out a header specifying how many pages, followed
+	 * by the addresses of all pages in ctx->pgarr_list, followed by
+	 * the actual contents of all pages. (Then, release the references
+	 * to the pages and reset the page-array chain).
+	 *
+	 * (This split makes the logic simpler by first counting the pages
+	 * that need saving. More importantly, it allows for a future
+	 * optimization that will reduce application downtime by deferring
+	 * the actual write-out of the data to after the application is
+	 * allowed to resume execution).
+	 *
+	 * After dumping the entire contents, conclude with a header that
+	 * specifies 0 pages to mark the end of the contents.
+	 */
+
+	while (addr < end) {
+		cnt = vma_fill_pgarr(ctx, vma, &addr);
+		if (cnt == 0)
+			break;
+		else if (cnt < 0)
+			return cnt;
+
+		ckpt_debug("collected %d pages\n", cnt);
+
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+		if (!h)
+			return -ENOMEM;
+
+		h->nr_pages = cnt;
+		ret = ckpt_write_obj(ctx, &h->h);
+		ckpt_hdr_put(ctx, h);
+		if (ret < 0)
+			return ret;
+
+		ret = vma_dump_pages(ctx, cnt);
+		if (ret < 0)
+			return ret;
+
+		pgarr_reset_all(ctx);
+	}
+
+	/* mark end of contents with header saying "0" pages */
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+	if (!h)
+		return -ENOMEM;
+	h->nr_pages = 0;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**
+ * generic_vma_checkpoint - dump metadata of vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @vma_objref: vma objref
+ */
+int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
+			   enum vma_type type, int vma_objref)
+{
+	struct ckpt_hdr_vma *h;
+	int ret;
+
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d\n",
+		 vma->vm_start, vma->vm_end, vma->vm_flags, type);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+	if (!h)
+		return -ENOMEM;
+
+	h->vma_type = type;
+	h->vma_objref = vma_objref;
+	h->vm_start = vma->vm_start;
+	h->vm_end = vma->vm_end;
+	h->vm_page_prot = pgprot_val(vma->vm_page_prot);
+	h->vm_flags = vma->vm_flags;
+	h->vm_pgoff = vma->vm_pgoff;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**
+ * private_vma_checkpoint - dump contents of private (anon, file) vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @vma_objref: vma objref
+ */
+int private_vma_checkpoint(struct ckpt_ctx *ctx,
+			   struct vm_area_struct *vma,
+			   enum vma_type type, int vma_objref)
+{
+	int ret;
+
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, vma);
+ out:
+	return ret;
+}
+
+/**
+ * anonymous_checkpoint - dump contents of private-anonymous vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ */
+static int anonymous_checkpoint(struct ckpt_ctx *ctx,
+				struct vm_area_struct *vma)
+{
+	/* should be private anonymous ... verify that this is the case */
+	BUG_ON(vma->vm_flags & VM_MAYSHARE);
+	BUG_ON(vma->vm_file);
+
+	return private_vma_checkpoint(ctx, vma, CKPT_VMA_ANON, 0);
+}
+
+static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm *h;
+	struct vm_area_struct *vma;
+	int exe_objref = 0;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM);
+	if (!h)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+
+	h->flags = mm->flags;
+	h->def_flags = mm->def_flags;
+
+	h->start_code = mm->start_code;
+	h->end_code = mm->end_code;
+	h->start_data = mm->start_data;
+	h->end_data = mm->end_data;
+	h->start_brk = mm->start_brk;
+	h->brk = mm->brk;
+	h->start_stack = mm->start_stack;
+	h->arg_start = mm->arg_start;
+	h->arg_end = mm->arg_end;
+	h->env_start = mm->env_start;
+	h->env_end = mm->env_end;
+
+	h->map_count = mm->map_count;
+
+	/* checkpoint the ->exe_file */
+	if (mm->exe_file) {
+		exe_objref = checkpoint_obj(ctx, mm->exe_file, CKPT_OBJ_FILE);
+		if (exe_objref < 0) {
+			ret = exe_objref;
+			goto out;
+		}
+		h->exe_objref = exe_objref;
+	}
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, mm->saved_auxv, sizeof(mm->saved_auxv));
+	if (ret < 0)
+		goto out;
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ckpt_debug("vma %#lx-%#lx flags %#lx\n",
+			 vma->vm_start, vma->vm_end, vma->vm_flags);
+		if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+			ret = -ENOSYS;
+			ckpt_write_err(ctx, "TE", "vma: bad flags (%#lx)\n",
+				       ret, vma->vm_flags);
+			goto out;
+		}
+		if (!vma->vm_ops)
+			ret = anonymous_checkpoint(ctx, vma);
+		else if (vma->vm_ops->checkpoint)
+			ret = (*vma->vm_ops->checkpoint)(ctx, vma);
+		else
+			ret = -ENOSYS;
+		if (ret < 0) {
+			ckpt_write_err(ctx, "TE", "vma: failed", ret);
+			goto out;
+		}
+		/*
+		 * The file was collected, but not always checkpointed;
+		 * be safe and mark as visited to appease leak detection
+		 */
+		if (vma->vm_file && !(ctx->uflags & CHECKPOINT_SUBTREE)) {
+			ret = ckpt_obj_visit(ctx, vma->vm_file, CKPT_OBJ_FILE);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+	ret = checkpoint_mm_context(ctx, mm);
+ out:
+	ckpt_hdr_put(ctx, h);
+	up_read(&mm->mmap_sem);
+	return ret;
+}
+
+int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_mm(ctx, (struct mm_struct *) ptr);
+}
+
+int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct mm_struct *mm;
+	int objref;
+
+	mm = get_task_mm(t);
+	objref = checkpoint_obj(ctx, mm, CKPT_OBJ_MM);
+	mmput(mm);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+static int collect_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+	struct file *file;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, mm, CKPT_OBJ_MM);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this mm (ret > 0), proceed inside */
+	down_read(&mm->mmap_sem);
+	if (mm->exe_file) {
+		ret = ckpt_collect_file(ctx, mm->exe_file);
+		if (ret < 0) {
+			ckpt_write_err(ctx, "TE", "mm: collect exe_file", ret);
+			goto out;
+		}
+	}
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		file = vma->vm_file;
+		if (!file)
+			continue;
+		ret = ckpt_collect_file(ctx, file);
+		if (ret < 0) {
+			ckpt_write_err(ctx, "TE", "mm: collect vm_file", ret);
+			break;
+		}
+	}
+ out:
+	up_read(&mm->mmap_sem);
+	return ret;
+
+}
+
+int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct mm_struct *mm;
+	int ret;
+
+	mm = get_task_mm(t);
+	ret = collect_mm(ctx, mm);
+	mmput(mm);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index b7c8fdb..4c207c3 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -96,6 +96,22 @@ static int obj_file_users(void *ptr)
 	return atomic_long_read(&((struct file *) ptr)->f_count);
 }
 
+static int obj_mm_grab(void *ptr)
+{
+	atomic_inc(&((struct mm_struct *) ptr)->mm_users);
+	return 0;
+}
+
+static void obj_mm_drop(void *ptr, int lastref)
+{
+	mmput((struct mm_struct *) ptr);
+}
+
+static int obj_mm_users(void *ptr)
+{
+	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -124,6 +140,15 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_file,
 		.restore = restore_file,
 	},
+	/* mm object */
+	{
+		.obj_name = "MM",
+		.obj_type = CKPT_OBJ_MM,
+		.ref_drop = obj_mm_drop,
+		.ref_grab = obj_mm_grab,
+		.ref_users = obj_mm_users,
+		.checkpoint = checkpoint_mm,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 6ad9c01..3d0eb36 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -108,6 +108,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_task_objs *h;
 	int files_objref;
+	int mm_objref;
 	int ret;
 
 	files_objref = checkpoint_obj_file_table(ctx, t);
@@ -117,10 +118,18 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return files_objref;
 	}
 
+	mm_objref = checkpoint_obj_mm(ctx, t);
+	ckpt_debug("mm: objref %d\n", mm_objref);
+	if (mm_objref < 0) {
+		ckpt_write_err(ctx, "TE", "mm_struct", mm_objref);
+		return mm_objref;
+	}
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
 	if (!h)
 		return -ENOMEM;
 	h->files_objref = files_objref;
+	h->mm_objref = mm_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
 
@@ -278,6 +287,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	int ret;
 
 	ret = ckpt_collect_file_table(ctx, t);
+	if (ret < 0)
+		return ret;
+	ret = ckpt_collect_mm(ctx, t);
 
 	return ret;
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 1373ff9..1c98eee 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -199,6 +199,7 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 
 	ckpt_obj_hash_free(ctx);
 	path_put(&ctx->fs_mnt);
+	ckpt_pgarr_free(ctx);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -230,6 +231,8 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	ctx->ktime_begin = ktime_get();
 
 	atomic_set(&ctx->refcount, 0);
+	INIT_LIST_HEAD(&ctx->pgarr_list);
+	INIT_LIST_HEAD(&ctx->pgarr_pool);
 	init_waitqueue_head(&ctx->waitq);
 
 	err = -EBADF;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 026d058..70d9506 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -131,6 +131,7 @@ extern int restore_task(struct ckpt_ctx *ctx);
 extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx);
 extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
 
 extern int restore_read_header_arch(struct ckpt_ctx *ctx);
 extern int restore_thread(struct ckpt_ctx *ctx);
@@ -162,6 +163,29 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 			       struct ckpt_hdr_file *h);
 
+/* memory */
+extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
+
+extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma,
+				  enum vma_type type,
+				  int vma_objref);
+extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma,
+				  enum vma_type type,
+				  int vma_objref);
+
+extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+
+extern int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
+
+#define CKPT_VMA_NOT_SUPPORTED					\
+	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
+	 VM_NONLINEAR | VM_PFNMAP | VM_RESERVED |		\
+	 VM_NORESERVE | VM_MAPPED_COPY | VM_INSERTPAGE |	\
+	 VM_MIXEDMAP | VM_SAO)
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -173,6 +197,8 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
 #define CKPT_DFILE	0x10		/* files and filesystem */
+#define CKPT_DMEM	0x20		/* memory state */
+#define CKPT_DPAGE	0x40		/* memory pages */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 1124375..f29f87a 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -66,6 +66,11 @@ enum {
 	CKPT_HDR_FILE_NAME,
 	CKPT_HDR_FILE,
 
+	CKPT_HDR_MM = 401,
+	CKPT_HDR_VMA,
+	CKPT_HDR_PGARR,
+	CKPT_HDR_MM_CONTEXT,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -88,6 +93,7 @@ enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 	CKPT_OBJ_FILE_TABLE,
 	CKPT_OBJ_FILE,
+	CKPT_OBJ_MM,
 	CKPT_OBJ_MAX
 };
 
@@ -95,6 +101,8 @@ enum obj_type {
 struct ckpt_const {
 	/* task */
 	__u16 task_comm_len;
+	/* mm */
+	__u16 mm_saved_auxv_len;
 	/* uts */
 	__u16 uts_release_len;
 	__u16 uts_version_len;
@@ -169,6 +177,7 @@ struct ckpt_hdr_task {
 struct ckpt_hdr_task_objs {
 	struct ckpt_hdr h;
 	__s32 files_objref;
+	__s32 mm_objref;
 } __attribute__((aligned(8)));
 
 /* restart blocks */
@@ -227,4 +236,47 @@ struct ckpt_hdr_file_generic {
 	struct ckpt_hdr_file common;
 } __attribute__((aligned(8)));
 
+/* memory layout */
+struct ckpt_hdr_mm {
+	struct ckpt_hdr h;
+	__u32 map_count;
+	__s32 exe_objref;
+
+	__u64 def_flags;
+	__u64 flags;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vma_type {
+	CKPT_VMA_IGNORE = 0,
+	CKPT_VMA_VDSO,		/* special vdso vma */
+	CKPT_VMA_ANON,		/* private anonymous */
+	CKPT_VMA_FILE,		/* private mapped file */
+	CKPT_VMA_MAX
+};
+
+/* vma descriptor */
+struct ckpt_hdr_vma {
+	struct ckpt_hdr h;
+	__u32 vma_type;
+	__s32 vma_objref;	/* objref of backing file */
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+/* page array */
+struct ckpt_hdr_pgarr {
+	struct ckpt_hdr h;
+	__u64 nr_pages;		/* number of pages saved */
+} __attribute__((aligned(8)));
+
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 795742f..f214109 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -15,6 +15,8 @@
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/list.h>
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
@@ -49,6 +51,9 @@ struct ckpt_ctx {
 	char err_string[256];	/* checkpoint: error string */
 	int errno;		/* restart: errno that caused failure */
 
+	struct list_head pgarr_list;	/* page array to dump VMA contents */
+	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
+
 	/* [multi-process checkpoint] */
 	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
 	int nr_tasks;                   /* size of tasks array */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5ace89..d5f9889 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1157,6 +1157,11 @@ extern void truncate_inode_pages_range(struct address_space *,
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
 
+#ifdef CONFIG_CHECKPOINT
+/* generic vm_area_ops exported for mapped files checkpoint */
+extern int filemap_checkpoint(struct ckpt_ctx *, struct vm_area_struct *);
+#endif
+
 /* mm/page-writeback.c */
 int write_one_page(struct page *page, int wait);
 void task_dirty_inc(struct task_struct *tsk);
diff --git a/mm/filemap.c b/mm/filemap.c
index ccea3b6..72ba805 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <linux/checkpoint.h>
 #include "internal.h"
 
 /*
@@ -1648,8 +1649,32 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+#ifdef CONFIG_CHECKPOINT
+int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct file *file = vma->vm_file;
+	int vma_objref;
+
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ENOSYS;
+	}
+
+	BUG_ON(!file);
+
+	vma_objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	if (vma_objref < 0)
+		return vma_objref;
+
+	return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
+}
+#endif /* CONFIG_CHECKPOINT */
+
 struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= filemap_checkpoint,
+#endif
 };
 
 /* This is used for a general mmap of a disk file */
diff --git a/mm/mmap.c b/mm/mmap.c
index 8101de4..7415a61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -29,6 +29,7 @@
 #include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
 #include <linux/perf_counter.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2267,9 +2268,36 @@ static void special_mapping_close(struct vm_area_struct *vma)
 {
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma)
+{
+	const char *name;
+
+	/*
+	 * FIX:
+	 * Currently, we only handle the VDSO/vsyscall special mapping.
+	 * Even that is very basic - we just skip the contents and
+	 * hope for the best in terms of compatibility upon restart.
+	 */
+
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED)
+		return -ENOSYS;
+
+	name = arch_vma_name(vma);
+	if (!name || strcmp(name, "[vdso]"))
+		return -ENOSYS;
+
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
+}
+#endif /* CONFIG_CHECKPOINT */
+
 static struct vm_operations_struct special_mapping_vmops = {
 	.close = special_mapping_close,
 	.fault = special_mapping_fault,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint = special_mapping_checkpoint,
+#endif
 };
 
 /*
-- 
1.6.0.4


* [PATCH v18 37/80] c/r: dump memory address space (private memory)
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

For each vma, there is a 'struct ckpt_hdr_vma'. Then come the actual
contents, in one or more chunks: each chunk begins with a header that
specifies how many pages it holds, followed by the virtual addresses
of all the dumped pages in that chunk, followed by the actual contents
of all the dumped pages. A header with a page count of zero marks the
end of the contents. Then comes the next vma, and so on.
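
As a rough illustration of the per-vma layout described above, here is
a minimal userspace sketch (not code from this patch) that writes
chunks into a flat buffer and walks them back. All sk_* names, the
bare 32-bit header and the 4096-byte page size are assumptions made
for the example; the real image uses 'struct ckpt_hdr_pgarr' and also
precedes the addresses and pages with a buffer header, which the
sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE 4096u	/* hypothetical stand-in for PAGE_SIZE */

/* Hypothetical chunk header: only the page count is modelled. */
struct sk_chunk_hdr {
	uint32_t nr_pages;
};

/* Flat in-memory "image" standing in for the checkpoint file. */
struct sk_image {
	unsigned char *buf;
	size_t cap;
	size_t len;
};

/* Append n bytes to the image; returns 0, or -1 if they do not fit. */
static int sk_put(struct sk_image *img, const void *data, size_t n)
{
	assert(img->len <= img->cap);
	if (n == 0)
		return 0;
	if (n > img->cap - img->len)
		return -1;
	memcpy(img->buf + img->len, data, n);
	img->len += n;
	return 0;
}

/* One chunk: header, then all addresses, then all page contents. */
int sk_write_chunk(struct sk_image *img, const uint64_t *vaddrs,
		   const unsigned char *pages, uint32_t nr_pages)
{
	struct sk_chunk_hdr h = { .nr_pages = nr_pages };

	if (sk_put(img, &h, sizeof(h)) < 0)
		return -1;
	if (sk_put(img, vaddrs, nr_pages * sizeof(*vaddrs)) < 0)
		return -1;
	return sk_put(img, pages, (size_t)nr_pages * SK_PAGE_SIZE);
}

/* A header with zero pages ends the contents of one vma. */
int sk_write_end(struct sk_image *img)
{
	return sk_write_chunk(img, NULL, NULL, 0);
}

/* Walk the chunks of one vma; returns total pages, -1 if malformed. */
long sk_count_pages(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	long total = 0;

	for (;;) {
		struct sk_chunk_hdr h;
		size_t body;

		if (len - off < sizeof(h))
			return -1;
		memcpy(&h, buf + off, sizeof(h));
		off += sizeof(h);
		if (h.nr_pages == 0)
			return total;	/* terminator reached */
		body = (size_t)h.nr_pages * (sizeof(uint64_t) + SK_PAGE_SIZE);
		if (len - off < body)
			return -1;
		off += body;
		total += h.nr_pages;
	}
}
```

A reader in this style never needs to know the vma size in advance: it
keeps consuming chunks until it meets the zero-page header, and treats
a truncated buffer as an error.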

To checkpoint a vma, call the ops->checkpoint() method of that vma.
Normally the per-vma function will invoke generic_vma_checkpoint()
which first writes the vma description, followed by the specific
logic to dump the contents of the pages.
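
The dispatch described above can be modelled in userspace roughly as
follows. This is only a sketch under stated assumptions: the sk_*
types and functions are hypothetical stand-ins for the vma, its
vm_ops and generic_vma_checkpoint(), and "writing" is reduced to
recording the vma type. It mirrors the three cases handled in
do_checkpoint_mm(): no ops (anonymous), ops with a checkpoint method,
and ops without one (-ENOSYS).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical vma subtypes, loosely modelled on enum vma_type. */
enum sk_vma_type { SK_VMA_ANON = 1, SK_VMA_FILE };

struct sk_vma;

/* Hypothetical per-vma operations with an optional checkpoint hook. */
struct sk_vm_ops {
	int (*checkpoint)(struct sk_vma *vma);
};

struct sk_vma {
	unsigned long start, end;
	const struct sk_vm_ops *ops;	/* NULL models an anonymous vma */
	int dumped_type;		/* recorded by the sketch "writer" */
};

/* Stand-in for writing the vma description to the image. */
static int sk_generic_vma_checkpoint(struct sk_vma *vma,
				     enum sk_vma_type type)
{
	assert(vma->start < vma->end);
	vma->dumped_type = type;
	return 0;
}

/* Per-vma method for a private file mapping: description first;
 * the page contents would follow here in the real code. */
int sk_file_checkpoint(struct sk_vma *vma)
{
	return sk_generic_vma_checkpoint(vma, SK_VMA_FILE);
}

const struct sk_vm_ops sk_file_ops = { .checkpoint = sk_file_checkpoint };
const struct sk_vm_ops sk_no_ckpt_ops = { .checkpoint = NULL };

/* Dispatch: anonymous default, per-vma method, or unsupported. */
int sk_checkpoint_vma(struct sk_vma *vma)
{
	if (!vma->ops)
		return sk_generic_vma_checkpoint(vma, SK_VMA_ANON);
	if (vma->ops->checkpoint)
		return vma->ops->checkpoint(vma);
	return -ENOSYS;
}
```

Keeping the decision in a per-vma method means a mapping type opts in
to checkpoint explicitly; anything that sets vm_ops but provides no
method is refused rather than dumped incorrectly.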

Currently for private mapped memory we save the pathname of the file
that is mapped (restart will use it to re-open it and then map it).
Later we change that to reference a file object.

Changelog[v18]:
  - Tighten checks on supported vma to checkpoint or restart
  - Add a few more ckpt_write_err()s
  - [Serge Hallyn] Export filemap_checkpoint() (used later for ext4)
  - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
  - In collect_mm() use retval from ckpt_obj_collect() to test for
    first-time-object
Changelog[v17]:
  - Only collect sub-objects of mm_struct once
  - Save mm->{flags,def_flags,saved_auxv}
Changelog[v16]:
  - Precede vaddrs/pages with a buffer header
  - Checkpoint mm->exe_file
  - Handle shared task->mm
Changelog[v14]:
  - Modify the ops->checkpoint method to be much more powerful
  - Improve support for VDSO (with special_mapping checkpoint callback)
  - Save new field 'vdso' in mm_context
  - Revert change to pr_debug(), back to ckpt_debug()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'
Changelog[v13]:
  - pgprot_t is an abstract type; use the proper accessor (fix for
    64-bit powerpc, Nathan Lynch <ntl@pobox.com>)
Changelog[v12]:
  - Hide pgarr management inside ckpt_private_vma_fill_pgarr()
  - Fix management of pgarr chain reset and alloc/expand: keep empty
    pgarr in a pool chain
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory
Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in ckpt_fill_name()
Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem
    namespace boundary). For now ckpt_fill_fname() fails the checkpoint.
Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)
Changelog[v5]:
  - Improve memory dump code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
Changelog[v4]:
  - Use standard list_... for ckpt_pgarr

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/include/asm/checkpoint_hdr.h |    8 +
 arch/x86/mm/checkpoint.c              |   31 ++
 checkpoint/Makefile                   |    3 +-
 checkpoint/checkpoint.c               |    3 +
 checkpoint/memory.c                   |  697 +++++++++++++++++++++++++++++++++
 checkpoint/objhash.c                  |   25 ++
 checkpoint/process.c                  |   12 +
 checkpoint/sys.c                      |    3 +
 include/linux/checkpoint.h            |   26 ++
 include/linux/checkpoint_hdr.h        |   52 +++
 include/linux/checkpoint_types.h      |    5 +
 include/linux/mm.h                    |    5 +
 mm/filemap.c                          |   25 ++
 mm/mmap.c                             |   28 ++
 14 files changed, 922 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/memory.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index f4d1e14..0e756b0 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -45,6 +45,7 @@
 /* arch dependent header types */
 enum {
 	CKPT_HDR_CPU_FPU = 201,
+	CKPT_HDR_MM_CONTEXT_LDT,
 };
 
 struct ckpt_hdr_header_arch {
@@ -118,4 +119,11 @@ struct ckpt_hdr_cpu {
 #define CKPT_X86_SEG_TLS	0x4000	/* 0100 0000 0000 00xx */
 #define CKPT_X86_SEG_LDT	0x8000	/* 100x xxxx xxxx xxxx */
 
+struct ckpt_hdr_mm_context {
+	struct ckpt_hdr h;
+	__u64 vdso;
+	__u32 ldt_entry_size;
+	__u32 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 023039f..83ed047 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -324,6 +324,37 @@ int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (!h)
+		return -ENOMEM;
+
+	mutex_lock(&mm->context.lock);
+
+	h->vdso = (unsigned long) mm->context.vdso;
+	h->ldt_entry_size = LDT_ENTRY_SIZE;
+	h->nldt = mm->context.size;
+
+	ckpt_debug("nldt %d vdso %#llx\n", h->nldt, h->vdso);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj_type(ctx, mm->context.ldt,
+				  mm->context.size * LDT_ENTRY_SIZE,
+				  CKPT_HDR_MM_CONTEXT_LDT);
+ out:
+	mutex_unlock(&mm->context.lock);
+	return ret;
+}
+
 /**************************************************************************
  * Restart
  */
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 1d0c058..f56a7d6 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -8,4 +8,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	checkpoint.o \
 	restart.o \
 	process.o \
-	files.o
+	files.o \
+	memory.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 4cc2a2f..f907485 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -280,10 +280,13 @@ int ckpt_write_err(struct ckpt_ctx *ctx, char *pre, char *fmt, ...)
 static void fill_kernel_const(struct ckpt_const *h)
 {
 	struct task_struct *tsk;
+	struct mm_struct *mm;
 	struct new_utsname *uts;
 
 	/* task */
 	h->task_comm_len = sizeof(tsk->comm);
+	/* mm */
+	h->mm_saved_auxv_len = sizeof(mm->saved_auxv);
 	/* uts */
 	h->uts_release_len = sizeof(uts->release);
 	h->uts_version_len = sizeof(uts->version);
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
new file mode 100644
index 0000000..d19d627
--- /dev/null
+++ b/checkpoint/memory.c
@@ -0,0 +1,697 @@
+/*
+ *  Checkpoint/restart memory contents
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DMEM
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/proc_fs.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * page-array chains: each ckpt_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct ckpt_pgarr {
+	unsigned long *vaddrs;
+	struct page **pages;
+	unsigned int nr_used;
+	struct list_head list;
+};
+
+#define CKPT_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
+#define CKPT_PGARR_BATCH  (16 * CKPT_PGARR_TOTAL)
+
+static inline int pgarr_is_full(struct ckpt_pgarr *pgarr)
+{
+	return (pgarr->nr_used == CKPT_PGARR_TOTAL);
+}
+
+static inline int pgarr_nr_free(struct ckpt_pgarr *pgarr)
+{
+	return CKPT_PGARR_TOTAL - pgarr->nr_used;
+}
+
+/*
+ * utilities to alloc, free, and handle 'struct ckpt_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has two members for page-arrays:
+ *   ctx->pgarr_list: list head of populated page-array chain
+ *   ctx->pgarr_pool: list head of empty page-array pool chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * Before the next chunk of pages, the chain is reset (by dereferencing
+ * all pages) but not freed; instead, empty descriptors are kept in a pool.
+ *
+ * The head of the chain page-array ("current") advances as necessary. When
+ * it gets full, a new page-array descriptor is pushed in front of it. The
+ * new descriptor is taken from first empty descriptor (if one exists, for
+ * instance, after a chain reset), or allocated on-demand.
+ *
+ * When dumping the data, the chain is traversed in reverse order.
+ */
+
+/* return first page-array in the chain */
+static inline struct ckpt_pgarr *pgarr_first(struct ckpt_ctx *ctx)
+{
+	if (list_empty(&ctx->pgarr_list))
+		return NULL;
+	return list_first_entry(&ctx->pgarr_list, struct ckpt_pgarr, list);
+}
+
+/* return (and detach) first empty page-array in the pool, if exists */
+static inline struct ckpt_pgarr *pgarr_from_pool(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	if (list_empty(&ctx->pgarr_pool))
+		return NULL;
+	pgarr = list_first_entry(&ctx->pgarr_pool, struct ckpt_pgarr, list);
+	list_del(&pgarr->list);
+	return pgarr;
+}
+
+/* release pages referenced by a page-array */
+static void pgarr_release_pages(struct ckpt_pgarr *pgarr)
+{
+	ckpt_debug("total pages %d\n", pgarr->nr_used);
+	/*
+	 * both checkpoint and restart use 'nr_used', however we only
+	 * collect pages during checkpoint; in restart we simply return
+	 * because pgarr->pages remains NULL.
+	 */
+	if (pgarr->pages) {
+		struct page **pages = pgarr->pages;
+		int nr = pgarr->nr_used;
+
+		while (nr--)
+			page_cache_release(pages[nr]);
+	}
+
+	pgarr->nr_used = 0;
+}
+
+/* free a single page-array object */
+static void pgarr_free_one(struct ckpt_pgarr *pgarr)
+{
+	pgarr_release_pages(pgarr);
+	kfree(pgarr->pages);
+	kfree(pgarr->vaddrs);
+	kfree(pgarr);
+}
+
+/* free the chains of page-arrays (populated and empty pool) */
+void ckpt_pgarr_free(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr, *tmp;
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+		list_del(&pgarr->list);
+		pgarr_free_one(pgarr);
+	}
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_pool, list) {
+		list_del(&pgarr->list);
+		pgarr_free_one(pgarr);
+	}
+}
+
+/* allocate a single page-array object */
+static struct ckpt_pgarr *pgarr_alloc_one(unsigned long flags)
+{
+	struct ckpt_pgarr *pgarr;
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+	pgarr->vaddrs = kmalloc(CKPT_PGARR_TOTAL * sizeof(unsigned long),
+				GFP_KERNEL);
+	if (!pgarr->vaddrs)
+		goto nomem;
+
+	/* pgarr->pages is needed only for checkpoint */
+	if (flags & CKPT_CTX_CHECKPOINT) {
+		pgarr->pages = kmalloc(CKPT_PGARR_TOTAL *
+				       sizeof(struct page *), GFP_KERNEL);
+		if (!pgarr->pages)
+			goto nomem;
+	}
+
+	return pgarr;
+ nomem:
+	pgarr_free_one(pgarr);
+	return NULL;
+}
+
+/* pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the first page-array in the chain if it still has space.
+ * Otherwise, takes an empty page-array from the pool (or allocates a new
+ * one if the pool is empty) and pushes it to the front of the chain.
+ */
+static struct ckpt_pgarr *pgarr_current(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	pgarr = pgarr_first(ctx);
+	if (pgarr && !pgarr_is_full(pgarr))
+		return pgarr;
+
+	pgarr = pgarr_from_pool(ctx);
+	if (!pgarr)
+		pgarr = pgarr_alloc_one(ctx->kflags);
+	if (!pgarr)
+		return NULL;
+
+	list_add(&pgarr->list, &ctx->pgarr_list);
+	return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+static void pgarr_reset_all(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list)
+		pgarr_release_pages(pgarr);
+	list_splice_init(&ctx->pgarr_list, &ctx->pgarr_pool);
+}
+
+/**************************************************************************
+ * Checkpoint
+ *
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * consider_private_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that corresponds to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ */
+static struct page *consider_private_page(struct vm_area_struct *vma,
+					  unsigned long addr)
+{
+	struct page *page;
+
+	/*
+	 * simplified version of get_user_pages(): already have vma,
+	 * only need FOLL_ANON, and (for now) ignore fault stats.
+	 *
+	 * follow_page() will return NULL if the page is not present
+	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+	 * the actual page pointer otherwise.
+	 *
+	 * FIXME: consolidate with get_user_pages()
+	 */
+
+	cond_resched();
+	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+		int ret;
+
+		/* the page is swapped out - bring it in (optimize ?) */
+		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+		if (ret & VM_FAULT_ERROR) {
+			if (ret & VM_FAULT_OOM)
+				return ERR_PTR(-ENOMEM);
+			else if (ret & VM_FAULT_SIGBUS)
+				return ERR_PTR(-EFAULT);
+			else
+				BUG();
+			break;
+		}
+		cond_resched();
+	}
+
+	if (IS_ERR(page))
+		return page;
+
+	/*
+	 * Only care about dirty pages: either anonymous non-zero pages,
+	 * or file-backed COW (copy-on-write) pages that were modified.
+	 * A clean COW page is not interesting because its contents are
+	 * identical to the backing file; ignore such pages.
+	 * A file-backed broken COW is identified by its page_mapping()
+	 * being unset (NULL) because the page will no longer be mapped
+	 * to the original file after having been modified.
+	 */
+	if (page == ZERO_PAGE(0)) {
+		/* this is the zero page: ignore */
+		page_cache_release(page);
+		page = NULL;
+	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
+		/* file backed clean cow: ignore */
+		page_cache_release(page);
+		page = NULL;
+	}
+
+	return page;
+}
+
+/**
+ * vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected, or a negative error code
+ */
+static int vma_fill_pgarr(struct ckpt_ctx *ctx,
+			  struct vm_area_struct *vma,
+			  unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	struct ckpt_pgarr *pgarr;
+	int nr_used;
+	int cnt = 0;
+
+	/* this function is only for private memory (anon or file-mapped) */
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	do {
+		pgarr = pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+
+		nr_used = pgarr->nr_used;
+
+		while (addr < end) {
+			struct page *page;
+
+			page = consider_private_page(vma, addr);
+			if (IS_ERR(page))
+				return PTR_ERR(page);
+
+			if (page) {
+				_ckpt_debug(CKPT_DPAGE,
+					    "got page %#lx\n", addr);
+				pgarr->pages[pgarr->nr_used] = page;
+				pgarr->vaddrs[pgarr->nr_used] = addr;
+				pgarr->nr_used++;
+			}
+
+			addr += PAGE_SIZE;
+
+			if (pgarr_is_full(pgarr))
+				break;
+		}
+
+		cnt += pgarr->nr_used - nr_used;
+
+	} while ((cnt < CKPT_PGARR_BATCH) && (addr < end));
+
+	*start = addr;
+	return cnt;
+}
+
+/* dump contents of a page: use kmap_atomic() to avoid TLB flush */
+static int checkpoint_dump_page(struct ckpt_ctx *ctx,
+				struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return ckpt_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
+{
+	struct ckpt_pgarr *pgarr;
+	void *buf;
+	int i, ret = 0;
+
+	if (!total)
+		return 0;
+
+	i = total * (sizeof(unsigned long) + PAGE_SIZE);
+	ret = ckpt_write_obj_type(ctx, NULL, i, CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		ret = ckpt_kwrite(ctx, pgarr->vaddrs,
+				  pgarr->nr_used * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = (void *) __get_free_page(GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		for (i = 0; i < pgarr->nr_used; i++) {
+			ret = checkpoint_dump_page(ctx, pgarr->pages[i], buf);
+			if (ret < 0)
+				goto out;
+		}
+	}
+ out:
+	free_page((unsigned long) buf);
+	return ret;
+}
+
+/**
+ * checkpoint_memory_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect lists of pages that need to be dumped, and corresponding
+ * virtual addresses into ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma)
+{
+	struct ckpt_hdr_pgarr *h;
+	unsigned long addr, end;
+	int cnt, ret;
+
+	addr = vma->vm_start;
+	end = vma->vm_end;
+
+	/*
+	 * Work iteratively, collecting and dumping at most CKPT_PGARR_BATCH
+	 * pages in each round. Each iteration is divided into two steps:
+	 *
+	 * (1) scan: scan through the PTEs of the vma to collect the pages
+	 * to dump (later we'll also make them COW), while keeping a list
+	 * of pages and their corresponding addresses on ctx->pgarr_list.
+	 *
+	 * (2) dump: write out a header specifying how many pages, followed
+	 * by the addresses of all pages in ctx->pgarr_list, followed by
+	 * the actual contents of all pages. (Then, release the references
+	 * to the pages and reset the page-array chain).
+	 *
+	 * (This split makes the logic simpler by first counting the pages
+	 * that need saving. More importantly, it allows for a future
+	 * optimization that will reduce application downtime by deferring
+	 * the actual write-out of the data to after the application is
+	 * allowed to resume execution).
+	 *
+	 * After dumping the entire contents, conclude with a header that
+	 * specifies 0 pages to mark the end of the contents.
+	 */
+
+	while (addr < end) {
+		cnt = vma_fill_pgarr(ctx, vma, &addr);
+		if (cnt == 0)
+			break;
+		else if (cnt < 0)
+			return cnt;
+
+		ckpt_debug("collected %d pages\n", cnt);
+
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+		if (!h)
+			return -ENOMEM;
+
+		h->nr_pages = cnt;
+		ret = ckpt_write_obj(ctx, &h->h);
+		ckpt_hdr_put(ctx, h);
+		if (ret < 0)
+			return ret;
+
+		ret = vma_dump_pages(ctx, cnt);
+		if (ret < 0)
+			return ret;
+
+		pgarr_reset_all(ctx);
+	}
+
+	/* mark end of contents with header saying "0" pages */
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+	if (!h)
+		return -ENOMEM;
+	h->nr_pages = 0;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**
+ * generic_vma_checkpoint - dump metadata of vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @vma_objref: vma objref
+ */
+int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
+			   enum vma_type type, int vma_objref)
+{
+	struct ckpt_hdr_vma *h;
+	int ret;
+
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d\n",
+		 vma->vm_start, vma->vm_end, vma->vm_flags, type);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+	if (!h)
+		return -ENOMEM;
+
+	h->vma_type = type;
+	h->vma_objref = vma_objref;
+	h->vm_start = vma->vm_start;
+	h->vm_end = vma->vm_end;
+	h->vm_page_prot = pgprot_val(vma->vm_page_prot);
+	h->vm_flags = vma->vm_flags;
+	h->vm_pgoff = vma->vm_pgoff;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**
+ * private_vma_checkpoint - dump contents of private (anon, file) vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @vma_objref: vma objref
+ */
+int private_vma_checkpoint(struct ckpt_ctx *ctx,
+			   struct vm_area_struct *vma,
+			   enum vma_type type, int vma_objref)
+{
+	int ret;
+
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, vma);
+ out:
+	return ret;
+}
+
+/**
+ * anonymous_checkpoint - dump contents of private-anonymous vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ */
+static int anonymous_checkpoint(struct ckpt_ctx *ctx,
+				struct vm_area_struct *vma)
+{
+	/* should be private anonymous ... verify that this is the case */
+	BUG_ON(vma->vm_flags & VM_MAYSHARE);
+	BUG_ON(vma->vm_file);
+
+	return private_vma_checkpoint(ctx, vma, CKPT_VMA_ANON, 0);
+}
+
+static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm *h;
+	struct vm_area_struct *vma;
+	int exe_objref = 0;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM);
+	if (!h)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+
+	h->flags = mm->flags;
+	h->def_flags = mm->def_flags;
+
+	h->start_code = mm->start_code;
+	h->end_code = mm->end_code;
+	h->start_data = mm->start_data;
+	h->end_data = mm->end_data;
+	h->start_brk = mm->start_brk;
+	h->brk = mm->brk;
+	h->start_stack = mm->start_stack;
+	h->arg_start = mm->arg_start;
+	h->arg_end = mm->arg_end;
+	h->env_start = mm->env_start;
+	h->env_end = mm->env_end;
+
+	h->map_count = mm->map_count;
+
+	/* checkpoint the ->exe_file */
+	if (mm->exe_file) {
+		exe_objref = checkpoint_obj(ctx, mm->exe_file, CKPT_OBJ_FILE);
+		if (exe_objref < 0) {
+			ret = exe_objref;
+			goto out;
+		}
+		h->exe_objref = exe_objref;
+	}
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, mm->saved_auxv, sizeof(mm->saved_auxv));
+	if (ret < 0)
+		goto out;
+
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ckpt_debug("vma %#lx-%#lx flags %#lx\n",
+			 vma->vm_start, vma->vm_end, vma->vm_flags);
+		if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+			ckpt_write_err(ctx, "TE", "vma: bad flags (%#lx)\n",
+				       -ENOSYS, vma->vm_flags);
+			ret = -ENOSYS;
+			goto out;
+		}
+		if (!vma->vm_ops)
+			ret = anonymous_checkpoint(ctx, vma);
+		else if (vma->vm_ops->checkpoint)
+			ret = (*vma->vm_ops->checkpoint)(ctx, vma);
+		else
+			ret = -ENOSYS;
+		if (ret < 0) {
+			ckpt_write_err(ctx, "TE", "vma: failed", ret);
+			goto out;
+		}
+		/*
+		 * The file was collected, but not always checkpointed;
+		 * be safe and mark as visited to appease leak detection
+		 */
+		if (vma->vm_file && !(ctx->uflags & CHECKPOINT_SUBTREE)) {
+			ret = ckpt_obj_visit(ctx, vma->vm_file, CKPT_OBJ_FILE);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+	ret = checkpoint_mm_context(ctx, mm);
+ out:
+	ckpt_hdr_put(ctx, h);
+	up_read(&mm->mmap_sem);
+	return ret;
+}
+
+int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_mm(ctx, (struct mm_struct *) ptr);
+}
+
+int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct mm_struct *mm;
+	int objref;
+
+	mm = get_task_mm(t);
+	objref = checkpoint_obj(ctx, mm, CKPT_OBJ_MM);
+	mmput(mm);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+static int collect_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+	struct file *file;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, mm, CKPT_OBJ_MM);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this mm (ret > 0), proceed inside */
+	down_read(&mm->mmap_sem);
+	if (mm->exe_file) {
+		ret = ckpt_collect_file(ctx, mm->exe_file);
+		if (ret < 0) {
+			ckpt_write_err(ctx, "TE", "mm: collect exe_file", ret);
+			goto out;
+		}
+	}
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		file = vma->vm_file;
+		if (!file)
+			continue;
+		ret = ckpt_collect_file(ctx, file);
+		if (ret < 0) {
+			ckpt_write_err(ctx, "TE", "mm: collect vm_file", ret);
+			break;
+		}
+	}
+ out:
+	up_read(&mm->mmap_sem);
+	return ret;
+
+}
+
+int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct mm_struct *mm;
+	int ret;
+
+	mm = get_task_mm(t);
+	ret = collect_mm(ctx, mm);
+	mmput(mm);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index b7c8fdb..4c207c3 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -96,6 +96,22 @@ static int obj_file_users(void *ptr)
 	return atomic_long_read(&((struct file *) ptr)->f_count);
 }
 
+static int obj_mm_grab(void *ptr)
+{
+	atomic_inc(&((struct mm_struct *) ptr)->mm_users);
+	return 0;
+}
+
+static void obj_mm_drop(void *ptr, int lastref)
+{
+	mmput((struct mm_struct *) ptr);
+}
+
+static int obj_mm_users(void *ptr)
+{
+	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -124,6 +140,15 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_file,
 		.restore = restore_file,
 	},
+	/* mm object */
+	{
+		.obj_name = "MM",
+		.obj_type = CKPT_OBJ_MM,
+		.ref_drop = obj_mm_drop,
+		.ref_grab = obj_mm_grab,
+		.ref_users = obj_mm_users,
+		.checkpoint = checkpoint_mm,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 6ad9c01..3d0eb36 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -108,6 +108,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_task_objs *h;
 	int files_objref;
+	int mm_objref;
 	int ret;
 
 	files_objref = checkpoint_obj_file_table(ctx, t);
@@ -117,10 +118,18 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return files_objref;
 	}
 
+	mm_objref = checkpoint_obj_mm(ctx, t);
+	ckpt_debug("mm: objref %d\n", mm_objref);
+	if (mm_objref < 0) {
+		ckpt_write_err(ctx, "TE", "mm_struct", mm_objref);
+		return mm_objref;
+	}
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
 	if (!h)
 		return -ENOMEM;
 	h->files_objref = files_objref;
+	h->mm_objref = mm_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
 
@@ -278,6 +287,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	int ret;
 
 	ret = ckpt_collect_file_table(ctx, t);
+	if (ret < 0)
+		return ret;
+	ret = ckpt_collect_mm(ctx, t);
 
 	return ret;
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 1373ff9..1c98eee 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -199,6 +199,7 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 
 	ckpt_obj_hash_free(ctx);
 	path_put(&ctx->fs_mnt);
+	ckpt_pgarr_free(ctx);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -230,6 +231,8 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	ctx->ktime_begin = ktime_get();
 
 	atomic_set(&ctx->refcount, 0);
+	INIT_LIST_HEAD(&ctx->pgarr_list);
+	INIT_LIST_HEAD(&ctx->pgarr_pool);
 	init_waitqueue_head(&ctx->waitq);
 
 	err = -EBADF;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 026d058..70d9506 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -131,6 +131,7 @@ extern int restore_task(struct ckpt_ctx *ctx);
 extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx);
 extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
 
 extern int restore_read_header_arch(struct ckpt_ctx *ctx);
 extern int restore_thread(struct ckpt_ctx *ctx);
@@ -162,6 +163,29 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 			       struct ckpt_hdr_file *h);
 
+/* memory */
+extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
+
+extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma,
+				  enum vma_type type,
+				  int vma_objref);
+extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma,
+				  enum vma_type type,
+				  int vma_objref);
+
+extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+
+extern int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
+
+#define CKPT_VMA_NOT_SUPPORTED					\
+	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
+	 VM_NONLINEAR | VM_PFNMAP | VM_RESERVED |		\
+	 VM_NORESERVE | VM_MAPPED_COPY | VM_INSERTPAGE |	\
+	 VM_MIXEDMAP | VM_SAO)
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -173,6 +197,8 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
 #define CKPT_DFILE	0x10		/* files and filesystem */
+#define CKPT_DMEM	0x20		/* memory state */
+#define CKPT_DPAGE	0x40		/* memory pages */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 1124375..f29f87a 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -66,6 +66,11 @@ enum {
 	CKPT_HDR_FILE_NAME,
 	CKPT_HDR_FILE,
 
+	CKPT_HDR_MM = 401,
+	CKPT_HDR_VMA,
+	CKPT_HDR_PGARR,
+	CKPT_HDR_MM_CONTEXT,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -88,6 +93,7 @@ enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 	CKPT_OBJ_FILE_TABLE,
 	CKPT_OBJ_FILE,
+	CKPT_OBJ_MM,
 	CKPT_OBJ_MAX
 };
 
@@ -95,6 +101,8 @@ enum obj_type {
 struct ckpt_const {
 	/* task */
 	__u16 task_comm_len;
+	/* mm */
+	__u16 mm_saved_auxv_len;
 	/* uts */
 	__u16 uts_release_len;
 	__u16 uts_version_len;
@@ -169,6 +177,7 @@ struct ckpt_hdr_task {
 struct ckpt_hdr_task_objs {
 	struct ckpt_hdr h;
 	__s32 files_objref;
+	__s32 mm_objref;
 } __attribute__((aligned(8)));
 
 /* restart blocks */
@@ -227,4 +236,47 @@ struct ckpt_hdr_file_generic {
 	struct ckpt_hdr_file common;
 } __attribute__((aligned(8)));
 
+/* memory layout */
+struct ckpt_hdr_mm {
+	struct ckpt_hdr h;
+	__u32 map_count;
+	__s32 exe_objref;
+
+	__u64 def_flags;
+	__u64 flags;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vma_type {
+	CKPT_VMA_IGNORE = 0,
+	CKPT_VMA_VDSO,		/* special vdso vma */
+	CKPT_VMA_ANON,		/* private anonymous */
+	CKPT_VMA_FILE,		/* private mapped file */
+	CKPT_VMA_MAX
+};
+
+/* vma descriptor */
+struct ckpt_hdr_vma {
+	struct ckpt_hdr h;
+	__u32 vma_type;
+	__s32 vma_objref;	/* objref of backing file */
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+/* page array */
+struct ckpt_hdr_pgarr {
+	struct ckpt_hdr h;
+	__u64 nr_pages;		/* number of pages to save */
+} __attribute__((aligned(8)));
+
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 795742f..f214109 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -15,6 +15,8 @@
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/list.h>
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
@@ -49,6 +51,9 @@ struct ckpt_ctx {
 	char err_string[256];	/* checkpoint: error string */
 	int errno;		/* restart: errno that caused failure */
 
+	struct list_head pgarr_list;	/* page array to dump VMA contents */
+	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
+
 	/* [multi-process checkpoint] */
 	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
 	int nr_tasks;                   /* size of tasks array */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5ace89..d5f9889 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1157,6 +1157,11 @@ extern void truncate_inode_pages_range(struct address_space *,
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
 
+#ifdef CONFIG_CHECKPOINT
+/* generic vm_area_ops exported for mapped files checkpoint */
+extern int filemap_checkpoint(struct ckpt_ctx *, struct vm_area_struct *);
+#endif
+
 /* mm/page-writeback.c */
 int write_one_page(struct page *page, int wait);
 void task_dirty_inc(struct task_struct *tsk);
diff --git a/mm/filemap.c b/mm/filemap.c
index ccea3b6..72ba805 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <linux/checkpoint.h>
 #include "internal.h"
 
 /*
@@ -1648,8 +1649,32 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+#ifdef CONFIG_CHECKPOINT
+int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct file *file = vma->vm_file;
+	int vma_objref;
+
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ENOSYS;
+	}
+
+	BUG_ON(!file);
+
+	vma_objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	if (vma_objref < 0)
+		return vma_objref;
+
+	return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
+}
+#endif /* CONFIG_CHECKPOINT */
+
 struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= filemap_checkpoint,
+#endif
 };
 
 /* This is used for a general mmap of a disk file */
diff --git a/mm/mmap.c b/mm/mmap.c
index 8101de4..7415a61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -29,6 +29,7 @@
 #include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
 #include <linux/perf_counter.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2267,9 +2268,36 @@ static void special_mapping_close(struct vm_area_struct *vma)
 {
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma)
+{
+	const char *name;
+
+	/*
+	 * FIX:
+	 * Currently, we only handle VDSO/vsyscall special mappings.
+	 * Even that is very basic - we just skip the contents and
+	 * hope for the best in terms of compatibility upon restart.
+	 */
+
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED)
+		return -ENOSYS;
+
+	name = arch_vma_name(vma);
+	if (!name || strcmp(name, "[vdso]"))
+		return -ENOSYS;
+
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
+}
+#endif /* CONFIG_CHECKPOINT */
+
 static struct vm_operations_struct special_mapping_vmops = {
 	.close = special_mapping_close,
 	.fault = special_mapping_fault,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint = special_mapping_checkpoint,
+#endif
 };
 
 /*
-- 
1.6.0.4



* [PATCH v18 37/80] c/r: dump memory address space (private memory)
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

For each vma, there is a 'struct ckpt_hdr_vma'. Then come the actual
contents, in one or more chunks: each chunk begins with a header that
specifies how many pages it holds, then the virtual addresses of all
the dumped pages in that chunk, followed by the actual contents of all
dumped pages. A header with a page count of zero marks the end of the
contents.  Then comes the next vma and so on.
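
The chunk layout described above can be modelled in user space. The
following is only a sketch under stated assumptions: the names
demo_pgarr_hdr, demo_write_chunk() and demo_count_pages() are
hypothetical, the page size is assumed to be 4 KiB, and the generic
'struct ckpt_hdr' (and the buffer header that precedes the
vaddrs/pages) is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Hypothetical, simplified chunk header: only the page count. */
struct demo_pgarr_hdr {
	uint64_t nr_pages;
};

/*
 * Append one chunk: header, then all vaddrs, then all page contents.
 * A chunk with nr == 0 is the end-of-contents marker.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
size_t demo_write_chunk(uint8_t *buf, size_t room, const uint64_t *vaddrs,
			const uint8_t *pages, uint64_t nr)
{
	struct demo_pgarr_hdr h = { .nr_pages = nr };
	size_t need = sizeof(h) + nr * (sizeof(uint64_t) + DEMO_PAGE_SIZE);

	assert(buf != NULL);
	if (need > room)
		return 0;

	memcpy(buf, &h, sizeof(h));
	buf += sizeof(h);
	if (nr) {
		memcpy(buf, vaddrs, nr * sizeof(uint64_t));
		buf += nr * sizeof(uint64_t);
		memcpy(buf, pages, nr * DEMO_PAGE_SIZE);
	}
	return need;
}

/*
 * Walk the chunks of one vma until the zero-page terminator.
 * Returns the total number of pages seen, or -1 on a truncated stream.
 */
long demo_count_pages(const uint8_t *buf, size_t len)
{
	const size_t per_page = sizeof(uint64_t) + DEMO_PAGE_SIZE;
	long total = 0;

	for (;;) {
		struct demo_pgarr_hdr h;
		size_t body;

		if (len < sizeof(h))
			return -1;
		memcpy(&h, buf, sizeof(h));
		buf += sizeof(h);
		len -= sizeof(h);

		if (h.nr_pages == 0)
			return total;	/* end-of-contents marker */
		if (h.nr_pages > len / per_page)
			return -1;	/* chunk claims more than we have */

		body = (size_t)h.nr_pages * per_page;
		buf += body;
		len -= body;
		total += (long)h.nr_pages;
	}
}
```

Keeping all addresses of a chunk ahead of the page data means a reader
learns where every page belongs before it has to consume any contents.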

To checkpoint a vma, call the ops->checkpoint() method of that vma.
Normally the per-vma function will invoke generic_vma_checkpoint()
which first writes the vma description, followed by the specific
logic to dump the contents of the pages.
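
The per-vma dispatch can be pictured roughly as below. This is a
user-space sketch, not the kernel code: struct demo_vma, struct
demo_vm_ops, DEMO_VMA_NOT_SUPPORTED and the demo_*() functions are
hypothetical stand-ins, and the "dump" merely records which handler
ran.

```c
#include <errno.h>
#include <stddef.h>

struct demo_vma;

/* Hypothetical stand-in for the vm_operations_struct hook. */
struct demo_vm_ops {
	int (*checkpoint)(struct demo_vma *vma);
};

enum { DEMO_VMA_ANON = 1, DEMO_VMA_FILE = 2 };

#define DEMO_VMA_NOT_SUPPORTED 0x1ul	/* assumption: one "bad" flag bit */

struct demo_vma {
	const struct demo_vm_ops *ops;	/* NULL for anonymous memory */
	unsigned long flags;
	int dumped_as;			/* which handler ran, 0 if none */
};

/* Handler used when the vma has no ops at all (private anonymous). */
int demo_anonymous_checkpoint(struct demo_vma *vma)
{
	vma->dumped_as = DEMO_VMA_ANON;
	return 0;
}

/* Handler a file mapping would install in its ops table. */
int demo_filemap_checkpoint(struct demo_vma *vma)
{
	vma->dumped_as = DEMO_VMA_FILE;
	return 0;
}

const struct demo_vm_ops demo_file_vm_ops = {
	.checkpoint = demo_filemap_checkpoint,
};

/*
 * Pick the handler for one vma: refuse unsupported flags, fall back to
 * the anonymous handler when there are no ops, and fail with -ENOSYS
 * when the ops table exists but has no checkpoint method.
 */
int demo_checkpoint_vma(struct demo_vma *vma)
{
	if (vma->flags & DEMO_VMA_NOT_SUPPORTED)
		return -ENOSYS;
	if (!vma->ops)
		return demo_anonymous_checkpoint(vma);
	if (vma->ops->checkpoint)
		return vma->ops->checkpoint(vma);
	return -ENOSYS;
}
```

With this shape, a mapping type that is not known to be safe to
checkpoint fails loudly with -ENOSYS instead of being silently skipped.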

Currently for private mapped memory we save the pathname of the file
that is mapped (restart will use it to re-open it and then map it).
Later we change that to reference a file object.

Changelog[v18]:
  - Tighten checks on supported vma to checkpoint or restart
  - Add a few more ckpt_write_err()s
  - [Serge Hallyn] Export filemap_checkpoint() (used later for ext4)
  - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
  - In collect_mm() use retval from ckpt_obj_collect() to test for
    first-time-object
Changelog[v17]:
  - Only collect sub-objects of mm_struct once
  - Save mm->{flags,def_flags,saved_auxv}
Changelog[v16]:
  - Precede vaddrs/pages with a buffer header
  - Checkpoint mm->exe_file
  - Handle shared task->mm
Changelog[v14]:
  - Modify the ops->checkpoint method to be much more powerful
  - Improve support for VDSO (with special_mapping checkpoint callback)
  - Save new field 'vdso' in mm_context
  - Revert change to pr_debug(), back to ckpt_debug()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'
Changelog[v13]:
  - pgprot_t is an abstract type; use the proper accessor (fix for
    64-bit powerpc, Nathan Lynch <ntl@pobox.com>)
Changelog[v12]:
  - Hide pgarr management inside ckpt_private_vma_fill_pgarr()
  - Fix management of pgarr chain reset and alloc/expand: keep empty
    pgarr in a pool chain
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory
Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in ckpt_fill_name()
Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem
    namespace boundary). For now ckpt_fill_fname() fails the checkpoint.
Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)
Changelog[v5]:
  - Improve memory dump code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
Changelog[v4]:
  - Use standard list_... for ckpt_pgarr

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/include/asm/checkpoint_hdr.h |    8 +
 arch/x86/mm/checkpoint.c              |   31 ++
 checkpoint/Makefile                   |    3 +-
 checkpoint/checkpoint.c               |    3 +
 checkpoint/memory.c                   |  697 +++++++++++++++++++++++++++++++++
 checkpoint/objhash.c                  |   25 ++
 checkpoint/process.c                  |   12 +
 checkpoint/sys.c                      |    3 +
 include/linux/checkpoint.h            |   26 ++
 include/linux/checkpoint_hdr.h        |   52 +++
 include/linux/checkpoint_types.h      |    5 +
 include/linux/mm.h                    |    5 +
 mm/filemap.c                          |   25 ++
 mm/mmap.c                             |   28 ++
 14 files changed, 922 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/memory.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index f4d1e14..0e756b0 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -45,6 +45,7 @@
 /* arch dependent header types */
 enum {
 	CKPT_HDR_CPU_FPU = 201,
+	CKPT_HDR_MM_CONTEXT_LDT,
 };
 
 struct ckpt_hdr_header_arch {
@@ -118,4 +119,11 @@ struct ckpt_hdr_cpu {
 #define CKPT_X86_SEG_TLS	0x4000	/* 0100 0000 0000 00xx */
 #define CKPT_X86_SEG_LDT	0x8000	/* 100x xxxx xxxx xxxx */
 
+struct ckpt_hdr_mm_context {
+	struct ckpt_hdr h;
+	__u64 vdso;
+	__u32 ldt_entry_size;
+	__u32 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 023039f..83ed047 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -324,6 +324,37 @@ int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (!h)
+		return -ENOMEM;
+
+	mutex_lock(&mm->context.lock);
+
+	h->vdso = (unsigned long) mm->context.vdso;
+	h->ldt_entry_size = LDT_ENTRY_SIZE;
+	h->nldt = mm->context.size;
+
+	ckpt_debug("nldt %d vdso %#llx\n", h->nldt, h->vdso);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj_type(ctx, mm->context.ldt,
+				  mm->context.size * LDT_ENTRY_SIZE,
+				  CKPT_HDR_MM_CONTEXT_LDT);
+ out:
+	mutex_unlock(&mm->context.lock);
+	return ret;
+}
+
 /**************************************************************************
  * Restart
  */
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 1d0c058..f56a7d6 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -8,4 +8,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	checkpoint.o \
 	restart.o \
 	process.o \
-	files.o
+	files.o \
+	memory.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 4cc2a2f..f907485 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -280,10 +280,13 @@ int ckpt_write_err(struct ckpt_ctx *ctx, char *pre, char *fmt, ...)
 static void fill_kernel_const(struct ckpt_const *h)
 {
 	struct task_struct *tsk;
+	struct mm_struct *mm;
 	struct new_utsname *uts;
 
 	/* task */
 	h->task_comm_len = sizeof(tsk->comm);
+	/* mm */
+	h->mm_saved_auxv_len = sizeof(mm->saved_auxv);
 	/* uts */
 	h->uts_release_len = sizeof(uts->release);
 	h->uts_version_len = sizeof(uts->version);
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
new file mode 100644
index 0000000..d19d627
--- /dev/null
+++ b/checkpoint/memory.c
@@ -0,0 +1,697 @@
+/*
+ *  Checkpoint/restart memory contents
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DMEM
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/proc_fs.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * page-array chains: each ckpt_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct ckpt_pgarr {
+	unsigned long *vaddrs;
+	struct page **pages;
+	unsigned int nr_used;
+	struct list_head list;
+};
+
+#define CKPT_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
+#define CKPT_PGARR_BATCH  (16 * CKPT_PGARR_TOTAL)
+
+static inline int pgarr_is_full(struct ckpt_pgarr *pgarr)
+{
+	return (pgarr->nr_used == CKPT_PGARR_TOTAL);
+}
+
+static inline int pgarr_nr_free(struct ckpt_pgarr *pgarr)
+{
+	return CKPT_PGARR_TOTAL - pgarr->nr_used;
+}
+
+/*
+ * utilities to alloc, free, and handle 'struct ckpt_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has two members for page-arrays:
+ *   ctx->pgarr_list: list head of populated page-array chain
+ *   ctx->pgarr_pool: list head of empty page-array pool chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * Before the next chunk of pages, the chain is reset (by dereferencing
+ * all pages) but not freed; instead, empty descriptors are kept in the pool.
+ *
+ * The head of the chain page-array ("current") advances as necessary. When
+ * it gets full, a new page-array descriptor is pushed in front of it. The
+ * new descriptor is taken from first empty descriptor (if one exists, for
+ * instance, after a chain reset), or allocated on-demand.
+ *
+ * When dumping the data, the chain is traversed in reverse order.
+ */
+
+/* return first page-array in the chain */
+static inline struct ckpt_pgarr *pgarr_first(struct ckpt_ctx *ctx)
+{
+	if (list_empty(&ctx->pgarr_list))
+		return NULL;
+	return list_first_entry(&ctx->pgarr_list, struct ckpt_pgarr, list);
+}
+
+/* return (and detach) first empty page-array in the pool, if exists */
+static inline struct ckpt_pgarr *pgarr_from_pool(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	if (list_empty(&ctx->pgarr_pool))
+		return NULL;
+	pgarr = list_first_entry(&ctx->pgarr_pool, struct ckpt_pgarr, list);
+	list_del(&pgarr->list);
+	return pgarr;
+}
+
+/* release pages referenced by a page-array */
+static void pgarr_release_pages(struct ckpt_pgarr *pgarr)
+{
+	ckpt_debug("total pages %d\n", pgarr->nr_used);
+	/*
+	 * both checkpoint and restart use 'nr_used', however we only
+	 * collect pages during checkpoint; in restart we simply return
+	 * because pgarr->pages remains NULL.
+	 */
+	if (pgarr->pages) {
+		struct page **pages = pgarr->pages;
+		int nr = pgarr->nr_used;
+
+		while (nr--)
+			page_cache_release(pages[nr]);
+	}
+
+	pgarr->nr_used = 0;
+}
+
+/* free a single page-array object */
+static void pgarr_free_one(struct ckpt_pgarr *pgarr)
+{
+	pgarr_release_pages(pgarr);
+	kfree(pgarr->pages);
+	kfree(pgarr->vaddrs);
+	kfree(pgarr);
+}
+
+/* free the chains of page-arrays (populated and empty pool) */
+void ckpt_pgarr_free(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr, *tmp;
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+		list_del(&pgarr->list);
+		pgarr_free_one(pgarr);
+	}
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_pool, list) {
+		list_del(&pgarr->list);
+		pgarr_free_one(pgarr);
+	}
+}
+
+/* allocate a single page-array object */
+static struct ckpt_pgarr *pgarr_alloc_one(unsigned long flags)
+{
+	struct ckpt_pgarr *pgarr;
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+	pgarr->vaddrs = kmalloc(CKPT_PGARR_TOTAL * sizeof(unsigned long),
+				GFP_KERNEL);
+	if (!pgarr->vaddrs)
+		goto nomem;
+
+	/* pgarr->pages is needed only for checkpoint */
+	if (flags & CKPT_CTX_CHECKPOINT) {
+		pgarr->pages = kmalloc(CKPT_PGARR_TOTAL *
+				       sizeof(struct page *), GFP_KERNEL);
+		if (!pgarr->pages)
+			goto nomem;
+	}
+
+	return pgarr;
+ nomem:
+	pgarr_free_one(pgarr);
+	return NULL;
+}
+
+/* pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the first page-array in the list that has space. Otherwise,
+ * try the next page-array after the last non-empty one, and move it to
+ * the front of the chain. Extends the list if none has space.
+ */
+static struct ckpt_pgarr *pgarr_current(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	pgarr = pgarr_first(ctx);
+	if (pgarr && !pgarr_is_full(pgarr))
+		return pgarr;
+
+	pgarr = pgarr_from_pool(ctx);
+	if (!pgarr)
+		pgarr = pgarr_alloc_one(ctx->kflags);
+	if (!pgarr)
+		return NULL;
+
+	list_add(&pgarr->list, &ctx->pgarr_list);
+	return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+static void pgarr_reset_all(struct ckpt_ctx *ctx)
+{
+	struct ckpt_pgarr *pgarr;
+
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list)
+		pgarr_release_pages(pgarr);
+	list_splice_init(&ctx->pgarr_list, &ctx->pgarr_pool);
+}
+
+/**************************************************************************
+ * Checkpoint
+ *
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * consider_private_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that corresponds to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ */
+static struct page *consider_private_page(struct vm_area_struct *vma,
+					  unsigned long addr)
+{
+	struct page *page;
+
+	/*
+	 * simplified version of get_user_pages(): already have vma,
+	 * only need FOLL_ANON, and (for now) ignore fault stats.
+	 *
+	 * follow_page() will return NULL if the page is not present
+	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+	 * the actual page pointer otherwise.
+	 *
+	 * FIXME: consolidate with get_user_pages()
+	 */
+
+	cond_resched();
+	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+		int ret;
+
+		/* the page is swapped out - bring it in (optimize ?) */
+		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+		if (ret & VM_FAULT_ERROR) {
+			if (ret & VM_FAULT_OOM)
+				return ERR_PTR(-ENOMEM);
+			else if (ret & VM_FAULT_SIGBUS)
+				return ERR_PTR(-EFAULT);
+			else
+				BUG();
+			break;
+		}
+		cond_resched();
+	}
+
+	if (IS_ERR(page))
+		return page;
+
+	/*
+	 * Only care about dirty pages: either anonymous non-zero pages,
+	 * or file-backed COW (copy-on-write) pages that were modified.
+	 * A clean COW page is not interesting because its contents are
+	 * identical to the backing file; ignore such pages.
+	 * A file-backed broken COW is identified by its page_mapping()
+	 * being unset (NULL) because the page will no longer be mapped
+	 * to the original file after having been modified.
+	 */
+	if (page == ZERO_PAGE(0)) {
+		/* this is the zero page: ignore */
+		page_cache_release(page);
+		page = NULL;
+	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
+		/* file backed clean cow: ignore */
+		page_cache_release(page);
+		page = NULL;
+	}
+
+	return page;
+}
+
+/**
+ * vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected
+ */
+static int vma_fill_pgarr(struct ckpt_ctx *ctx,
+			  struct vm_area_struct *vma,
+			  unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	struct ckpt_pgarr *pgarr;
+	int nr_used;
+	int cnt = 0;
+
+	/* this function is only for private memory (anon or file-mapped) */
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	do {
+		pgarr = pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+
+		nr_used = pgarr->nr_used;
+
+		while (addr < end) {
+			struct page *page;
+
+			page = consider_private_page(vma, addr);
+			if (IS_ERR(page))
+				return PTR_ERR(page);
+
+			if (page) {
+				_ckpt_debug(CKPT_DPAGE,
+					    "got page %#lx\n", addr);
+				pgarr->pages[pgarr->nr_used] = page;
+				pgarr->vaddrs[pgarr->nr_used] = addr;
+				pgarr->nr_used++;
+			}
+
+			addr += PAGE_SIZE;
+
+			if (pgarr_is_full(pgarr))
+				break;
+		}
+
+		cnt += pgarr->nr_used - nr_used;
+
+	} while ((cnt < CKPT_PGARR_BATCH) && (addr < end));
+
+	*start = addr;
+	return cnt;
+}
+
+/* dump contents of a page: use kmap_atomic() to avoid TLB flush */
+static int checkpoint_dump_page(struct ckpt_ctx *ctx,
+				struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return ckpt_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
+{
+	struct ckpt_pgarr *pgarr;
+	void *buf;
+	int i, ret = 0;
+
+	if (!total)
+		return 0;
+
+	i =  total * (sizeof(unsigned long) + PAGE_SIZE);
+	ret = ckpt_write_obj_type(ctx, NULL, i, CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		ret = ckpt_kwrite(ctx, pgarr->vaddrs,
+				  pgarr->nr_used * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = (void *) __get_free_page(GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		for (i = 0; i < pgarr->nr_used; i++) {
+			ret = checkpoint_dump_page(ctx, pgarr->pages[i], buf);
+			if (ret < 0)
+				goto out;
+		}
+	}
+ out:
+	free_page((unsigned long) buf);
+	return ret;
+}
+
+/**
+ * checkpoint_memory_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect lists of pages that need to be dumped, and corresponding
+ * virtual addresses into ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma)
+{
+	struct ckpt_hdr_pgarr *h;
+	unsigned long addr, end;
+	int cnt, ret;
+
+	addr = vma->vm_start;
+	end = vma->vm_end;
+
+	/*
+	 * Work iteratively, collecting and dumping at most CKPT_PGARR_BATCH
+	 * pages in each round. Each iteration is divided into two steps:
+	 *
+	 * (1) scan: scan through the PTEs of the vma to collect the pages
+	 * to dump (later we'll also make them COW), while keeping a list
+	 * of pages and their corresponding addresses on ctx->pgarr_list.
+	 *
+	 * (2) dump: write out a header specifying how many pages, followed
+	 * by the addresses of all pages in ctx->pgarr_list, followed by
+	 * the actual contents of all pages. (Then, release the references
+	 * to the pages and reset the page-array chain).
+	 *
+	 * (This split makes the logic simpler by first counting the pages
+	 * that need saving. More importantly, it allows for a future
+	 * optimization that will reduce application downtime by deferring
+	 * the actual write-out of the data to after the application is
+	 * allowed to resume execution).
+	 *
+	 * After dumping the entire contents, conclude with a header that
+	 * specifies 0 pages to mark the end of the contents.
+	 */
+
+	while (addr < end) {
+		cnt = vma_fill_pgarr(ctx, vma, &addr);
+		if (cnt == 0)
+			break;
+		else if (cnt < 0)
+			return cnt;
+
+		ckpt_debug("collected %d pages\n", cnt);
+
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+		if (!h)
+			return -ENOMEM;
+
+		h->nr_pages = cnt;
+		ret = ckpt_write_obj(ctx, &h->h);
+		ckpt_hdr_put(ctx, h);
+		if (ret < 0)
+			return ret;
+
+		ret = vma_dump_pages(ctx, cnt);
+		if (ret < 0)
+			return ret;
+
+		pgarr_reset_all(ctx);
+	}
+
+	/* mark end of contents with header saying "0" pages */
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+	if (!h)
+		return -ENOMEM;
+	h->nr_pages = 0;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**
+ * generic_vma_checkpoint - dump metadata of vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @vma_objref: vma objref
+ */
+int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
+			   enum vma_type type, int vma_objref)
+{
+	struct ckpt_hdr_vma *h;
+	int ret;
+
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d\n",
+		 vma->vm_start, vma->vm_end, vma->vm_flags, type);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+	if (!h)
+		return -ENOMEM;
+
+	h->vma_type = type;
+	h->vma_objref = vma_objref;
+	h->vm_start = vma->vm_start;
+	h->vm_end = vma->vm_end;
+	h->vm_page_prot = pgprot_val(vma->vm_page_prot);
+	h->vm_flags = vma->vm_flags;
+	h->vm_pgoff = vma->vm_pgoff;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**
+ * private_vma_checkpoint - dump contents of private (anon, file) vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @vma_objref: vma objref
+ */
+int private_vma_checkpoint(struct ckpt_ctx *ctx,
+			   struct vm_area_struct *vma,
+			   enum vma_type type, int vma_objref)
+{
+	int ret;
+
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, vma);
+ out:
+	return ret;
+}
+
+/**
+ * anonymous_checkpoint - dump contents of private-anonymous vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ */
+static int anonymous_checkpoint(struct ckpt_ctx *ctx,
+				struct vm_area_struct *vma)
+{
+	/* should be private anonymous ... verify that this is the case */
+	BUG_ON(vma->vm_flags & VM_MAYSHARE);
+	BUG_ON(vma->vm_file);
+
+	return private_vma_checkpoint(ctx, vma, CKPT_VMA_ANON, 0);
+}
+
+static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm *h;
+	struct vm_area_struct *vma;
+	int exe_objref = 0;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM);
+	if (!h)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+
+	h->flags = mm->flags;
+	h->def_flags = mm->def_flags;
+
+	h->start_code = mm->start_code;
+	h->end_code = mm->end_code;
+	h->start_data = mm->start_data;
+	h->end_data = mm->end_data;
+	h->start_brk = mm->start_brk;
+	h->brk = mm->brk;
+	h->start_stack = mm->start_stack;
+	h->arg_start = mm->arg_start;
+	h->arg_end = mm->arg_end;
+	h->env_start = mm->env_start;
+	h->env_end = mm->env_end;
+
+	h->map_count = mm->map_count;
+
+	/* checkpoint the ->exe_file */
+	if (mm->exe_file) {
+		exe_objref = checkpoint_obj(ctx, mm->exe_file, CKPT_OBJ_FILE);
+		if (exe_objref < 0) {
+			ret = exe_objref;
+			goto out;
+		}
+		h->exe_objref = exe_objref;
+	}
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_buffer(ctx, mm->saved_auxv, sizeof(mm->saved_auxv));
+	if (ret < 0)
+		goto out;
+
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ckpt_debug("vma %#lx-%#lx flags %#lx\n",
+			 vma->vm_start, vma->vm_end, vma->vm_flags);
+		if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+			ckpt_write_err(ctx, "TE", "vma: bad flags (%#lx)\n",
+				       -ENOSYS, vma->vm_flags);
+			return -ENOSYS;
+		}
+		if (!vma->vm_ops)
+			ret = anonymous_checkpoint(ctx, vma);
+		else if (vma->vm_ops->checkpoint)
+			ret = (*vma->vm_ops->checkpoint)(ctx, vma);
+		else
+			ret = -ENOSYS;
+		if (ret < 0) {
+			ckpt_write_err(ctx, "TE", "vma: failed", ret);
+			goto out;
+		}
+		/*
+		 * The file was collected, but not always checkpointed;
+		 * be safe and mark as visited to appease leak detection
+		 */
+		if (vma->vm_file && !(ctx->uflags & CHECKPOINT_SUBTREE)) {
+			ret = ckpt_obj_visit(ctx, vma->vm_file, CKPT_OBJ_FILE);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+	ret = checkpoint_mm_context(ctx, mm);
+ out:
+	ckpt_hdr_put(ctx, h);
+	up_read(&mm->mmap_sem);
+	return ret;
+}
+
+int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_mm(ctx, (struct mm_struct *) ptr);
+}
+
+int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct mm_struct *mm;
+	int objref;
+
+	mm = get_task_mm(t);
+	objref = checkpoint_obj(ctx, mm, CKPT_OBJ_MM);
+	mmput(mm);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+static int collect_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+	struct file *file;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, mm, CKPT_OBJ_MM);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this mm (ret > 0), proceed inside */
+	down_read(&mm->mmap_sem);
+	if (mm->exe_file) {
+		ret = ckpt_collect_file(ctx, mm->exe_file);
+		if (ret < 0) {
+			ckpt_write_err(ctx, "TE", "mm: collect exe_file", ret);
+			goto out;
+		}
+	}
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		file = vma->vm_file;
+		if (!file)
+			continue;
+		ret = ckpt_collect_file(ctx, file);
+		if (ret < 0) {
+			ckpt_write_err(ctx, "TE", "mm: collect vm_file", ret);
+			break;
+		}
+	}
+ out:
+	up_read(&mm->mmap_sem);
+	return ret;
+
+}
+
+int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct mm_struct *mm;
+	int ret;
+
+	mm = get_task_mm(t);
+	ret = collect_mm(ctx, mm);
+	mmput(mm);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index b7c8fdb..4c207c3 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -96,6 +96,22 @@ static int obj_file_users(void *ptr)
 	return atomic_long_read(&((struct file *) ptr)->f_count);
 }
 
+static int obj_mm_grab(void *ptr)
+{
+	atomic_inc(&((struct mm_struct *) ptr)->mm_users);
+	return 0;
+}
+
+static void obj_mm_drop(void *ptr, int lastref)
+{
+	mmput((struct mm_struct *) ptr);
+}
+
+static int obj_mm_users(void *ptr)
+{
+	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -124,6 +140,15 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_file,
 		.restore = restore_file,
 	},
+	/* mm object */
+	{
+		.obj_name = "MM",
+		.obj_type = CKPT_OBJ_MM,
+		.ref_drop = obj_mm_drop,
+		.ref_grab = obj_mm_grab,
+		.ref_users = obj_mm_users,
+		.checkpoint = checkpoint_mm,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 6ad9c01..3d0eb36 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -108,6 +108,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_task_objs *h;
 	int files_objref;
+	int mm_objref;
 	int ret;
 
 	files_objref = checkpoint_obj_file_table(ctx, t);
@@ -117,10 +118,18 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return files_objref;
 	}
 
+	mm_objref = checkpoint_obj_mm(ctx, t);
+	ckpt_debug("mm: objref %d\n", mm_objref);
+	if (mm_objref < 0) {
+		ckpt_write_err(ctx, "TE", "mm_struct", mm_objref);
+		return mm_objref;
+	}
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
 	if (!h)
 		return -ENOMEM;
 	h->files_objref = files_objref;
+	h->mm_objref = mm_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
 
@@ -278,6 +287,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	int ret;
 
 	ret = ckpt_collect_file_table(ctx, t);
+	if (ret < 0)
+		return ret;
+	ret = ckpt_collect_mm(ctx, t);
 
 	return ret;
 }
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 1373ff9..1c98eee 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -199,6 +199,7 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 
 	ckpt_obj_hash_free(ctx);
 	path_put(&ctx->fs_mnt);
+	ckpt_pgarr_free(ctx);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -230,6 +231,8 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	ctx->ktime_begin = ktime_get();
 
 	atomic_set(&ctx->refcount, 0);
+	INIT_LIST_HEAD(&ctx->pgarr_list);
+	INIT_LIST_HEAD(&ctx->pgarr_pool);
 	init_waitqueue_head(&ctx->waitq);
 
 	err = -EBADF;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 026d058..70d9506 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -131,6 +131,7 @@ extern int restore_task(struct ckpt_ctx *ctx);
 extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx);
 extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
 
 extern int restore_read_header_arch(struct ckpt_ctx *ctx);
 extern int restore_thread(struct ckpt_ctx *ctx);
@@ -162,6 +163,29 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 			       struct ckpt_hdr_file *h);
 
+/* memory */
+extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
+
+extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma,
+				  enum vma_type type,
+				  int vma_objref);
+extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
+				  struct vm_area_struct *vma,
+				  enum vma_type type,
+				  int vma_objref);
+
+extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+
+extern int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
+
+#define CKPT_VMA_NOT_SUPPORTED					\
+	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
+	 VM_NONLINEAR | VM_PFNMAP | VM_RESERVED |		\
+	 VM_NORESERVE | VM_MAPPED_COPY |			\
+	 VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -173,6 +197,8 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
 #define CKPT_DFILE	0x10		/* files and filesystem */
+#define CKPT_DMEM	0x20		/* memory state */
+#define CKPT_DPAGE	0x40		/* memory pages */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 1124375..f29f87a 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -66,6 +66,11 @@ enum {
 	CKPT_HDR_FILE_NAME,
 	CKPT_HDR_FILE,
 
+	CKPT_HDR_MM = 401,
+	CKPT_HDR_VMA,
+	CKPT_HDR_PGARR,
+	CKPT_HDR_MM_CONTEXT,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -88,6 +93,7 @@ enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 	CKPT_OBJ_FILE_TABLE,
 	CKPT_OBJ_FILE,
+	CKPT_OBJ_MM,
 	CKPT_OBJ_MAX
 };
 
@@ -95,6 +101,8 @@ enum obj_type {
 struct ckpt_const {
 	/* task */
 	__u16 task_comm_len;
+	/* mm */
+	__u16 mm_saved_auxv_len;
 	/* uts */
 	__u16 uts_release_len;
 	__u16 uts_version_len;
@@ -169,6 +177,7 @@ struct ckpt_hdr_task {
 struct ckpt_hdr_task_objs {
 	struct ckpt_hdr h;
 	__s32 files_objref;
+	__s32 mm_objref;
 } __attribute__((aligned(8)));
 
 /* restart blocks */
@@ -227,4 +236,47 @@ struct ckpt_hdr_file_generic {
 	struct ckpt_hdr_file common;
 } __attribute__((aligned(8)));
 
+/* memory layout */
+struct ckpt_hdr_mm {
+	struct ckpt_hdr h;
+	__u32 map_count;
+	__s32 exe_objref;
+
+	__u64 def_flags;
+	__u64 flags;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vma_type {
+	CKPT_VMA_IGNORE = 0,
+	CKPT_VMA_VDSO,		/* special vdso vma */
+	CKPT_VMA_ANON,		/* private anonymous */
+	CKPT_VMA_FILE,		/* private mapped file */
+	CKPT_VMA_MAX
+};
+
+/* vma descriptor */
+struct ckpt_hdr_vma {
+	struct ckpt_hdr h;
+	__u32 vma_type;
+	__s32 vma_objref;	/* objref of backing file */
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+/* page array */
+struct ckpt_hdr_pgarr {
+	struct ckpt_hdr h;
+	__u64 nr_pages;		/* number of pages saved */
+} __attribute__((aligned(8)));
+
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 795742f..f214109 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -15,6 +15,8 @@
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/list.h>
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
@@ -49,6 +51,9 @@ struct ckpt_ctx {
 	char err_string[256];	/* checkpoint: error string */
 	int errno;		/* restart: errno that caused failure */
 
+	struct list_head pgarr_list;	/* page array to dump VMA contents */
+	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
+
 	/* [multi-process checkpoint] */
 	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
 	int nr_tasks;                   /* size of tasks array */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5ace89..d5f9889 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1157,6 +1157,11 @@ extern void truncate_inode_pages_range(struct address_space *,
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
 
+#ifdef CONFIG_CHECKPOINT
+/* generic vm_area_ops exported for mapped files checkpoint */
+extern int filemap_checkpoint(struct ckpt_ctx *, struct vm_area_struct *);
+#endif
+
 /* mm/page-writeback.c */
 int write_one_page(struct page *page, int wait);
 void task_dirty_inc(struct task_struct *tsk);
diff --git a/mm/filemap.c b/mm/filemap.c
index ccea3b6..72ba805 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <linux/checkpoint.h>
 #include "internal.h"
 
 /*
@@ -1648,8 +1649,32 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+#ifdef CONFIG_CHECKPOINT
+int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct file *file = vma->vm_file;
+	int vma_objref;
+
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ENOSYS;
+	}
+
+	BUG_ON(!file);
+
+	vma_objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	if (vma_objref < 0)
+		return vma_objref;
+
+	return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
+}
+#endif /* CONFIG_CHECKPOINT */
+
 struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= filemap_checkpoint,
+#endif
 };
 
 /* This is used for a general mmap of a disk file */
diff --git a/mm/mmap.c b/mm/mmap.c
index 8101de4..7415a61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -29,6 +29,7 @@
 #include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
 #include <linux/perf_counter.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2267,9 +2268,36 @@ static void special_mapping_close(struct vm_area_struct *vma)
 {
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma)
+{
+	const char *name;
+
+	/*
+	 * FIX:
+	 * Currently, we only handle the VDSO/vsyscall special mapping.
+	 * Even that is very basic - we just skip the contents and
+	 * hope for the best in terms of compatibility upon restart.
+	 */
+
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED)
+		return -ENOSYS;
+
+	name = arch_vma_name(vma);
+	if (!name || strcmp(name, "[vdso]"))
+		return -ENOSYS;
+
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
+}
+#endif /* CONFIG_CHECKPOINT */
+
 static struct vm_operations_struct special_mapping_vmops = {
 	.close = special_mapping_close,
 	.fault = special_mapping_fault,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint = special_mapping_checkpoint,
+#endif
 };
 
 /*
-- 
1.6.0.4


* [PATCH v18 38/80] c/r: restore memory address space (private memory)
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (36 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 37/80] c/r: dump memory address space (private memory) Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 39/80] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
                     ` (43 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Restoring the memory address space begins by tearing down the existing
address space of the current process. Then, for each vma in the image,
read its state, call do_mmap_pgoff() to recreate it, and read in its
contents.
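
As a rough user-space analogue of the sequence above (not the kernel
code in this patch), the sketch below drops a mapping, recreates it at
the same address with mmap(MAP_FIXED), and copies saved data back in.
The names struct saved_vma, restore_region() and restore_demo() are
hypothetical and exist only for this illustration; the kernel path uses
do_mmap_pgoff() and get_user_pages() instead.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for a saved vma record (not struct ckpt_hdr_vma). */
struct saved_vma {
	uintptr_t start;	/* page-aligned start address */
	size_t len;		/* length in bytes, multiple of the page size */
	int prot;		/* PROT_* bits */
};

/* Recreate a private anonymous region at exactly its saved address. */
static void *restore_region(const struct saved_vma *v)
{
	return mmap((void *) v->start, v->len, v->prot,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}

/* Returns 0 if a region can be torn down, recreated and refilled. */
int restore_demo(void)
{
	size_t page = (size_t) sysconf(_SC_PAGESIZE);
	struct saved_vma v;
	unsigned char saved[64];
	void *p;
	int ok;

	/* "checkpoint": let the kernel pick a free address, save some data */
	p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 0xab, sizeof(saved));
	memcpy(saved, p, sizeof(saved));
	v.start = (uintptr_t) p;
	v.len = page;
	v.prot = PROT_READ | PROT_WRITE;

	/* tear down the old mapping, as restart does with the old mm */
	if (munmap(p, page) < 0)
		return -1;

	/* recreate the region at the same address */
	p = restore_region(&v);
	if (p == MAP_FAILED || (uintptr_t) p != v.start)
		return -1;
	/* a fresh anonymous mapping starts out zero-filled */
	if (((unsigned char *) p)[0] != 0)
		return -1;

	/* read the saved contents back in */
	memcpy(p, saved, sizeof(saved));
	ok = memcmp(p, saved, sizeof(saved)) == 0;
	munmap(p, page);
	return ok ? 0 : -1;
}
```

The sketch only covers a single private anonymous region; file-backed
vmas and the vdso need the per-type handling described in the changelog.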

Changelog[v18]:
  - Tighten checks on supported vma to checkpoint or restart
Changelog[v17]:
  - Restore mm->{flags,def_flags,saved_auxv}
  - Fix bogus warning in do_restore_mm()
Changelog[v16]:
  - Restore mm->exe_file
Changelog[v14]:
  - Introduce per vma-type restore() function
  - Merge restart code into same file as checkpoint (memory.c)
  - Compare saved 'vdso' field of mm_context with current value
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'
  - Revert change to pr_debug(), back to ckpt_debug()
Changelog[v13]:
  - Avoid access to hh->vma_type after the header is freed
  - Test for no vma's in exit_mmap() before calling unmap_vmas() (or it
    may crash if restart fails after having removed all vma's)
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)
Changelog[v5]:
  - Improve memory restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
Changelog[v4]:
  - Use standard list_... for ckpt_pgarr
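
For readers unfamiliar with the chunked <vaddrs, pages> layout mentioned
under Changelog[v5], here is a minimal user-space sketch of walking such
a stream. It assumes a simplified layout (a bare 64-bit page count per
chunk instead of the full struct ckpt_hdr_pgarr, and a tiny 16-byte
"page"); walk_chunks(), put_chunk() and chunk_demo() are hypothetical
helpers, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16	/* tiny "page" so the example stays small */

/* Called once for each <vaddr, page> pair found in the stream. */
typedef void (*page_fn)(uint64_t vaddr, const unsigned char *data, void *arg);

/*
 * Walk a buffer laid out as repeated chunks:
 *   u64 nr_pages; u64 vaddrs[nr_pages]; u8 pages[nr_pages][PAGE]
 * terminated by a chunk with nr_pages == 0.
 * Returns the number of pages seen, or -1 if the buffer is truncated.
 */
long walk_chunks(const unsigned char *buf, size_t len, page_fn fn, void *arg)
{
	size_t off = 0;
	long total = 0;

	for (;;) {
		const unsigned char *vaddrs, *pages;
		uint64_t nr, i, vaddr;

		if (len - off < sizeof(nr))
			return -1;
		memcpy(&nr, buf + off, sizeof(nr));
		off += sizeof(nr);
		if (nr == 0)
			return total;

		/* bound nr by the space left so later products cannot overflow */
		if (nr > (len - off) / (sizeof(vaddr) + SKETCH_PAGE_SIZE))
			return -1;

		vaddrs = buf + off;
		pages = vaddrs + nr * sizeof(vaddr);
		for (i = 0; i < nr; i++) {
			memcpy(&vaddr, vaddrs + i * sizeof(vaddr), sizeof(vaddr));
			fn(vaddr, pages + i * SKETCH_PAGE_SIZE, arg);
		}
		off += nr * (sizeof(vaddr) + SKETCH_PAGE_SIZE);
		total += (long) nr;
	}
}

/* Append one chunk whose pages are all filled with 'fill'. */
static size_t put_chunk(unsigned char *buf, size_t off, uint64_t nr,
			const uint64_t *vaddrs, unsigned char fill)
{
	uint64_t i;

	memcpy(buf + off, &nr, sizeof(nr));
	off += sizeof(nr);
	for (i = 0; i < nr; i++) {
		memcpy(buf + off, &vaddrs[i], sizeof(vaddrs[i]));
		off += sizeof(vaddrs[i]);
	}
	memset(buf + off, fill, nr * SKETCH_PAGE_SIZE);
	return off + nr * SKETCH_PAGE_SIZE;
}

struct seen {
	long count;
};

static void record(uint64_t vaddr, const unsigned char *data, void *arg)
{
	struct seen *s = arg;

	(void) vaddr;
	(void) data;
	s->count++;
}

/* Build two chunks plus a terminator and count the pages walked. */
long chunk_demo(void)
{
	unsigned char buf[256];
	uint64_t a[] = { 0x1000, 0x3000 }, b[] = { 0x8000 };
	struct seen s = { 0 };
	size_t off = 0;

	off = put_chunk(buf, off, 2, a, 1);
	off = put_chunk(buf, off, 1, b, 2);
	off = put_chunk(buf, off, 0, NULL, 0);	/* terminator */
	if (walk_chunks(buf, off, record, &s) < 0)
		return -1;
	return s.count;
}
```

Keeping the addresses of a chunk ahead of its page data lets the reader
fill one bounded array of addresses, then stream the pages, which is
the point of chunking instead of one long list of each.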


Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/x86/include/asm/ldt.h     |    7 +
 arch/x86/mm/checkpoint.c       |   64 ++++++
 checkpoint/memory.c            |  472 ++++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c           |    1 +
 checkpoint/process.c           |    3 +
 checkpoint/restart.c           |    4 +
 fs/exec.c                      |    2 +-
 include/linux/checkpoint.h     |    7 +
 include/linux/checkpoint_hdr.h |    2 +-
 include/linux/mm.h             |   13 +
 mm/filemap.c                   |   19 ++
 mm/mmap.c                      |   23 ++-
 12 files changed, 614 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/ldt.h b/arch/x86/include/asm/ldt.h
index 46727eb..f2845f9 100644
--- a/arch/x86/include/asm/ldt.h
+++ b/arch/x86/include/asm/ldt.h
@@ -37,4 +37,11 @@ struct user_desc {
 #define MODIFY_LDT_CONTENTS_CODE	2
 
 #endif /* !__ASSEMBLY__ */
+
+#ifdef __KERNEL__
+#include <linux/linkage.h>
+asmlinkage int sys_modify_ldt(int func, void __user *ptr,
+			      unsigned long bytecount);
+#endif
+
 #endif /* _ASM_X86_LDT_H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 83ed047..9dd8e12 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -13,6 +13,7 @@
 
 #include <asm/desc.h>
 #include <asm/i387.h>
+#include <asm/elf.h>
 
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -560,3 +561,66 @@ int restore_read_header_arch(struct ckpt_ctx *ctx)
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	unsigned int n;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("nldt %d vdso %#lx (%p)\n",
+		 h->nldt, (unsigned long) h->vdso, mm->context.vdso);
+
+	ret = -EINVAL;
+	if (h->vdso != (unsigned long) mm->context.vdso)
+		goto out;
+	if (h->ldt_entry_size != LDT_ENTRY_SIZE)
+		goto out;
+
+	ret = _ckpt_read_obj_type(ctx, NULL,
+				  h->nldt * LDT_ENTRY_SIZE,
+				  CKPT_HDR_MM_CONTEXT_LDT);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+	 */
+	for (n = 0; n < h->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+
+		ret = ckpt_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			break;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16);
+		info.limit = desc.limit0;
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+				     sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			break;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index d19d627..6a0dfaf 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -15,6 +15,9 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/file.h>
+#include <linux/err.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
 #include <linux/proc_fs.h>
@@ -695,3 +698,472 @@ int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	return ret;
 }
+
+/***********************************************************************
+ * Restart
+ *
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+/**
+ * read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @nr_pages - number of addresses to read
+ */
+static int read_pages_vaddrs(struct ckpt_ctx *ctx, unsigned long nr_pages)
+{
+	struct ckpt_pgarr *pgarr;
+	unsigned long *vaddrp;
+	int nr, ret;
+
+	while (nr_pages) {
+		pgarr = pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = pgarr_nr_free(pgarr);
+		if (nr > nr_pages)
+			nr = nr_pages;
+		vaddrp = &pgarr->vaddrs[pgarr->nr_used];
+		ret = ckpt_kread(ctx, vaddrp, nr * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+		pgarr->nr_used += nr;
+		nr_pages -= nr;
+	}
+	return 0;
+}
+
+static int restore_read_page(struct ckpt_ctx *ctx, struct page *page, void *p)
+{
+	void *ptr;
+	int ret;
+
+	ret = ckpt_kread(ctx, p, PAGE_SIZE);
+	if (ret < 0)
+		return ret;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(ptr, p, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return 0;
+}
+
+/**
+ * read_pages_contents - read in data of pages in page-array chain
+ * @ctx - restart context
+ */
+static int read_pages_contents(struct ckpt_ctx *ctx)
+{
+	struct mm_struct *mm = current->mm;
+	struct ckpt_pgarr *pgarr;
+	unsigned long *vaddrs;
+	char *buf;
+	int i, ret = 0;
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		vaddrs = pgarr->vaddrs;
+		for (i = 0; i < pgarr->nr_used; i++) {
+			struct page *page;
+
+			_ckpt_debug(CKPT_DPAGE, "got page %#lx\n", vaddrs[i]);
+			ret = get_user_pages(current, mm, vaddrs[i],
+					     1, 1, 1, &page, NULL);
+			if (ret < 0)
+				goto out;
+
+			ret = restore_read_page(ctx, page, buf);
+			page_cache_release(page);
+
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	up_read(&mm->mmap_sem);
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * restore_memory_contents - restore contents of a VMA with private memory
+ * @ctx - restart context
+ *
+ * Reads a header that specifies how many pages will follow, then reads
+ * a list of virtual addresses into ctx->pgarr_list page-array chain,
+ * followed by the actual contents of the corresponding pages. Iterates
+ * these steps until reaching a header specifying "0" pages, which marks
+ * the end of the contents.
+ */
+static int restore_memory_contents(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_pgarr *h;
+	unsigned long nr_pages;
+	int len, ret = 0;
+
+	while (1) {
+		h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+		if (IS_ERR(h))
+			return PTR_ERR(h);
+
+		ckpt_debug("total pages %ld\n", (unsigned long) h->nr_pages);
+
+		nr_pages = h->nr_pages;
+		ckpt_hdr_put(ctx, h);
+
+		if (!nr_pages)
+			break;
+
+		len = nr_pages * (sizeof(unsigned long) + PAGE_SIZE);
+		ret = _ckpt_read_buffer(ctx, NULL, len);
+		if (ret < 0)
+			break;
+
+		ret = read_pages_vaddrs(ctx, nr_pages);
+		if (ret < 0)
+			break;
+		ret = read_pages_contents(ctx);
+		if (ret < 0)
+			break;
+		pgarr_reset_all(ctx);
+	}
+
+	return ret;
+}
+
+/**
+ * calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+/**
+ * generic_vma_restore - restore a vma
+ * @mm - address space
+ * @file - file to map (NULL for anonymous)
+ * @h - vma header data
+ */
+static unsigned long generic_vma_restore(struct mm_struct *mm,
+					 struct file *file,
+					 struct ckpt_hdr_vma *h)
+{
+	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+
+	if (h->vm_end < h->vm_start)
+		return -EINVAL;
+	if (h->vma_objref < 0)
+		return -EINVAL;
+
+	vm_start = h->vm_start;
+	vm_pgoff = h->vm_pgoff;
+	vm_size = h->vm_end - h->vm_start;
+	vm_prot = calc_map_prot_bits(h->vm_flags);
+	vm_flags = calc_map_flags_bits(h->vm_flags);
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vm_start, vm_size,
+			     vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	ckpt_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	return addr;
+}
+
+/**
+ * private_vma_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @file: file to use for mapping
+ * @h - vma header data
+ */
+int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			struct file *file, struct ckpt_hdr_vma *h)
+{
+	unsigned long addr;
+
+	if (h->vm_flags & (VM_SHARED | VM_MAYSHARE))
+		return -EINVAL;
+
+	addr = generic_vma_restore(mm, file, h);
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	return restore_memory_contents(ctx);
+}
+
+/**
+ * anon_private_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @h - vma header data
+ */
+static int anon_private_restore(struct ckpt_ctx *ctx,
+				     struct mm_struct *mm,
+				     struct ckpt_hdr_vma *h)
+{
+	/*
+	 * vm_pgoff for anonymous mapping is the "global" page
+	 * offset (namely from addr 0x0), so we force a zero
+	 */
+	h->vm_pgoff = 0;
+
+	return private_vma_restore(ctx, mm, NULL, h);
+}
+
+/* callbacks to restore vma per its type: */
+struct restore_vma_ops {
+	char *vma_name;
+	enum vma_type vma_type;
+	int (*restore) (struct ckpt_ctx *ctx,
+			struct mm_struct *mm,
+			struct ckpt_hdr_vma *ptr);
+};
+
+static struct restore_vma_ops restore_vma_ops[] = {
+	/* ignored vma */
+	{
+		.vma_name = "IGNORE",
+		.vma_type = CKPT_VMA_IGNORE,
+		.restore = NULL,
+	},
+	/* special mapping (vdso) */
+	{
+		.vma_name = "VDSO",
+		.vma_type = CKPT_VMA_VDSO,
+		.restore = special_mapping_restore,
+	},
+	/* anonymous private */
+	{
+		.vma_name = "ANON PRIVATE",
+		.vma_type = CKPT_VMA_ANON,
+		.restore = anon_private_restore,
+	},
+	/* file-mapped private */
+	{
+		.vma_name = "FILE PRIVATE",
+		.vma_type = CKPT_VMA_FILE,
+		.restore = filemap_restore,
+	},
+};
+
+/**
+ * restore_vma - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ */
+static int restore_vma(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_vma *h;
+	struct restore_vma_ops *ops;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d\n",
+		   (unsigned long) h->vm_start, (unsigned long) h->vm_end,
+		   (unsigned long) h->vm_flags, (int) h->vma_type,
+		   (int) h->vma_objref);
+
+	ret = -EINVAL;
+	if (h->vm_end < h->vm_start)
+		goto out;
+	if (h->vma_objref < 0 || h->vma_type >= CKPT_VMA_MAX)
+		goto out;
+
+	ret = -ENOSYS;
+	if (h->vm_flags & CKPT_VMA_NOT_SUPPORTED)
+		goto out;
+
+	ops = &restore_vma_ops[h->vma_type];
+
+	/* make sure we don't change this accidentally */
+	BUG_ON(ops->vma_type != h->vma_type);
+
+	if (ops->restore) {
+		ckpt_debug("vma type %s\n", ops->vma_name);
+		ret = ops->restore(ctx, mm, h);
+	} else {
+		ckpt_debug("vma ignored\n");
+		ret = 0;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+		if (ret < 0) {
+			pr_warning("c/r: failed do_munmap (%d)\n", ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+static struct mm_struct *do_restore_mm(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_mm *h;
+	struct mm_struct *mm = NULL;
+	struct file *file;
+	unsigned int nr;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM);
+	if (IS_ERR(h))
+		return (struct mm_struct *) h;
+
+	ckpt_debug("map_count %d\n", h->map_count);
+
+	/* XXX need more sanity checks */
+
+	ret = -EINVAL;
+	if ((h->start_code > h->end_code) ||
+	    (h->start_data > h->end_data))
+		goto out;
+	if (h->exe_objref < 0)
+		goto out;
+	if (h->def_flags & ~VM_LOCKED)
+		goto out;
+	if (h->flags & ~(MMF_DUMP_FILTER_MASK |
+			 ((1 << MMF_DUMP_FILTER_BITS) - 1)))
+		goto out;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+
+	mm->flags = h->flags;
+	mm->def_flags = h->def_flags;
+
+	mm->start_code = h->start_code;
+	mm->end_code = h->end_code;
+	mm->start_data = h->start_data;
+	mm->end_data = h->end_data;
+	mm->start_brk = h->start_brk;
+	mm->brk = h->brk;
+	mm->start_stack = h->start_stack;
+	mm->arg_start = h->arg_start;
+	mm->arg_end = h->arg_end;
+	mm->env_start = h->env_start;
+	mm->env_end = h->env_end;
+
+	/* restore the ->exe_file */
+	if (h->exe_objref) {
+		file = ckpt_obj_fetch(ctx, h->exe_objref, CKPT_OBJ_FILE);
+		if (IS_ERR(file)) {
+			up_write(&mm->mmap_sem);
+			ret = PTR_ERR(file);
+			goto out;
+		}
+		set_mm_exe_file(mm, file);
+	}
+
+	ret = _ckpt_read_buffer(ctx, mm->saved_auxv, sizeof(mm->saved_auxv));
+	up_write(&mm->mmap_sem);
+	if (ret < 0)
+		goto out;
+
+	for (nr = h->map_count; nr; nr--) {
+		ret = restore_vma(ctx, mm);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_mm_context(ctx, mm);
+ out:
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	/* restore_obj() expects an extra reference */
+	atomic_inc(&mm->mm_users);
+	return mm;
+}
+
+void *restore_mm(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_mm(ctx);
+}
+
+int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref)
+{
+	struct mm_struct *mm;
+	int ret;
+
+	mm = ckpt_obj_fetch(ctx, mm_objref, CKPT_OBJ_MM);
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+
+	if (mm == current->mm)
+		return 0;
+
+	ret = exec_mmap(mm);
+	if (ret < 0)
+		return ret;
+
+	atomic_inc(&mm->mm_users);
+	return 0;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 4c207c3..734be3a 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -148,6 +148,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_grab = obj_mm_grab,
 		.ref_users = obj_mm_users,
 		.checkpoint = checkpoint_mm,
+		.restore = restore_mm,
 	},
 };
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 3d0eb36..1788a39 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -373,6 +373,9 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	ret = restore_obj_file_table(ctx, h->files_objref);
 	ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
 
+	if (ret >= 0)
+		ret = restore_obj_mm(ctx, h->mm_objref);
+	ckpt_debug("mm: ret %d (%p)\n", ret, current->mm);
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 73db44a..6183074 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -373,11 +373,15 @@ int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type)
 static int check_kernel_const(struct ckpt_const *h)
 {
 	struct task_struct *tsk;
+	struct mm_struct *mm;
 	struct new_utsname *uts;
 
 	/* task */
 	if (h->task_comm_len != sizeof(tsk->comm))
 		return -EINVAL;
+	/* mm */
+	if (h->mm_saved_auxv_len != sizeof(mm->saved_auxv))
+		return -EINVAL;
 	/* uts */
 	if (h->uts_release_len != sizeof(uts->release))
 		return -EINVAL;
diff --git a/fs/exec.c b/fs/exec.c
index 172ceb6..c8436a7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -695,7 +695,7 @@ int kernel_read(struct file *file, loff_t offset,
 
 EXPORT_SYMBOL(kernel_read);
 
-static int exec_mmap(struct mm_struct *mm)
+int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct * old_mm, *active_mm;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 70d9506..cb13e8f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -136,6 +136,7 @@ extern int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
 extern int restore_read_header_arch(struct ckpt_ctx *ctx);
 extern int restore_thread(struct ckpt_ctx *ctx);
 extern int restore_cpu(struct ckpt_ctx *ctx);
+extern int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
 
 extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
@@ -176,9 +177,15 @@ extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
 				  int vma_objref);
 
 extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref);
 
 extern int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_mm(struct ckpt_ctx *ctx);
+
+extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			       struct file *file, struct ckpt_hdr_vma *h);
+
 
 #define CKPT_VMA_NOT_SUPPORTED					\
 	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index f29f87a..097d2e2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -250,7 +250,7 @@ struct ckpt_hdr_mm {
 	__u64 arg_start, arg_end, env_start, env_end;
 } __attribute__((aligned(8)));
 
-/* vma subtypes */
+/* vma subtypes - index into restore_vma_ops[] */
 enum vma_type {
 	CKPT_VMA_IGNORE = 0,
 	CKPT_VMA_VDSO,		/* special vdso vma */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5f9889..0d0c46b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1148,6 +1148,9 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 
+/* fs/exec.c */
+extern int exec_mmap(struct mm_struct *mm);
+
 /* filemap.c */
 extern unsigned long page_unuse(struct page *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
@@ -1166,6 +1169,16 @@ extern int filemap_checkpoint(struct ckpt_ctx *, struct vm_area_struct *);
 int write_one_page(struct page *page, int wait);
 void task_dirty_inc(struct task_struct *tsk);
 
+
+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_hdr_vma;
+extern int filemap_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			   struct ckpt_hdr_vma *hh);
+extern int special_mapping_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+				   struct ckpt_hdr_vma *hh);
+#endif
+
 /* readahead.c */
 #define VM_MAX_READAHEAD	128	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
diff --git a/mm/filemap.c b/mm/filemap.c
index 72ba805..ef7205b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1668,6 +1668,25 @@ int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 
 	return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
 }
+
+int filemap_restore(struct ckpt_ctx *ctx,
+		    struct mm_struct *mm,
+		    struct ckpt_hdr_vma *h)
+{
+	struct file *file;
+	int ret;
+
+	if (h->vma_type == CKPT_VMA_FILE &&
+	    (h->vm_flags & (VM_SHARED | VM_MAYSHARE)))
+		return -EINVAL;
+
+	file = ckpt_obj_fetch(ctx, h->vma_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ret = private_vma_restore(ctx, mm, file, h);
+	return ret;
+}
 #endif /* CONFIG_CHECKPOINT */
 
 struct vm_operations_struct generic_file_vm_ops = {
diff --git a/mm/mmap.c b/mm/mmap.c
index 7415a61..3be4678 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2110,7 +2110,7 @@ void exit_mmap(struct mm_struct *mm)
 	tlb = tlb_gather_mmu(mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+	end = vma ? unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL) : 0;
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
@@ -2269,6 +2269,14 @@ static void special_mapping_close(struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_CHECKPOINT
+/*
+ * FIX:
+ *   - checkpoint vdso pages (once per distinct vdso is enough)
+ *   - check for compatibility between saved and current vdso
+ *   - accommodate for dynamic kernel data in vdso page
+ *
+ * Currently, we require COMPAT_VDSO which somewhat mitigates the issue
+ */
 static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 				      struct vm_area_struct *vma)
 {
@@ -2290,6 +2298,19 @@ static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 
 	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
 }
+
+int special_mapping_restore(struct ckpt_ctx *ctx,
+			    struct mm_struct *mm,
+			    struct ckpt_hdr_vma *h)
+{
+	/*
+	 * FIX:
+	 * Currently, we only handle the VDSO/vsyscall special mapping.
+	 * Even that is very basic: call arch_setup_additional_pages(),
+	 * requiring the same mapping (start address) as before.
+	 */
+	return arch_setup_additional_pages(NULL, h->vm_start, 0);
+}
 #endif /* CONFIG_CHECKPOINT */
 
 static struct vm_operations_struct special_mapping_vmops = {
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 38/80] c/r: restore memory address space (private memory)
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the vma state and contents.
For each vma, call do_mmap_pgoff() and then read in its data.
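
As a rough userspace illustration of the per-vma step (a sketch, not the
code in this patch): before the mapping call, the vm_flags saved in the
image have to be translated back into mmap() protection and flag bits.
The SK_VM_* constants and the sk_*() helpers below are hypothetical
stand-ins for the kernel's VM_* bits and for the patch's
calc_map_prot_bits()/calc_map_flags_bits(); only a subset of the flags
is shown.

```c
#include <assert.h>
#include <sys/mman.h>

/* Hypothetical stand-ins for the kernel's VM_* bits (illustration only). */
#define SK_VM_READ	0x001UL
#define SK_VM_WRITE	0x002UL
#define SK_VM_EXEC	0x004UL
#define SK_VM_MAYSHARE	0x080UL

/* Translate saved vma flags into mmap() protection bits. */
static unsigned long sk_prot_bits(unsigned long vm_flags)
{
	unsigned long prot = PROT_NONE;

	if (vm_flags & SK_VM_READ)
		prot |= PROT_READ;
	if (vm_flags & SK_VM_WRITE)
		prot |= PROT_WRITE;
	if (vm_flags & SK_VM_EXEC)
		prot |= PROT_EXEC;
	return prot;
}

/*
 * Translate saved vma flags into mmap() flags. The mapping must come
 * back at the saved start address, hence MAP_FIXED is always set.
 */
static unsigned long sk_map_flags(unsigned long vm_flags)
{
	unsigned long flags = MAP_FIXED;

	flags |= (vm_flags & SK_VM_MAYSHARE) ? MAP_SHARED : MAP_PRIVATE;
	return flags;
}
```

In this patch only private mappings are restored, so the MAP_SHARED
branch would be rejected earlier; it is kept here only to show the
translation.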

Changelog[v18]:
  - Tighten checks on supported vma to checkpoint or restart
Changelog[v17]:
  - Restore mm->{flags,def_flags,saved_auxv}
  - Fix bogus warning in do_restore_mm()
Changelog[v16]:
  - Restore mm->exe_file
Changelog[v14]:
  - Introduce per vma-type restore() function
  - Merge restart code into same file as checkpoint (memory.c)
  - Compare saved 'vdso' field of mm_context with current value
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'
  - Revert change to pr_debug(), back to ckpt_debug()
Changelog[v13]:
  - Avoid access to hh->vma_type after the header is freed
  - Test for no vma's in exit_mmap() before calling unmap_vmas() (or it
    may crash if restart fails after having removed all vma's)
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)
Changelog[v5]:
  - Improve memory restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
Changelog[v4]:
  - Use standard list_... for ckpt_pgarr


Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/include/asm/ldt.h     |    7 +
 arch/x86/mm/checkpoint.c       |   64 ++++++
 checkpoint/memory.c            |  472 ++++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c           |    1 +
 checkpoint/process.c           |    3 +
 checkpoint/restart.c           |    4 +
 fs/exec.c                      |    2 +-
 include/linux/checkpoint.h     |    7 +
 include/linux/checkpoint_hdr.h |    2 +-
 include/linux/mm.h             |   13 +
 mm/filemap.c                   |   19 ++
 mm/mmap.c                      |   23 ++-
 12 files changed, 614 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/ldt.h b/arch/x86/include/asm/ldt.h
index 46727eb..f2845f9 100644
--- a/arch/x86/include/asm/ldt.h
+++ b/arch/x86/include/asm/ldt.h
@@ -37,4 +37,11 @@ struct user_desc {
 #define MODIFY_LDT_CONTENTS_CODE	2
 
 #endif /* !__ASSEMBLY__ */
+
+#ifdef __KERNEL__
+#include <linux/linkage.h>
+asmlinkage int sys_modify_ldt(int func, void __user *ptr,
+			      unsigned long bytecount);
+#endif
+
 #endif /* _ASM_X86_LDT_H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 83ed047..9dd8e12 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -13,6 +13,7 @@
 
 #include <asm/desc.h>
 #include <asm/i387.h>
+#include <asm/elf.h>
 
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -560,3 +561,66 @@ int restore_read_header_arch(struct ckpt_ctx *ctx)
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	unsigned int n;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("nldt %d vdso %#lx (%p)\n",
+		 h->nldt, (unsigned long) h->vdso, mm->context.vdso);
+
+	ret = -EINVAL;
+	if (h->vdso != (unsigned long) mm->context.vdso)
+		goto out;
+	if (h->ldt_entry_size != LDT_ENTRY_SIZE)
+		goto out;
+
+	ret = _ckpt_read_obj_type(ctx, NULL,
+				  h->nldt * LDT_ENTRY_SIZE,
+				  CKPT_HDR_MM_CONTEXT_LDT);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+	 */
+	for (n = 0; n < h->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+
+		ret = ckpt_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			break;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16);
+		info.limit = desc.limit0;
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+				     sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			break;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index d19d627..6a0dfaf 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -15,6 +15,9 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/file.h>
+#include <linux/err.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
 #include <linux/proc_fs.h>
@@ -695,3 +698,472 @@ int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	return ret;
 }
+
+/***********************************************************************
+ * Restart
+ *
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+/**
+ * read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @nr_pages - number of addresses to read
+ */
+static int read_pages_vaddrs(struct ckpt_ctx *ctx, unsigned long nr_pages)
+{
+	struct ckpt_pgarr *pgarr;
+	unsigned long *vaddrp;
+	int nr, ret;
+
+	while (nr_pages) {
+		pgarr = pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = pgarr_nr_free(pgarr);
+		if (nr > nr_pages)
+			nr = nr_pages;
+		vaddrp = &pgarr->vaddrs[pgarr->nr_used];
+		ret = ckpt_kread(ctx, vaddrp, nr * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+		pgarr->nr_used += nr;
+		nr_pages -= nr;
+	}
+	return 0;
+}
+
+static int restore_read_page(struct ckpt_ctx *ctx, struct page *page, void *p)
+{
+	void *ptr;
+	int ret;
+
+	ret = ckpt_kread(ctx, p, PAGE_SIZE);
+	if (ret < 0)
+		return ret;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(ptr, p, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return 0;
+}
+
+/**
+ * read_pages_contents - read in data of pages in page-array chain
+ * @ctx - restart context
+ */
+static int read_pages_contents(struct ckpt_ctx *ctx)
+{
+	struct mm_struct *mm = current->mm;
+	struct ckpt_pgarr *pgarr;
+	unsigned long *vaddrs;
+	char *buf;
+	int i, ret = 0;
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		vaddrs = pgarr->vaddrs;
+		for (i = 0; i < pgarr->nr_used; i++) {
+			struct page *page;
+
+			_ckpt_debug(CKPT_DPAGE, "got page %#lx\n", vaddrs[i]);
+			ret = get_user_pages(current, mm, vaddrs[i],
+					     1, 1, 1, &page, NULL);
+			if (ret < 0)
+				goto out;
+
+			ret = restore_read_page(ctx, page, buf);
+			page_cache_release(page);
+
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	up_read(&mm->mmap_sem);
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * restore_memory_contents - restore contents of a VMA with private memory
+ * @ctx - restart context
+ *
+ * Reads a header that specifies how many pages will follow, then reads
+ * a list of virtual addresses into ctx->pgarr_list page-array chain,
+ * followed by the actual contents of the corresponding pages. Iterates
+ * these steps until reaching a header specifying "0" pages, which marks
+ * the end of the contents.
+ */
+static int restore_memory_contents(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_pgarr *h;
+	unsigned long nr_pages;
+	int len, ret = 0;
+
+	while (1) {
+		h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+		if (IS_ERR(h))
+			return PTR_ERR(h);
+
+		ckpt_debug("total pages %ld\n", (unsigned long) h->nr_pages);
+
+		nr_pages = h->nr_pages;
+		ckpt_hdr_put(ctx, h);
+
+		if (!nr_pages)
+			break;
+
+		len = nr_pages * (sizeof(unsigned long) + PAGE_SIZE);
+		ret = _ckpt_read_buffer(ctx, NULL, len);
+		if (ret < 0)
+			break;
+
+		ret = read_pages_vaddrs(ctx, nr_pages);
+		if (ret < 0)
+			break;
+		ret = read_pages_contents(ctx);
+		if (ret < 0)
+			break;
+		pgarr_reset_all(ctx);
+	}
+
+	return ret;
+}
+
+/**
+ * calc_map_prot_bits - convert vm_flags to mmap protection
+ * orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * calc_map_flags_bits - convert vm_flags to mmap flags
+ * orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+/**
+ * generic_vma_restore - restore a vma
+ * @mm - address space
+ * @file - file to map (NULL for anonymous)
+ * @h - vma header data
+ */
+static unsigned long generic_vma_restore(struct mm_struct *mm,
+					 struct file *file,
+					 struct ckpt_hdr_vma *h)
+{
+	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+
+	if (h->vm_end < h->vm_start)
+		return -EINVAL;
+	if (h->vma_objref < 0)
+		return -EINVAL;
+
+	vm_start = h->vm_start;
+	vm_pgoff = h->vm_pgoff;
+	vm_size = h->vm_end - h->vm_start;
+	vm_prot = calc_map_prot_bits(h->vm_flags);
+	vm_flags = calc_map_flags_bits(h->vm_flags);
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vm_start, vm_size,
+			     vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	ckpt_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	return addr;
+}
+
+/**
+ * private_vma_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @file: file to use for mapping
+ * @h - vma header data
+ */
+int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			struct file *file, struct ckpt_hdr_vma *h)
+{
+	unsigned long addr;
+
+	if (h->vm_flags & (VM_SHARED | VM_MAYSHARE))
+		return -EINVAL;
+
+	addr = generic_vma_restore(mm, file, h);
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	return restore_memory_contents(ctx);
+}
+
+/**
+ * anon_private_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @h - vma header data
+ */
+static int anon_private_restore(struct ckpt_ctx *ctx,
+				     struct mm_struct *mm,
+				     struct ckpt_hdr_vma *h)
+{
+	/*
+	 * vm_pgoff for anonymous mapping is the "global" page
+	 * offset (namely from addr 0x0), so we force a zero
+	 */
+	h->vm_pgoff = 0;
+
+	return private_vma_restore(ctx, mm, NULL, h);
+}
+
+/* callbacks to restore vma per its type: */
+struct restore_vma_ops {
+	char *vma_name;
+	enum vma_type vma_type;
+	int (*restore) (struct ckpt_ctx *ctx,
+			struct mm_struct *mm,
+			struct ckpt_hdr_vma *ptr);
+};
+
+static struct restore_vma_ops restore_vma_ops[] = {
+	/* ignored vma */
+	{
+		.vma_name = "IGNORE",
+		.vma_type = CKPT_VMA_IGNORE,
+		.restore = NULL,
+	},
+	/* special mapping (vdso) */
+	{
+		.vma_name = "VDSO",
+		.vma_type = CKPT_VMA_VDSO,
+		.restore = special_mapping_restore,
+	},
+	/* anonymous private */
+	{
+		.vma_name = "ANON PRIVATE",
+		.vma_type = CKPT_VMA_ANON,
+		.restore = anon_private_restore,
+	},
+	/* file-mapped private */
+	{
+		.vma_name = "FILE PRIVATE",
+		.vma_type = CKPT_VMA_FILE,
+		.restore = filemap_restore,
+	},
+};
+
+/**
+ * restore_vma - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ */
+static int restore_vma(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_vma *h;
+	struct restore_vma_ops *ops;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d\n",
+		   (unsigned long) h->vm_start, (unsigned long) h->vm_end,
+		   (unsigned long) h->vm_flags, (int) h->vma_type,
+		   (int) h->vma_objref);
+
+	ret = -EINVAL;
+	if (h->vm_end < h->vm_start)
+		goto out;
+	if (h->vma_objref < 0 || h->vma_type >= CKPT_VMA_MAX)
+		goto out;
+	if (h->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		ret = -ENOSYS;
+		goto out;
+	}
+
+	ops = &restore_vma_ops[h->vma_type];
+
+	/* make sure we don't change this accidentally */
+	BUG_ON(ops->vma_type != h->vma_type);
+
+	if (ops->restore) {
+		ckpt_debug("vma type %s\n", ops->vma_name);
+		ret = ops->restore(ctx, mm, h);
+	} else {
+		ckpt_debug("vma ignored\n");
+		ret = 0;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+		if (ret < 0) {
+			pr_warning("c/r: failed do_munmap (%d)\n", ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+static struct mm_struct *do_restore_mm(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_mm *h;
+	struct mm_struct *mm = NULL;
+	struct file *file;
+	unsigned int nr;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM);
+	if (IS_ERR(h))
+		return (struct mm_struct *) h;
+
+	ckpt_debug("map_count %d\n", h->map_count);
+
+	/* XXX need more sanity checks */
+
+	ret = -EINVAL;
+	if ((h->start_code > h->end_code) ||
+	    (h->start_data > h->end_data))
+		goto out;
+	if (h->exe_objref < 0)
+		goto out;
+	if (h->def_flags & ~VM_LOCKED)
+		goto out;
+	if (h->flags & ~(MMF_DUMP_FILTER_MASK |
+			 ((1 << MMF_DUMP_FILTER_BITS) - 1)))
+		goto out;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+
+	mm->flags = h->flags;
+	mm->def_flags = h->def_flags;
+
+	mm->start_code = h->start_code;
+	mm->end_code = h->end_code;
+	mm->start_data = h->start_data;
+	mm->end_data = h->end_data;
+	mm->start_brk = h->start_brk;
+	mm->brk = h->brk;
+	mm->start_stack = h->start_stack;
+	mm->arg_start = h->arg_start;
+	mm->arg_end = h->arg_end;
+	mm->env_start = h->env_start;
+	mm->env_end = h->env_end;
+
+	/* restore the ->exe_file */
+	if (h->exe_objref) {
+		file = ckpt_obj_fetch(ctx, h->exe_objref, CKPT_OBJ_FILE);
+		if (IS_ERR(file)) {
+			up_write(&mm->mmap_sem);
+			ret = PTR_ERR(file);
+			goto out;
+		}
+		set_mm_exe_file(mm, file);
+	}
+
+	ret = _ckpt_read_buffer(ctx, mm->saved_auxv, sizeof(mm->saved_auxv));
+	up_write(&mm->mmap_sem);
+	if (ret < 0)
+		goto out;
+
+	for (nr = h->map_count; nr; nr--) {
+		ret = restore_vma(ctx, mm);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_mm_context(ctx, mm);
+ out:
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	/* restore_obj() expects an extra reference */
+	atomic_inc(&mm->mm_users);
+	return mm;
+}
+
+void *restore_mm(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_mm(ctx);
+}
+
+int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref)
+{
+	struct mm_struct *mm;
+	int ret;
+
+	mm = ckpt_obj_fetch(ctx, mm_objref, CKPT_OBJ_MM);
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+
+	if (mm == current->mm)
+		return 0;
+
+	ret = exec_mmap(mm);
+	if (ret < 0)
+		return ret;
+
+	atomic_inc(&mm->mm_users);
+	return 0;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 4c207c3..734be3a 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -148,6 +148,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_grab = obj_mm_grab,
 		.ref_users = obj_mm_users,
 		.checkpoint = checkpoint_mm,
+		.restore = restore_mm,
 	},
 };
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 3d0eb36..1788a39 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -373,6 +373,9 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	ret = restore_obj_file_table(ctx, h->files_objref);
 	ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
 
+	ret = restore_obj_mm(ctx, h->mm_objref);
+	ckpt_debug("mm: ret %d (%p)\n", ret, current->mm);
+
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 73db44a..6183074 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -373,11 +373,15 @@ int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type)
 static int check_kernel_const(struct ckpt_const *h)
 {
 	struct task_struct *tsk;
+	struct mm_struct *mm;
 	struct new_utsname *uts;
 
 	/* task */
 	if (h->task_comm_len != sizeof(tsk->comm))
 		return -EINVAL;
+	/* mm */
+	if (h->mm_saved_auxv_len != sizeof(mm->saved_auxv))
+		return -EINVAL;
 	/* uts */
 	if (h->uts_release_len != sizeof(uts->release))
 		return -EINVAL;
diff --git a/fs/exec.c b/fs/exec.c
index 172ceb6..c8436a7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -695,7 +695,7 @@ int kernel_read(struct file *file, loff_t offset,
 
 EXPORT_SYMBOL(kernel_read);
 
-static int exec_mmap(struct mm_struct *mm)
+int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct * old_mm, *active_mm;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 70d9506..cb13e8f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -136,6 +136,7 @@ extern int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
 extern int restore_read_header_arch(struct ckpt_ctx *ctx);
 extern int restore_thread(struct ckpt_ctx *ctx);
 extern int restore_cpu(struct ckpt_ctx *ctx);
+extern int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
 
 extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
@@ -176,9 +177,15 @@ extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
 				  int vma_objref);
 
 extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref);
 
 extern int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_mm(struct ckpt_ctx *ctx);
+
+extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			       struct file *file, struct ckpt_hdr_vma *h);
+
 
 #define CKPT_VMA_NOT_SUPPORTED					\
 	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index f29f87a..097d2e2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -250,7 +250,7 @@ struct ckpt_hdr_mm {
 	__u64 arg_start, arg_end, env_start, env_end;
 } __attribute__((aligned(8)));
 
-/* vma subtypes */
+/* vma subtypes - index into restore_vma_ops[] */
 enum vma_type {
 	CKPT_VMA_IGNORE = 0,
 	CKPT_VMA_VDSO,		/* special vdso vma */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5f9889..0d0c46b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1148,6 +1148,9 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 
+/* fs/exec.c */
+extern int exec_mmap(struct mm_struct *mm);
+
 /* filemap.c */
 extern unsigned long page_unuse(struct page *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
@@ -1166,6 +1169,16 @@ extern int filemap_checkpoint(struct ckpt_ctx *, struct vm_area_struct *);
 int write_one_page(struct page *page, int wait);
 void task_dirty_inc(struct task_struct *tsk);
 
+
+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_hdr_vma;
+extern int filemap_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			   struct ckpt_hdr_vma *hh);
+extern int special_mapping_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+				   struct ckpt_hdr_vma *hh);
+#endif
+
 /* readahead.c */
 #define VM_MAX_READAHEAD	128	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
diff --git a/mm/filemap.c b/mm/filemap.c
index 72ba805..ef7205b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1668,6 +1668,25 @@ int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 
 	return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
 }
+
+int filemap_restore(struct ckpt_ctx *ctx,
+		    struct mm_struct *mm,
+		    struct ckpt_hdr_vma *h)
+{
+	struct file *file;
+	int ret;
+
+	if (h->vma_type == CKPT_VMA_FILE &&
+	    (h->vm_flags & (VM_SHARED | VM_MAYSHARE)))
+		return -EINVAL;
+
+	file = ckpt_obj_fetch(ctx, h->vma_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ret = private_vma_restore(ctx, mm, file, h);
+	return ret;
+}
 #endif /* CONFIG_CHECKPOINT */
 
 struct vm_operations_struct generic_file_vm_ops = {
diff --git a/mm/mmap.c b/mm/mmap.c
index 7415a61..3be4678 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2110,7 +2110,7 @@ void exit_mmap(struct mm_struct *mm)
 	tlb = tlb_gather_mmu(mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+	end = vma ? unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL) : 0;
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
@@ -2269,6 +2269,14 @@ static void special_mapping_close(struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_CHECKPOINT
+/*
+ * FIX:
+ *   - checkpoint vdso pages (once per distinct vdso is enough)
+ *   - check for compatibility between saved and current vdso
+ *   - accommodate for dynamic kernel data in vdso page
+ *
+ * Currently, we require COMPAT_VDSO which somewhat mitigates the issue
+ */
 static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 				      struct vm_area_struct *vma)
 {
@@ -2290,6 +2298,19 @@ static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 
 	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
 }
+
+int special_mapping_restore(struct ckpt_ctx *ctx,
+			    struct mm_struct *mm,
+			    struct ckpt_hdr_vma *h)
+{
+	/*
+	 * FIX:
+	 * Currently, we only handle the VDSO/vsyscall special mapping.
+	 * Even that is very basic: call arch_setup_additional_pages(),
+	 * requiring the same mapping (start address) as before.
+	 */
+	return arch_setup_additional_pages(NULL, h->vm_start, 0);
+}
 #endif /* CONFIG_CHECKPOINT */
 
 static struct vm_operations_struct special_mapping_vmops = {
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 38/80] c/r: restore memory address space (private memory)
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the vma state and contents.
For each vma, call do_mmap_pgoff() and then read in its data.

Changelog[v18]:
  - Tighten checks on supported vma to checkpoint or restart
Changelog[v17]:
  - Restore mm->{flags,def_flags,saved_auxv}
  - Fix bogus warning in do_restore_mm()
Changelog[v16]:
  - Restore mm->exe_file
Changelog[v14]:
  - Introduce per vma-type restore() function
  - Merge restart code into same file as checkpoint (memory.c)
  - Compare saved 'vdso' field of mm_context with current value
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'
  - Revert change to pr_debug(), back to ckpt_debug()
Changelog[v13]:
  - Avoid access to hh->vma_type after the header is freed
  - Test for no vma's in exit_mmap() before calling unmap_vmas() (or it
    may crash if restart fails after having removed all vma's)
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)
Changelog[v5]:
  - Improve memory restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
Changelog[v4]:
  - Use standard list_... for ckpt_pgarr


Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/include/asm/ldt.h     |    7 +
 arch/x86/mm/checkpoint.c       |   64 ++++++
 checkpoint/memory.c            |  472 ++++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c           |    1 +
 checkpoint/process.c           |    3 +
 checkpoint/restart.c           |    4 +
 fs/exec.c                      |    2 +-
 include/linux/checkpoint.h     |    7 +
 include/linux/checkpoint_hdr.h |    2 +-
 include/linux/mm.h             |   13 +
 mm/filemap.c                   |   19 ++
 mm/mmap.c                      |   23 ++-
 12 files changed, 614 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/ldt.h b/arch/x86/include/asm/ldt.h
index 46727eb..f2845f9 100644
--- a/arch/x86/include/asm/ldt.h
+++ b/arch/x86/include/asm/ldt.h
@@ -37,4 +37,11 @@ struct user_desc {
 #define MODIFY_LDT_CONTENTS_CODE	2
 
 #endif /* !__ASSEMBLY__ */
+
+#ifdef __KERNEL__
+#include <linux/linkage.h>
+asmlinkage int sys_modify_ldt(int func, void __user *ptr,
+			      unsigned long bytecount);
+#endif
+
 #endif /* _ASM_X86_LDT_H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 83ed047..9dd8e12 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -13,6 +13,7 @@
 
 #include <asm/desc.h>
 #include <asm/i387.h>
+#include <asm/elf.h>
 
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -560,3 +561,66 @@ int restore_read_header_arch(struct ckpt_ctx *ctx)
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	unsigned int n;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("nldt %d vdso %#lx (%p)\n",
+		 h->nldt, (unsigned long) h->vdso, mm->context.vdso);
+
+	ret = -EINVAL;
+	if (h->vdso != (unsigned long) mm->context.vdso)
+		goto out;
+	if (h->ldt_entry_size != LDT_ENTRY_SIZE)
+		goto out;
+
+	ret = _ckpt_read_obj_type(ctx, NULL,
+				  h->nldt * LDT_ENTRY_SIZE,
+				  CKPT_HDR_MM_CONTEXT_LDT);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+	 */
+	for (n = 0; n < h->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+
+		ret = ckpt_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			break;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16);
+		info.limit = desc.limit0;
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+				     sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			break;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index d19d627..6a0dfaf 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -15,6 +15,9 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/file.h>
+#include <linux/err.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
 #include <linux/proc_fs.h>
@@ -695,3 +698,472 @@ int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	return ret;
 }
+
+/***********************************************************************
+ * Restart
+ *
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+/**
+ * read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @nr_pages - number of addresses to read
+ */
+static int read_pages_vaddrs(struct ckpt_ctx *ctx, unsigned long nr_pages)
+{
+	struct ckpt_pgarr *pgarr;
+	unsigned long *vaddrp;
+	int nr, ret;
+
+	while (nr_pages) {
+		pgarr = pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = pgarr_nr_free(pgarr);
+		if (nr > nr_pages)
+			nr = nr_pages;
+		vaddrp = &pgarr->vaddrs[pgarr->nr_used];
+		ret = ckpt_kread(ctx, vaddrp, nr * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+		pgarr->nr_used += nr;
+		nr_pages -= nr;
+	}
+	return 0;
+}
+
+static int restore_read_page(struct ckpt_ctx *ctx, struct page *page, void *p)
+{
+	void *ptr;
+	int ret;
+
+	ret = ckpt_kread(ctx, p, PAGE_SIZE);
+	if (ret < 0)
+		return ret;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(ptr, p, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return 0;
+}
+
+/**
+ * read_pages_contents - read in data of pages in page-array chain
+ * @ctx - restart context
+ */
+static int read_pages_contents(struct ckpt_ctx *ctx)
+{
+	struct mm_struct *mm = current->mm;
+	struct ckpt_pgarr *pgarr;
+	unsigned long *vaddrs;
+	char *buf;
+	int i, ret = 0;
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		vaddrs = pgarr->vaddrs;
+		for (i = 0; i < pgarr->nr_used; i++) {
+			struct page *page;
+
+			_ckpt_debug(CKPT_DPAGE, "got page %#lx\n", vaddrs[i]);
+			ret = get_user_pages(current, mm, vaddrs[i],
+					     1, 1, 1, &page, NULL);
+			if (ret < 0)
+				goto out;
+
+			ret = restore_read_page(ctx, page, buf);
+			page_cache_release(page);
+
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	up_read(&mm->mmap_sem);
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * restore_memory_contents - restore contents of a VMA with private memory
+ * @ctx - restart context
+ *
+ * Reads a header that specifies how many pages will follow, then reads
+ * a list of virtual addresses into ctx->pgarr_list page-array chain,
+ * followed by the actual contents of the corresponding pages. Iterates
+ * these steps until reaching a header specifying "0" pages, which marks
+ * the end of the contents.
+ */
+static int restore_memory_contents(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_pgarr *h;
+	unsigned long nr_pages;
+	int len, ret = 0;
+
+	while (1) {
+		h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+		if (IS_ERR(h))
+			return PTR_ERR(h);
+
+		ckpt_debug("total pages %ld\n", (unsigned long) h->nr_pages);
+
+		nr_pages = h->nr_pages;
+		ckpt_hdr_put(ctx, h);
+
+		if (!nr_pages)
+			break;
+
+		len = nr_pages * (sizeof(unsigned long) + PAGE_SIZE);
+		ret = _ckpt_read_buffer(ctx, NULL, len);
+		if (ret < 0)
+			break;
+
+		ret = read_pages_vaddrs(ctx, nr_pages);
+		if (ret < 0)
+			break;
+		ret = read_pages_contents(ctx);
+		if (ret < 0)
+			break;
+		pgarr_reset_all(ctx);
+	}
+
+	return ret;
+}
+
+/**
+ * calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+/**
+ * generic_vma_restore - restore a vma
+ * @mm - address space
+ * @file - file to map (NULL for anonymous)
+ * @h - vma header data
+ */
+static unsigned long generic_vma_restore(struct mm_struct *mm,
+					 struct file *file,
+					 struct ckpt_hdr_vma *h)
+{
+	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+
+	if (h->vm_end < h->vm_start)
+		return -EINVAL;
+	if (h->vma_objref < 0)
+		return -EINVAL;
+
+	vm_start = h->vm_start;
+	vm_pgoff = h->vm_pgoff;
+	vm_size = h->vm_end - h->vm_start;
+	vm_prot = calc_map_prot_bits(h->vm_flags);
+	vm_flags = calc_map_flags_bits(h->vm_flags);
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vm_start, vm_size,
+			     vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	ckpt_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	return addr;
+}
+
+/**
+ * private_vma_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @file: file to use for mapping
+ * @h - vma header data
+ */
+int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			struct file *file, struct ckpt_hdr_vma *h)
+{
+	unsigned long addr;
+
+	if (h->vm_flags & (VM_SHARED | VM_MAYSHARE))
+		return -EINVAL;
+
+	addr = generic_vma_restore(mm, file, h);
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	return restore_memory_contents(ctx);
+}
+
+/**
+ * anon_private_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @h - vma header data
+ */
+static int anon_private_restore(struct ckpt_ctx *ctx,
+				     struct mm_struct *mm,
+				     struct ckpt_hdr_vma *h)
+{
+	/*
+	 * vm_pgoff for anonymous mapping is the "global" page
+	 * offset (namely from addr 0x0), so we force a zero
+	 */
+	h->vm_pgoff = 0;
+
+	return private_vma_restore(ctx, mm, NULL, h);
+}
+
+/* callbacks to restore vma per its type: */
+struct restore_vma_ops {
+	char *vma_name;
+	enum vma_type vma_type;
+	int (*restore) (struct ckpt_ctx *ctx,
+			struct mm_struct *mm,
+			struct ckpt_hdr_vma *ptr);
+};
+
+static struct restore_vma_ops restore_vma_ops[] = {
+	/* ignored vma */
+	{
+		.vma_name = "IGNORE",
+		.vma_type = CKPT_VMA_IGNORE,
+		.restore = NULL,
+	},
+	/* special mapping (vdso) */
+	{
+		.vma_name = "VDSO",
+		.vma_type = CKPT_VMA_VDSO,
+		.restore = special_mapping_restore,
+	},
+	/* anonymous private */
+	{
+		.vma_name = "ANON PRIVATE",
+		.vma_type = CKPT_VMA_ANON,
+		.restore = anon_private_restore,
+	},
+	/* file-mapped private */
+	{
+		.vma_name = "FILE PRIVATE",
+		.vma_type = CKPT_VMA_FILE,
+		.restore = filemap_restore,
+	},
+};
+
+/**
+ * restore_vma - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ */
+static int restore_vma(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_vma *h;
+	struct restore_vma_ops *ops;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d\n",
+		   (unsigned long) h->vm_start, (unsigned long) h->vm_end,
+		   (unsigned long) h->vm_flags, (int) h->vma_type,
+		   (int) h->vma_objref);
+
+	ret = -EINVAL;
+	if (h->vm_end < h->vm_start || h->vma_objref < 0)
+		goto out;
+	if (h->vma_type >= CKPT_VMA_MAX)
+		goto out;
+	/* unsupported flags: fail, but still release the header at 'out' */
+	ret = -ENOSYS;
+	if (h->vm_flags & CKPT_VMA_NOT_SUPPORTED)
+		goto out;
+
+	ops = &restore_vma_ops[h->vma_type];
+
+	/* make sure we don't change this accidentally */
+	BUG_ON(ops->vma_type != h->vma_type);
+
+	if (ops->restore) {
+		ckpt_debug("vma type %s\n", ops->vma_name);
+		ret = ops->restore(ctx, mm, h);
+	} else {
+		ckpt_debug("vma ignored\n");
+		ret = 0;
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+		if (ret < 0) {
+			pr_warning("c/r: failed do_munmap (%d)\n", ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+static struct mm_struct *do_restore_mm(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_mm *h;
+	struct mm_struct *mm = NULL;
+	struct file *file;
+	unsigned int nr;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM);
+	if (IS_ERR(h))
+		return (struct mm_struct *) h;
+
+	ckpt_debug("map_count %d\n", h->map_count);
+
+	/* XXX need more sanity checks */
+
+	ret = -EINVAL;
+	if ((h->start_code > h->end_code) ||
+	    (h->start_data > h->end_data))
+		goto out;
+	if (h->exe_objref < 0)
+		goto out;
+	if (h->def_flags & ~VM_LOCKED)
+		goto out;
+	if (h->flags & ~(MMF_DUMP_FILTER_MASK |
+			 ((1 << MMF_DUMP_FILTER_BITS) - 1)))
+		goto out;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+
+	mm->flags = h->flags;
+	mm->def_flags = h->def_flags;
+
+	mm->start_code = h->start_code;
+	mm->end_code = h->end_code;
+	mm->start_data = h->start_data;
+	mm->end_data = h->end_data;
+	mm->start_brk = h->start_brk;
+	mm->brk = h->brk;
+	mm->start_stack = h->start_stack;
+	mm->arg_start = h->arg_start;
+	mm->arg_end = h->arg_end;
+	mm->env_start = h->env_start;
+	mm->env_end = h->env_end;
+
+	/* restore the ->exe_file */
+	if (h->exe_objref) {
+		file = ckpt_obj_fetch(ctx, h->exe_objref, CKPT_OBJ_FILE);
+		if (IS_ERR(file)) {
+			up_write(&mm->mmap_sem);
+			ret = PTR_ERR(file);
+			goto out;
+		}
+		set_mm_exe_file(mm, file);
+	}
+
+	ret = _ckpt_read_buffer(ctx, mm->saved_auxv, sizeof(mm->saved_auxv));
+	up_write(&mm->mmap_sem);
+	if (ret < 0)
+		goto out;
+
+	for (nr = h->map_count; nr; nr--) {
+		ret = restore_vma(ctx, mm);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_mm_context(ctx, mm);
+ out:
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	/* restore_obj() expects an extra reference */
+	atomic_inc(&mm->mm_users);
+	return mm;
+}
+
+void *restore_mm(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_mm(ctx);
+}
+
+int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref)
+{
+	struct mm_struct *mm;
+	int ret;
+
+	mm = ckpt_obj_fetch(ctx, mm_objref, CKPT_OBJ_MM);
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+
+	if (mm == current->mm)
+		return 0;
+
+	ret = exec_mmap(mm);
+	if (ret < 0)
+		return ret;
+
+	atomic_inc(&mm->mm_users);
+	return 0;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 4c207c3..734be3a 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -148,6 +148,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_grab = obj_mm_grab,
 		.ref_users = obj_mm_users,
 		.checkpoint = checkpoint_mm,
+		.restore = restore_mm,
 	},
 };
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 3d0eb36..1788a39 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -373,6 +373,9 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	ret = restore_obj_file_table(ctx, h->files_objref);
 	ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
 
+	ret = restore_obj_mm(ctx, h->mm_objref);
+	ckpt_debug("mm: ret %d (%p)\n", ret, current->mm);
+
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 73db44a..6183074 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -373,11 +373,15 @@ int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type)
 static int check_kernel_const(struct ckpt_const *h)
 {
 	struct task_struct *tsk;
+	struct mm_struct *mm;
 	struct new_utsname *uts;
 
 	/* task */
 	if (h->task_comm_len != sizeof(tsk->comm))
 		return -EINVAL;
+	/* mm */
+	if (h->mm_saved_auxv_len != sizeof(mm->saved_auxv))
+		return -EINVAL;
 	/* uts */
 	if (h->uts_release_len != sizeof(uts->release))
 		return -EINVAL;
diff --git a/fs/exec.c b/fs/exec.c
index 172ceb6..c8436a7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -695,7 +695,7 @@ int kernel_read(struct file *file, loff_t offset,
 
 EXPORT_SYMBOL(kernel_read);
 
-static int exec_mmap(struct mm_struct *mm)
+int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct * old_mm, *active_mm;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 70d9506..cb13e8f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -136,6 +136,7 @@ extern int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
 extern int restore_read_header_arch(struct ckpt_ctx *ctx);
 extern int restore_thread(struct ckpt_ctx *ctx);
 extern int restore_cpu(struct ckpt_ctx *ctx);
+extern int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
 
 extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
@@ -176,9 +177,15 @@ extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
 				  int vma_objref);
 
 extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref);
 
 extern int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_mm(struct ckpt_ctx *ctx);
+
+extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			       struct file *file, struct ckpt_hdr_vma *h);
+
 
 #define CKPT_VMA_NOT_SUPPORTED					\
 	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index f29f87a..097d2e2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -250,7 +250,7 @@ struct ckpt_hdr_mm {
 	__u64 arg_start, arg_end, env_start, env_end;
 } __attribute__((aligned(8)));
 
-/* vma subtypes */
+/* vma subtypes - index into restore_vma_ops[] */
 enum vma_type {
 	CKPT_VMA_IGNORE = 0,
 	CKPT_VMA_VDSO,		/* special vdso vma */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5f9889..0d0c46b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1148,6 +1148,9 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 
+/* fs/exec.c */
+extern int exec_mmap(struct mm_struct *mm);
+
 /* filemap.c */
 extern unsigned long page_unuse(struct page *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
@@ -1166,6 +1169,16 @@ extern int filemap_checkpoint(struct ckpt_ctx *, struct vm_area_struct *);
 int write_one_page(struct page *page, int wait);
 void task_dirty_inc(struct task_struct *tsk);
 
+
+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_hdr_vma;
+extern int filemap_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			   struct ckpt_hdr_vma *hh);
+extern int special_mapping_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+				   struct ckpt_hdr_vma *hh);
+#endif
+
 /* readahead.c */
 #define VM_MAX_READAHEAD	128	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
diff --git a/mm/filemap.c b/mm/filemap.c
index 72ba805..ef7205b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1668,6 +1668,25 @@ int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 
 	return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
 }
+
+int filemap_restore(struct ckpt_ctx *ctx,
+		    struct mm_struct *mm,
+		    struct ckpt_hdr_vma *h)
+{
+	struct file *file;
+	int ret;
+
+	if (h->vma_type == CKPT_VMA_FILE &&
+	    (h->vm_flags & (VM_SHARED | VM_MAYSHARE)))
+		return -EINVAL;
+
+	file = ckpt_obj_fetch(ctx, h->vma_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ret = private_vma_restore(ctx, mm, file, h);
+	return ret;
+}
 #endif /* CONFIG_CHECKPOINT */
 
 struct vm_operations_struct generic_file_vm_ops = {
diff --git a/mm/mmap.c b/mm/mmap.c
index 7415a61..3be4678 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2110,7 +2110,7 @@ void exit_mmap(struct mm_struct *mm)
 	tlb = tlb_gather_mmu(mm, 1);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+	end = vma ? unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL) : 0;
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
@@ -2269,6 +2269,14 @@ static void special_mapping_close(struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_CHECKPOINT
+/*
+ * FIX:
+ *   - checkpoint vdso pages (once per distinct vdso is enough)
+ *   - check for compatibility between saved and current vdso
+ *   - accommodate dynamic kernel data in vdso page
+ *
+ * Currently, we require COMPAT_VDSO which somewhat mitigates the issue
+ */
 static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 				      struct vm_area_struct *vma)
 {
@@ -2290,6 +2298,19 @@ static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 
 	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
 }
+
+int special_mapping_restore(struct ckpt_ctx *ctx,
+			    struct mm_struct *mm,
+			    struct ckpt_hdr_vma *h)
+{
+	/*
+	 * FIX:
+	 * Currently, we only handle the VDSO/vsyscall special mapping.
+	 * Even that is very basic: call arch_setup_additional_pages()
+	 * requiring the same mapping (start address) as before.
+	 */
+	return arch_setup_additional_pages(NULL, h->vm_start, 0);
+}
 #endif /* CONFIG_CHECKPOINT */
 
 static struct vm_operations_struct special_mapping_vmops = {
-- 
1.6.0.4
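
A note on the LDT loop in restore_mm_context() above: its comment says
each checkpointed 'struct desc_struct' is converted back to a 'struct
user_desc' by reversing fill_ldt(). The user-space sketch below models
that packing and its inverse. The names model_user_desc, model_desc,
model_pack() and model_unpack() are hypothetical stand-ins with plain
integer fields, not the kernel definitions. The sketch assumes the usual
x86 descriptor split (base in 16/8/8 bits, limit in 16/4 bits) and,
unlike the loop in the patch, also folds the high base byte and the
high limit nibble back in, which a full round trip appears to need.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for 'struct user_desc' (plain fields, no bitfields). */
struct model_user_desc {
	uint32_t base_addr;
	uint32_t limit;			/* 20 significant bits */
	unsigned seg_32bit, contents, read_exec_only;
	unsigned limit_in_pages, seg_not_present, useable;
};

/* Hypothetical stand-in for 'struct desc_struct'. */
struct model_desc {
	uint16_t limit0, base0;		/* low 16 bits of limit and base */
	uint8_t base1, base2;		/* base bits 16..23 and 24..31 */
	unsigned type, p, limit, avl, d, g;	/* 'limit' = limit bits 16..19 */
};

/* Forward direction: same field split that fill_ldt() is assumed to use. */
static struct model_desc model_pack(const struct model_user_desc *info)
{
	struct model_desc d;

	assert(info->contents < 4 && info->read_exec_only < 2);
	d.limit0 = info->limit & 0xffff;
	d.limit = (info->limit >> 16) & 0xf;
	d.base0 = info->base_addr & 0xffff;
	d.base1 = (info->base_addr >> 16) & 0xff;
	d.base2 = (info->base_addr >> 24) & 0xff;
	d.type = ((info->read_exec_only ^ 1) << 1) | (info->contents << 2);
	d.p = info->seg_not_present ^ 1;
	d.avl = info->useable;
	d.d = info->seg_32bit;
	d.g = info->limit_in_pages;
	return d;
}

/* Reverse direction: rebuild the user-visible view from the descriptor. */
static struct model_user_desc model_unpack(const struct model_desc *d)
{
	struct model_user_desc info;

	info.base_addr = d->base0 | ((uint32_t)d->base1 << 16) |
			 ((uint32_t)d->base2 << 24);
	info.limit = d->limit0 | ((uint32_t)d->limit << 16);
	info.seg_32bit = d->d;
	info.contents = d->type >> 2;
	info.read_exec_only = ((d->type >> 1) & 1) ^ 1;
	info.limit_in_pages = d->g;
	info.seg_not_present = d->p ^ 1;
	info.useable = d->avl;
	return info;
}
```

Packing an entry with base 0x12345678 and limit 0xabcde and unpacking
it again should give back the same base and limit. If the base2 and
high-limit terms were dropped from model_unpack(), only the low 24 bits
of the base and the low 16 bits of the limit would survive.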

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 39/80] c/r: add generic '->checkpoint' f_op to ext fses
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (37 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 38/80] c/r: restore " Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 40/80] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
                     ` (42 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

This marks ext[234] as being checkpointable.  Many more filesystems
will need the same change, but this is a start.

Changelog[v1]:
  - [Serge Hallyn] Use filemap_checkpoint() in ext4_file_vm_ops

Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 fs/ext2/dir.c  |    1 +
 fs/ext2/file.c |    2 ++
 fs/ext3/dir.c  |    1 +
 fs/ext3/file.c |    1 +
 fs/ext4/dir.c  |    1 +
 fs/ext4/file.c |    4 ++++
 6 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 6cde970..78e9157 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -722,4 +722,5 @@ const struct file_operations ext2_dir_operations = {
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 2b9e47d..edbc3dc 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -58,6 +58,7 @@ const struct file_operations ext2_file_operations = {
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
@@ -73,6 +74,7 @@ const struct file_operations ext2_xip_file_operations = {
 	.open		= generic_file_open,
 	.release	= ext2_release_file,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 #endif
 
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 373fa90..65f98af 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_operations = {
 #endif
 	.fsync		= ext3_sync_file,	/* BKL held */
 	.release	= ext3_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 5b49704..a421e07 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -126,6 +126,7 @@ const struct file_operations ext3_file_operations = {
 	.fsync		= ext3_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 9dc9316..f69404c 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext4_dir_operations = {
 #endif
 	.fsync		= ext4_sync_file,
 	.release	= ext4_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3f1873f..cdac137 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -133,6 +133,9 @@ force_commit:
 static struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
 	.page_mkwrite   = ext4_page_mkwrite,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= filemap_checkpoint,
+#endif
 };
 
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
@@ -195,6 +198,7 @@ const struct file_operations ext4_file_operations = {
 	.fsync		= ext4_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-- 
1.6.0.4
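
As a rough illustration of what setting '.checkpoint' in these tables
means: the operation is an optional entry in a file's ops table, so a
file type opts in by providing one. The user-space sketch below models
that pattern. All names in it (model_file_ops, model_file,
model_generic_checkpoint, model_checkpoint_file) are hypothetical, and
refusing a file whose table has no hook with -EOPNOTSUPP is an
assumption about how a caller might react, not something this patch
shows.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_file;

/* Hypothetical ops table: a NULL 'checkpoint' means "not checkpointable". */
struct model_file_ops {
	int (*checkpoint)(struct model_file *file);
};

struct model_file {
	const struct model_file_ops *ops;
	int saved;			/* set once the file state was recorded */
};

/* Hypothetical generic hook shared by every file type that opts in. */
static int model_generic_checkpoint(struct model_file *file)
{
	assert(file != NULL);
	file->saved = 1;
	return 0;
}

/* A file type that opts in, as the ext[234] tables do in this patch. */
static const struct model_file_ops model_ext_ops = {
	.checkpoint = model_generic_checkpoint,
};

/* A file type that has not been converted yet. */
static const struct model_file_ops model_other_ops = {
	.checkpoint = NULL,
};

/* Caller side: dispatch through the table, refuse files without a hook. */
static int model_checkpoint_file(struct model_file *file)
{
	if (!file->ops || !file->ops->checkpoint)
		return -EOPNOTSUPP;	/* assumed error code */
	return file->ops->checkpoint(file);
}
```

Keeping the hook optional lets filesystems be converted one at a time,
which matches the "this is a start" wording of the changelog.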

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 40/80] c/r: add generic '->checkpoint()' f_op to simple devices
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (38 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 39/80] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 41/80] Add the checkpoint operation for opened files of generic filesystems Oren Laadan
                     ` (41 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

* /dev/null
* /dev/zero
* /dev/random
* /dev/urandom

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 drivers/char/mem.c    |    2 ++
 drivers/char/random.c |    2 ++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index afa8813..828ba7f 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -799,6 +799,7 @@ static const struct file_operations null_fops = {
 	.read		= read_null,
 	.write		= write_null,
 	.splice_write	= splice_write_null,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 #ifdef CONFIG_DEVPORT
@@ -815,6 +816,7 @@ static const struct file_operations zero_fops = {
 	.read		= read_zero,
 	.write		= write_zero,
 	.mmap		= mmap_zero,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 /*
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 8c74448..211ca70 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1164,6 +1164,7 @@ const struct file_operations random_fops = {
 	.poll  = random_poll,
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct file_operations urandom_fops = {
@@ -1171,6 +1172,7 @@ const struct file_operations urandom_fops = {
 	.write = random_write,
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /***************************************************************
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 41/80] Add the checkpoint operation for opened files of generic filesystems
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (39 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 40/80] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 42/80] c/r: export shmem_getpage() to support shared memory Oren Laadan
                     ` (40 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar,
	Pavel Emelyanov

From: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

These patches extend the use of the generic file checkpoint operation to
non-extX filesystems which have lseek operations that ensure we can save
and restore the files for later use. Note that this does not include
things like FUSE, network filesystems, or pseudo-filesystem kernel
interfaces.
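
The idea in the paragraph above can be pictured with a small userspace
sketch. It is not the kernel implementation: every mock_* name below is
hypothetical, and it assumes only that a generic checkpoint helper needs
to record enough (here, a path and an offset) for the file to be reopened
and repositioned with its seek operation later. An operations table that
leaves the hook unset is treated as "cannot be checkpointed".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace stand-ins; none of these names exist in the kernel. */
struct mock_file;

/* What the sketch saves per open file: where it is and how far into it we are. */
struct mock_ckpt {
	char path[64];
	long pos;
};

struct mock_file_ops {
	long (*llseek)(struct mock_file *f, long off);
	/* NULL means "this kind of file cannot be checkpointed". */
	int (*checkpoint)(const struct mock_file *f, struct mock_ckpt *out);
};

struct mock_file {
	const char *path;
	long pos;
	const struct mock_file_ops *ops;
};

/* Absolute seek only; returns the new offset or -1 on a bad offset. */
long mock_llseek(struct mock_file *f, long off)
{
	if (off < 0)
		return -1;
	f->pos = off;
	return off;
}

/* Generic helper shared by many tables: record the path and the offset. */
int mock_generic_checkpoint(const struct mock_file *f, struct mock_ckpt *out)
{
	snprintf(out->path, sizeof(out->path), "%s", f->path);
	out->pos = f->pos;
	return 0;
}

/* A disk-backed file type opts in by pointing the hook at the helper. */
const struct mock_file_ops mock_disk_fops = {
	.llseek		= mock_llseek,
	.checkpoint	= mock_generic_checkpoint,
};

/* A pseudo-file type leaves the hook unset, so it stays NULL. */
const struct mock_file_ops mock_pseudo_fops = {
	.llseek		= mock_llseek,
};

/* Caller side: refuse files whose table has no checkpoint hook. */
int mock_checkpoint_file(const struct mock_file *f, struct mock_ckpt *out)
{
	if (!f->ops->checkpoint)
		return -1;
	return f->ops->checkpoint(f, out);
}

/* Restore: "reopen" by path, then seek back to the saved offset. */
int mock_restore_file(struct mock_file *f, const struct mock_ckpt *in,
		      const struct mock_file_ops *ops)
{
	assert(ops->llseek != NULL);
	f->path = in->path;
	f->ops = ops;
	f->pos = 0;
	return ops->llseek(f, in->pos) == in->pos ? 0 : -1;
}
```

In this sketch, opting a file type in is a one-line change to its
operations table, which mirrors the shape of the diff below; whether the
saved state is sufficient depends on the seek operation being able to
return to the recorded offset.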

Only compile and boot tested (on x86-32).

[Oren Laadan] Folded patch series into a single patch; original post
included 36 separate patches for individual filesystems:

  [PATCH 01/36] Add the checkpoint operation for affs files and directories.
  [PATCH 02/36] Add the checkpoint operation for befs directories.
  [PATCH 03/36] Add the checkpoint operation for bfs files and directories.
  [PATCH 04/36] Add the checkpoint operation for btrfs files and directories.
  [PATCH 05/36] Add the checkpoint operation for cramfs directories.
  [PATCH 06/36] Add the checkpoint operation for ecryptfs files and directories.
  [PATCH 07/36] Add the checkpoint operation for fat files and directories.
  [PATCH 08/36] Add the checkpoint operation for freevxfs directories.
  [PATCH 09/36] Add the checkpoint operation for hfs files and directories.
  [PATCH 10/36] Add the checkpoint operation for hfsplus files and directories.
  [PATCH 11/36] Add the checkpoint operation for hpfs files and directories.
  [PATCH 12/36] Add the checkpoint operation for hppfs files and directories.
  [PATCH 13/36] Add the checkpoint operation for iso directories.
  [PATCH 14/36] Add the checkpoint operation for jffs2 files and directories.
  [PATCH 15/36] Add the checkpoint operation for jfs files and directories.
  [PATCH 16/36] Add the checkpoint operation for regular nfs files and directories. Skip the various /proc files for now.
  [PATCH 17/36] Add the checkpoint operation for ntfs directories.
  [PATCH 18/36] Add the checkpoint operation for openpromfs directories. Explicitly skip the properties for now.
  [PATCH 19/36] Add the checkpoint operation for qnx4 files and directories.
  [PATCH 20/36] Add the checkpoint operation for reiserfs files and directories.
  [PATCH 21/36] Add the checkpoint operation for romfs directories.
  [PATCH 22/36] Add the checkpoint operation for squashfs directories.
  [PATCH 23/36] Add the checkpoint operation for sysv filesystem files and directories.
  [PATCH 24/36] Add the checkpoint operation for ubifs files and directories.
  [PATCH 25/36] Add the checkpoint operation for udf filesystem files and directories.
  [PATCH 26/36] Add the checkpoint operation for xfs files and directories.
  [PATCH 27/36] Add checkpoint operation for efs directories.
  [PATCH 28/36] Add the checkpoint operation for generic, read-only files. At present, some/all files of the following filesystems use this generic definition:
  [PATCH 29/36] Add checkpoint operation for minix filesystem files and directories.
  [PATCH 30/36] Add checkpoint operations for omfs files and directories.
  [PATCH 31/36] Add checkpoint operations for ufs files and directories.
  [PATCH 32/36] Add checkpoint operations for ramfs files. NOTE: since simple_dir_operations are shared between multiple filesystems including ramfs, it's not currently possible to checkpoint open ramfs directories.
  [PATCH 33/36] Add the checkpoint operation for adfs files and directories.
  [PATCH 34/36] Add the checkpoint operation to exofs files and directories.
  [PATCH 35/36] Add the checkpoint operation to nilfs2 files and directories.
  [PATCH 36/36] Add checkpoint operations for UML host filesystem files and directories.
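
The ramfs note in 32/36 comes down to table sharing, which a short
userspace sketch may make concrete. The names are hypothetical
(mock_shared_dir_ops stands in for a table shared the way
simple_dir_operations is, and "otherfs" is an invented second user);
the only assumption is that a hook lives in the table rather than in
the filesystem that points at it.

```c
#include <stddef.h>

/* Hypothetical stand-in for a directory operations table. */
struct mock_dir_ops {
	int (*checkpoint)(void);
};

/* One table shared by several filesystems; the hook is left unset. */
const struct mock_dir_ops mock_shared_dir_ops = {
	.checkpoint = NULL,
};

struct mock_fs {
	const char *name;
	const struct mock_dir_ops *dir_ops;
};

/* Both filesystems point at the same table, so they share its hooks. */
const struct mock_fs mock_ramfs   = { "ramfs",   &mock_shared_dir_ops };
const struct mock_fs mock_otherfs = { "otherfs", &mock_shared_dir_ops };

/* Returns 1 if open directories of this filesystem could be checkpointed. */
int mock_can_checkpoint_dir(const struct mock_fs *fs)
{
	return fs->dir_ops->checkpoint != NULL;
}
```

Setting the hook in the shared table would enable it for every
filesystem that uses the table, not just ramfs, so in this model the
only per-filesystem alternative is a private copy of the table.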

Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 fs/adfs/dir.c               |    1 +
 fs/adfs/file.c              |    1 +
 fs/affs/dir.c               |    1 +
 fs/affs/file.c              |    1 +
 fs/befs/linuxvfs.c          |    1 +
 fs/bfs/dir.c                |    1 +
 fs/bfs/file.c               |    1 +
 fs/btrfs/file.c             |    1 +
 fs/btrfs/inode.c            |    1 +
 fs/btrfs/super.c            |    1 +
 fs/cramfs/inode.c           |    1 +
 fs/ecryptfs/file.c          |    2 ++
 fs/ecryptfs/miscdev.c       |    1 +
 fs/efs/dir.c                |    1 +
 fs/exofs/dir.c              |    1 +
 fs/exofs/file.c             |    1 +
 fs/fat/dir.c                |    1 +
 fs/fat/file.c               |    1 +
 fs/freevxfs/vxfs_lookup.c   |    1 +
 fs/hfs/dir.c                |    1 +
 fs/hfs/inode.c              |    1 +
 fs/hfsplus/dir.c            |    1 +
 fs/hfsplus/inode.c          |    1 +
 fs/hostfs/hostfs_kern.c     |    2 ++
 fs/hpfs/dir.c               |    1 +
 fs/hpfs/file.c              |    1 +
 fs/hppfs/hppfs.c            |    2 ++
 fs/isofs/dir.c              |    1 +
 fs/jffs2/dir.c              |    1 +
 fs/jffs2/file.c             |    1 +
 fs/jfs/file.c               |    1 +
 fs/jfs/namei.c              |    1 +
 fs/minix/dir.c              |    1 +
 fs/minix/file.c             |    1 +
 fs/nfs/dir.c                |    1 +
 fs/nfs/file.c               |    1 +
 fs/nilfs2/dir.c             |    2 +-
 fs/nilfs2/file.c            |    1 +
 fs/ntfs/dir.c               |    1 +
 fs/ntfs/file.c              |    3 ++-
 fs/omfs/dir.c               |    1 +
 fs/omfs/file.c              |    1 +
 fs/openpromfs/inode.c       |    2 ++
 fs/qnx4/dir.c               |    1 +
 fs/qnx4/file.c              |    1 +
 fs/ramfs/file-mmu.c         |    1 +
 fs/ramfs/file-nommu.c       |    1 +
 fs/read_write.c             |    1 +
 fs/reiserfs/dir.c           |    1 +
 fs/reiserfs/file.c          |    1 +
 fs/romfs/mmap-nommu.c       |    1 +
 fs/romfs/super.c            |    1 +
 fs/squashfs/dir.c           |    3 ++-
 fs/sysv/dir.c               |    1 +
 fs/sysv/file.c              |    1 +
 fs/ubifs/debug.c            |    1 +
 fs/ubifs/dir.c              |    1 +
 fs/ubifs/file.c             |    1 +
 fs/udf/dir.c                |    1 +
 fs/udf/file.c               |    1 +
 fs/ufs/dir.c                |    1 +
 fs/ufs/file.c               |    1 +
 fs/xfs/linux-2.6/xfs_file.c |    2 ++
 63 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/fs/adfs/dir.c b/fs/adfs/dir.c
index 23aa52f..7106f32 100644
--- a/fs/adfs/dir.c
+++ b/fs/adfs/dir.c
@@ -198,6 +198,7 @@ const struct file_operations adfs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.readdir	= adfs_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int
diff --git a/fs/adfs/file.c b/fs/adfs/file.c
index 005ea34..97bd298 100644
--- a/fs/adfs/file.c
+++ b/fs/adfs/file.c
@@ -30,6 +30,7 @@ const struct file_operations adfs_file_operations = {
 	.write		= do_sync_write,
 	.aio_write	= generic_file_aio_write,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations adfs_file_inode_operations = {
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index 8ca8f3a..6cc5e43 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -22,6 +22,7 @@ const struct file_operations affs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.readdir	= affs_readdir,
 	.fsync		= affs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 /*
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 184e55c..d580a12 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -36,6 +36,7 @@ const struct file_operations affs_file_operations = {
 	.release	= affs_file_release,
 	.fsync		= affs_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations affs_file_inode_operations = {
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 615d549..6c46cb8 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -67,6 +67,7 @@ static const struct file_operations befs_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= befs_readdir,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations befs_dir_inode_operations = {
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index 1e41aad..d78015e 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -80,6 +80,7 @@ const struct file_operations bfs_dir_operations = {
 	.readdir	= bfs_readdir,
 	.fsync		= simple_fsync,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 extern void dump_imap(const char *, struct super_block *);
diff --git a/fs/bfs/file.c b/fs/bfs/file.c
index 88b9a3f..7f61ed6 100644
--- a/fs/bfs/file.c
+++ b/fs/bfs/file.c
@@ -29,6 +29,7 @@ const struct file_operations bfs_file_operations = {
 	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int bfs_move_block(unsigned long from, unsigned long to,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4b83397..6425f19 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1229,4 +1229,5 @@ struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 59cba18..a13b1b7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5233,6 +5233,7 @@ static struct file_operations btrfs_dir_file_operations = {
 #endif
 	.release        = btrfs_release_file,
 	.fsync		= btrfs_sync_file,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct extent_io_ops btrfs_extent_io_ops = {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 6d6d06c..58569ba 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -694,6 +694,7 @@ static const struct file_operations btrfs_ctl_fops = {
 	.unlocked_ioctl	 = btrfs_control_ioctl,
 	.compat_ioctl = btrfs_control_ioctl,
 	.owner	 = THIS_MODULE,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static struct miscdevice btrfs_misc = {
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index dd3634e..0927503 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -532,6 +532,7 @@ static const struct file_operations cramfs_directory_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= cramfs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations cramfs_dir_inode_operations = {
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index 9e94405..e6d2523 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -306,6 +306,7 @@ const struct file_operations ecryptfs_dir_fops = {
 	.fsync = ecryptfs_fsync,
 	.fasync = ecryptfs_fasync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct file_operations ecryptfs_main_fops = {
@@ -323,6 +324,7 @@ const struct file_operations ecryptfs_main_fops = {
 	.fsync = ecryptfs_fsync,
 	.fasync = ecryptfs_fasync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static int
diff --git a/fs/ecryptfs/miscdev.c b/fs/ecryptfs/miscdev.c
index 4ec8f61..9fd9b39 100644
--- a/fs/ecryptfs/miscdev.c
+++ b/fs/ecryptfs/miscdev.c
@@ -481,6 +481,7 @@ static const struct file_operations ecryptfs_miscdev_fops = {
 	.read    = ecryptfs_miscdev_read,
 	.write   = ecryptfs_miscdev_write,
 	.release = ecryptfs_miscdev_release,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static struct miscdevice ecryptfs_miscdev = {
diff --git a/fs/efs/dir.c b/fs/efs/dir.c
index 7ee6f7e..da344b8 100644
--- a/fs/efs/dir.c
+++ b/fs/efs/dir.c
@@ -13,6 +13,7 @@ const struct file_operations efs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= efs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations efs_dir_inode_operations = {
diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 4cfab1c..f6693d3 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -667,4 +667,5 @@ const struct file_operations exofs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= exofs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
index 839b9dc..257e9da 100644
--- a/fs/exofs/file.c
+++ b/fs/exofs/file.c
@@ -73,6 +73,7 @@ static int exofs_flush(struct file *file, fl_owner_t id)
 
 const struct file_operations exofs_file_operations = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index 530b4ca..e3fa353 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -841,6 +841,7 @@ const struct file_operations fat_dir_operations = {
 	.compat_ioctl	= fat_compat_dir_ioctl,
 #endif
 	.fsync		= fat_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int fat_get_short_entry(struct inode *dir, loff_t *pos,
diff --git a/fs/fat/file.c b/fs/fat/file.c
index f042b96..56351c2 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -162,6 +162,7 @@ const struct file_operations fat_file_operations = {
 	.ioctl		= fat_generic_ioctl,
 	.fsync		= fat_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int fat_cont_expand(struct inode *inode, loff_t size)
diff --git a/fs/freevxfs/vxfs_lookup.c b/fs/freevxfs/vxfs_lookup.c
index aee049c..3a09132 100644
--- a/fs/freevxfs/vxfs_lookup.c
+++ b/fs/freevxfs/vxfs_lookup.c
@@ -58,6 +58,7 @@ const struct inode_operations vxfs_dir_inode_ops = {
 
 const struct file_operations vxfs_dir_operations = {
 	.readdir =		vxfs_readdir,
+	.checkpoint =		generic_file_checkpoint,
 };
 
  
diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c
index 7c69b98..8d90a24 100644
--- a/fs/hfs/dir.c
+++ b/fs/hfs/dir.c
@@ -318,6 +318,7 @@ const struct file_operations hfs_dir_operations = {
 	.readdir	= hfs_readdir,
 	.llseek		= generic_file_llseek,
 	.release	= hfs_dir_release,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations hfs_dir_inode_operations = {
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index a1cbff2..bf8950f 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -607,6 +607,7 @@ static const struct file_operations hfs_file_operations = {
 	.fsync		= file_fsync,
 	.open		= hfs_file_open,
 	.release	= hfs_file_release,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations hfs_file_inode_operations = {
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 5f40236..41fbf2d 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -497,4 +497,5 @@ const struct file_operations hfsplus_dir_operations = {
 	.ioctl          = hfsplus_ioctl,
 	.llseek		= generic_file_llseek,
 	.release	= hfsplus_dir_release,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 1bcf597..19abd7e 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -286,6 +286,7 @@ static const struct file_operations hfsplus_file_operations = {
 	.open		= hfsplus_file_open,
 	.release	= hfsplus_file_release,
 	.ioctl          = hfsplus_ioctl,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 struct inode *hfsplus_new_inode(struct super_block *sb, int mode)
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 032604e..67e2356 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -417,6 +417,7 @@ int hostfs_fsync(struct file *file, struct dentry *dentry, int datasync)
 
 static const struct file_operations hostfs_file_fops = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.splice_read	= generic_file_splice_read,
 	.aio_read	= generic_file_aio_read,
@@ -430,6 +431,7 @@ static const struct file_operations hostfs_file_fops = {
 
 static const struct file_operations hostfs_dir_fops = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.readdir	= hostfs_readdir,
 	.read		= generic_read_dir,
 };
diff --git a/fs/hpfs/dir.c b/fs/hpfs/dir.c
index 8865c94..dcde10f 100644
--- a/fs/hpfs/dir.c
+++ b/fs/hpfs/dir.c
@@ -322,4 +322,5 @@ const struct file_operations hpfs_dir_ops =
 	.readdir	= hpfs_readdir,
 	.release	= hpfs_dir_release,
 	.fsync		= hpfs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 3efabff..f1211f0 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -139,6 +139,7 @@ const struct file_operations hpfs_file_ops =
 	.release	= hpfs_file_release,
 	.fsync		= hpfs_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations hpfs_file_iops =
diff --git a/fs/hppfs/hppfs.c b/fs/hppfs/hppfs.c
index a5089a6..f132fa2 100644
--- a/fs/hppfs/hppfs.c
+++ b/fs/hppfs/hppfs.c
@@ -546,6 +546,7 @@ static const struct file_operations hppfs_file_fops = {
 	.read		= hppfs_read,
 	.write		= hppfs_write,
 	.open		= hppfs_open,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 struct hppfs_dirent {
@@ -597,6 +598,7 @@ static const struct file_operations hppfs_dir_fops = {
 	.readdir	= hppfs_readdir,
 	.open		= hppfs_dir_open,
 	.fsync		= hppfs_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int hppfs_statfs(struct dentry *dentry, struct kstatfs *sf)
diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c
index 8ba5441..848059d 100644
--- a/fs/isofs/dir.c
+++ b/fs/isofs/dir.c
@@ -273,6 +273,7 @@ const struct file_operations isofs_dir_operations =
 {
 	.read = generic_read_dir,
 	.readdir = isofs_readdir,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /*
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 6f60cc9..c2b6487 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -41,6 +41,7 @@ const struct file_operations jffs2_dir_operations =
 	.unlocked_ioctl=jffs2_ioctl,
 	.fsync =	jffs2_fsync,
 	.llseek =	generic_file_llseek,
+	.checkpoint =	generic_file_checkpoint,
 };
 
 
diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c
index 23c9475..7e5d2e2 100644
--- a/fs/jffs2/file.c
+++ b/fs/jffs2/file.c
@@ -50,6 +50,7 @@ const struct file_operations jffs2_file_operations =
 	.mmap =		generic_file_readonly_mmap,
 	.fsync =	jffs2_fsync,
 	.splice_read =	generic_file_splice_read,
+	.checkpoint =	generic_file_checkpoint,
 };
 
 /* jffs2_file_inode_operations */
diff --git a/fs/jfs/file.c b/fs/jfs/file.c
index 7f6063a..90ab090 100644
--- a/fs/jfs/file.c
+++ b/fs/jfs/file.c
@@ -116,4 +116,5 @@ const struct file_operations jfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= jfs_compat_ioctl,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 514ee2e..5486db9 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -1556,6 +1556,7 @@ const struct file_operations jfs_dir_operations = {
 	.compat_ioctl	= jfs_compat_ioctl,
 #endif
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int jfs_ci_hash(struct dentry *dir, struct qstr *this)
diff --git a/fs/minix/dir.c b/fs/minix/dir.c
index d407e7a..9cf04af 100644
--- a/fs/minix/dir.c
+++ b/fs/minix/dir.c
@@ -23,6 +23,7 @@ const struct file_operations minix_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= minix_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static inline void dir_put_page(struct page *page)
diff --git a/fs/minix/file.c b/fs/minix/file.c
index 3eec3e6..2048d09 100644
--- a/fs/minix/file.c
+++ b/fs/minix/file.c
@@ -21,6 +21,7 @@ const struct file_operations minix_file_operations = {
 	.mmap		= generic_file_mmap,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations minix_file_inode_operations = {
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 32062c3..43b9025 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -63,6 +63,7 @@ const struct file_operations nfs_dir_operations = {
 	.open		= nfs_opendir,
 	.release	= nfs_release,
 	.fsync		= nfs_fsync_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations nfs_dir_inode_operations = {
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 0506232..813fd8d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -78,6 +78,7 @@ const struct file_operations nfs_file_operations = {
 	.splice_write	= nfs_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= nfs_setlease,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations nfs_file_inode_operations = {
diff --git a/fs/nilfs2/dir.c b/fs/nilfs2/dir.c
index 1a4fa04..afa0f80 100644
--- a/fs/nilfs2/dir.c
+++ b/fs/nilfs2/dir.c
@@ -706,5 +706,5 @@ struct file_operations nilfs_dir_operations = {
 	.compat_ioctl	= nilfs_ioctl,
 #endif	/* CONFIG_COMPAT */
 	.fsync		= nilfs_sync_file,
-
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 6bd84a0..0f27ab5 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -136,6 +136,7 @@ static int nilfs_file_mmap(struct file *file, struct vm_area_struct *vma)
  */
 struct file_operations nilfs_file_operations = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
diff --git a/fs/ntfs/dir.c b/fs/ntfs/dir.c
index 5a9e344..4fe3759 100644
--- a/fs/ntfs/dir.c
+++ b/fs/ntfs/dir.c
@@ -1572,4 +1572,5 @@ const struct file_operations ntfs_dir_ops = {
 	/*.ioctl	= ,*/			/* Perform function on the
 						   mounted filesystem. */
 	.open		= ntfs_dir_open,	/* Open directory. */
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 3140a44..3f75c5e 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2272,7 +2272,7 @@ const struct file_operations ntfs_file_ops = {
 						    mounted filesystem. */
 	.mmap		= generic_file_mmap,	 /* Mmap file. */
 	.open		= ntfs_file_open,	 /* Open file. */
-	.splice_read	= generic_file_splice_read /* Zero-copy data send with
+	.splice_read	= generic_file_splice_read, /* Zero-copy data send with
 						    the data source being on
 						    the ntfs partition.  We do
 						    not need to care about the
@@ -2282,6 +2282,7 @@ const struct file_operations ntfs_file_ops = {
 						    on the ntfs partition.  We
 						    do not need to care about
 						    the data source. */
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ntfs_file_inode_ops = {
diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c
index c7275cf..5ed9cb3 100644
--- a/fs/omfs/dir.c
+++ b/fs/omfs/dir.c
@@ -502,4 +502,5 @@ struct file_operations omfs_dir_operations = {
 	.read = generic_read_dir,
 	.readdir = omfs_readdir,
 	.llseek = generic_file_llseek,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/omfs/file.c b/fs/omfs/file.c
index d17e774..c85fb31 100644
--- a/fs/omfs/file.c
+++ b/fs/omfs/file.c
@@ -331,6 +331,7 @@ struct file_operations omfs_file_operations = {
 	.mmap = generic_file_mmap,
 	.fsync = simple_fsync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 struct inode_operations omfs_file_inops = {
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index ffcd04f..d1f0677 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -160,6 +160,7 @@ static const struct file_operations openpromfs_prop_ops = {
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= seq_release,
+	.checkpoint	= NULL,
 };
 
 static int openpromfs_readdir(struct file *, void *, filldir_t);
@@ -168,6 +169,7 @@ static const struct file_operations openprom_operations = {
 	.read		= generic_read_dir,
 	.readdir	= openpromfs_readdir,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct dentry *openpromfs_lookup(struct inode *, struct dentry *, struct nameidata *);
diff --git a/fs/qnx4/dir.c b/fs/qnx4/dir.c
index 003c68f..ca99e01 100644
--- a/fs/qnx4/dir.c
+++ b/fs/qnx4/dir.c
@@ -80,6 +80,7 @@ const struct file_operations qnx4_dir_operations =
 	.read		= generic_read_dir,
 	.readdir	= qnx4_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations qnx4_dir_inode_operations =
diff --git a/fs/qnx4/file.c b/fs/qnx4/file.c
index 09b170a..8aaa882 100644
--- a/fs/qnx4/file.c
+++ b/fs/qnx4/file.c
@@ -30,6 +30,7 @@ const struct file_operations qnx4_file_operations =
 	.aio_write	= generic_file_aio_write,
 	.fsync		= simple_fsync,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations qnx4_file_inode_operations =
diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 78f613c..4430239 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -47,6 +47,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index 11f0c06..d645bb3 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -45,6 +45,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read		= generic_file_splice_read,
 	.splice_write		= generic_file_splice_write,
 	.llseek			= generic_file_llseek,
+	.checkpoint		= generic_file_checkpoint,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/read_write.c b/fs/read_write.c
index d331975..d314234 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -27,6 +27,7 @@ const struct file_operations generic_ro_fops = {
 	.aio_read	= generic_file_aio_read,
 	.mmap		= generic_file_readonly_mmap,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 EXPORT_SYMBOL(generic_ro_fops);
diff --git a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c
index 6d2668f..40ce3fa 100644
--- a/fs/reiserfs/dir.c
+++ b/fs/reiserfs/dir.c
@@ -24,6 +24,7 @@ const struct file_operations reiserfs_dir_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = reiserfs_compat_ioctl,
 #endif
+	.checkpoint = generic_file_checkpoint,
 };
 
 static int reiserfs_dir_fsync(struct file *filp, struct dentry *dentry,
diff --git a/fs/reiserfs/file.c b/fs/reiserfs/file.c
index 9f43666..a5dcc02 100644
--- a/fs/reiserfs/file.c
+++ b/fs/reiserfs/file.c
@@ -297,6 +297,7 @@ const struct file_operations reiserfs_file_operations = {
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
 	.llseek = generic_file_llseek,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations reiserfs_file_inode_operations = {
diff --git a/fs/romfs/mmap-nommu.c b/fs/romfs/mmap-nommu.c
index f0511e8..03c24d9 100644
--- a/fs/romfs/mmap-nommu.c
+++ b/fs/romfs/mmap-nommu.c
@@ -72,4 +72,5 @@ const struct file_operations romfs_ro_fops = {
 	.splice_read		= generic_file_splice_read,
 	.mmap			= romfs_mmap,
 	.get_unmapped_area	= romfs_get_unmapped_area,
+	.checkpoint		= generic_file_checkpoint,
 };
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 4ab3c03..7b6e951 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -282,6 +282,7 @@ error:
 static const struct file_operations romfs_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= romfs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct inode_operations romfs_dir_inode_operations = {
diff --git a/fs/squashfs/dir.c b/fs/squashfs/dir.c
index 566b0ea..b0c5336 100644
--- a/fs/squashfs/dir.c
+++ b/fs/squashfs/dir.c
@@ -231,5 +231,6 @@ failed_read:
 
 const struct file_operations squashfs_dir_ops = {
 	.read = generic_read_dir,
-	.readdir = squashfs_readdir
+	.readdir = squashfs_readdir,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c
index 4e50286..53acd29 100644
--- a/fs/sysv/dir.c
+++ b/fs/sysv/dir.c
@@ -25,6 +25,7 @@ const struct file_operations sysv_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= sysv_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static inline void dir_put_page(struct page *page)
diff --git a/fs/sysv/file.c b/fs/sysv/file.c
index 96340c0..aee556d 100644
--- a/fs/sysv/file.c
+++ b/fs/sysv/file.c
@@ -28,6 +28,7 @@ const struct file_operations sysv_file_operations = {
 	.mmap		= generic_file_mmap,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations sysv_file_inode_operations = {
diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c
index ce2cd83..69a8892 100644
--- a/fs/ubifs/debug.c
+++ b/fs/ubifs/debug.c
@@ -2530,6 +2530,7 @@ static ssize_t write_debugfs_file(struct file *file, const char __user *buf,
 static const struct file_operations dfs_fops = {
 	.open = open_debugfs_file,
 	.write = write_debugfs_file,
+	.checkpoint = generic_file_checkpoint,
 	.owner = THIS_MODULE,
 };
 
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 552fb01..89ab2aa 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -1228,4 +1228,5 @@ const struct file_operations ubifs_dir_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ubifs_compat_ioctl,
 #endif
+	.checkpoint     = generic_file_checkpoint,
 };
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 6d34dc7..2d94676 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1595,4 +1595,5 @@ const struct file_operations ubifs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ubifs_compat_ioctl,
 #endif
+	.checkpoint     = generic_file_checkpoint,
 };
diff --git a/fs/udf/dir.c b/fs/udf/dir.c
index 61d9a76..6586dbe 100644
--- a/fs/udf/dir.c
+++ b/fs/udf/dir.c
@@ -211,4 +211,5 @@ const struct file_operations udf_dir_operations = {
 	.readdir		= udf_readdir,
 	.ioctl			= udf_ioctl,
 	.fsync			= simple_fsync,
+	.checkpoint		= generic_file_checkpoint,
 };
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 7464305..33c63a8 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -212,6 +212,7 @@ const struct file_operations udf_file_operations = {
 	.fsync			= simple_fsync,
 	.splice_read		= generic_file_splice_read,
 	.llseek			= generic_file_llseek,
+	.checkpoint		= generic_file_checkpoint,
 };
 
 const struct inode_operations udf_file_inode_operations = {
diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index 6f671f1..9379010 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -668,4 +668,5 @@ const struct file_operations ufs_dir_operations = {
 	.readdir	= ufs_readdir,
 	.fsync		= simple_fsync,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ufs/file.c b/fs/ufs/file.c
index 73655c6..15c8616 100644
--- a/fs/ufs/file.c
+++ b/fs/ufs/file.c
@@ -43,4 +43,5 @@ const struct file_operations ufs_file_operations = {
 	.open           = generic_file_open,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index 0542fd5..2b4bdb6 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -257,6 +257,7 @@ const struct file_operations xfs_file_operations = {
 #ifdef HAVE_FOP_OPEN_EXEC
 	.open_exec	= xfs_file_open_exec,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct file_operations xfs_dir_file_operations = {
@@ -269,6 +270,7 @@ const struct file_operations xfs_dir_file_operations = {
 	.compat_ioctl	= xfs_file_compat_ioctl,
 #endif
 	.fsync		= xfs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct vm_operations_struct xfs_file_vm_ops = {
-- 
1.6.0.4


* [PATCH v18 41/80] Add the checkpoint operation for opened files of generic filesystems
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Matt Helsley,
	linux-fsdevel

From: Matt Helsley <matthltc@us.ibm.com>

These patches extend the use of the generic file checkpoint operation to
non-extX filesystems whose lseek operations ensure that we can save and
restore the files for later use. Note that this does not include
things like FUSE, network filesystems, or pseudo-filesystem kernel
interfaces.

Only compile and boot tested (on x86-32).
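
For illustration only, the pattern this patch applies to each table can
be modelled in user space roughly as follows. This is a sketch, not the
kernel code: struct sketch_file_ops, sketch_generic_file_checkpoint()
and sketch_do_checkpoint() are hypothetical stand-ins, and returning
-EBADF when a table has no checkpoint hook is an assumption rather than
something shown in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct file and struct file_operations. */
struct sketch_file;

struct sketch_file_ops {
	long (*llseek)(struct sketch_file *file, long offset);
	int (*checkpoint)(struct sketch_file *file);	/* optional hook */
};

struct sketch_file {
	const struct sketch_file_ops *f_op;
	long f_pos;		/* current file position */
	long saved_pos;		/* where the sketch records the position */
};

static long sketch_llseek(struct sketch_file *file, long offset)
{
	file->f_pos = offset;
	return offset;
}

/* Stand-in for the generic handler: record what a restart would need. */
static int sketch_generic_file_checkpoint(struct sketch_file *file)
{
	file->saved_pos = file->f_pos;
	return 0;
}

/* Opted in, like most tables touched by the diff. */
static const struct sketch_file_ops sketch_dir_ops = {
	.llseek		= sketch_llseek,
	.checkpoint	= sketch_generic_file_checkpoint,
};

/* Opted out, like openpromfs_prop_ops (.checkpoint = NULL). */
static const struct sketch_file_ops sketch_prop_ops = {
	.llseek		= sketch_llseek,
	.checkpoint	= NULL,
};

/* Assumed dispatch: refuse files whose table has no checkpoint hook. */
static int sketch_do_checkpoint(struct sketch_file *file)
{
	if (!file->f_op->checkpoint)
		return -EBADF;	/* assumed error code */
	return file->f_op->checkpoint(file);
}

/* Usage: one file that supports checkpoint and one that does not. */
static void sketch_demo(void)
{
	struct sketch_file dir = { .f_op = &sketch_dir_ops };
	struct sketch_file prop = { .f_op = &sketch_prop_ops };

	sketch_llseek(&dir, 42);
	assert(sketch_do_checkpoint(&dir) == 0 && dir.saved_pos == 42);
	assert(sketch_do_checkpoint(&prop) == -EBADF);
}
```

In this model a one-line designated initializer is all a filesystem
needs in order to opt in, which matches the shape of the diff below.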

[Oren Laadan] Folded patch series into a single patch; original post
included 36 separate patches for individual filesystems:

  [PATCH 01/36] Add the checkpoint operation for affs files and directories.
  [PATCH 02/36] Add the checkpoint operation for befs directories.
  [PATCH 03/36] Add the checkpoint operation for bfs files and directories.
  [PATCH 04/36] Add the checkpoint operation for btrfs files and directories.
  [PATCH 05/36] Add the checkpoint operation for cramfs directories.
  [PATCH 06/36] Add the checkpoint operation for ecryptfs files and directories.
  [PATCH 07/36] Add the checkpoint operation for fat files and directories.
  [PATCH 08/36] Add the checkpoint operation for freevxfs directories.
  [PATCH 09/36] Add the checkpoint operation for hfs files and directories.
  [PATCH 10/36] Add the checkpoint operation for hfsplus files and directories.
  [PATCH 11/36] Add the checkpoint operation for hpfs files and directories.
  [PATCH 12/36] Add the checkpoint operation for hppfs files and directories.
  [PATCH 13/36] Add the checkpoint operation for iso directories.
  [PATCH 14/36] Add the checkpoint operation for jffs2 files and directories.
  [PATCH 15/36] Add the checkpoint operation for jfs files and directories.
  [PATCH 16/36] Add the checkpoint operation for regular nfs files and directories. Skip the various /proc files for now.
  [PATCH 17/36] Add the checkpoint operation for ntfs directories.
  [PATCH 18/36] Add the checkpoint operation for openpromfs directories. Explicitly skip the properties for now.
  [PATCH 19/36] Add the checkpoint operation for qnx4 files and directories.
  [PATCH 20/36] Add the checkpoint operation for reiserfs files and directories.
  [PATCH 21/36] Add the checkpoint operation for romfs directories.
  [PATCH 22/36] Add the checkpoint operation for squashfs directories.
  [PATCH 23/36] Add the checkpoint operation for sysv filesystem files and directories.
  [PATCH 24/36] Add the checkpoint operation for ubifs files and directories.
  [PATCH 25/36] Add the checkpoint operation for udf filesystem files and directories.
  [PATCH 26/36] Add the checkpoint operation for xfs files and directories.
  [PATCH 27/36] Add checkpoint operation for efs directories.
  [PATCH 28/36] Add the checkpoint operation for generic, read-only files. At present, some/all files of the following filesystems use this generic definition:
  [PATCH 29/36] Add checkpoint operation for minix filesystem files and directories.
  [PATCH 30/36] Add checkpoint operations for omfs files and directories.
  [PATCH 31/36] Add checkpoint operations for ufs files and directories.
  [PATCH 32/36] Add checkpoint operations for ramfs files. NOTE: since simple_dir_operations are shared between multiple filesystems including ramfs, it's not currently possible to checkpoint open ramfs directories.
  [PATCH 33/36] Add the checkpoint operation for adfs files and directories.
  [PATCH 34/36] Add the checkpoint operation to exofs files and directories.
  [PATCH 35/36] Add the checkpoint operation to nilfs2 files and directories.
  [PATCH 36/36] Add checkpoint operations for UML host filesystem files and directories.
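
One reading of the ramfs NOTE in 32/36 above is that a hook set in a
shared operations table applies to every filesystem using that table,
so it cannot be enabled for ramfs directories alone. The user-space
sketch below models only that sharing; struct sketch_dir_ops and the
other sketch_* names are hypothetical, and the interpretation itself
is an assumption rather than something stated in the patch.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_dir;

struct sketch_dir_ops {
	int (*checkpoint)(struct sketch_dir *dir);	/* NULL: unsupported */
};

struct sketch_dir {
	const char *fs_name;	/* owning filesystem, for illustration only */
	const struct sketch_dir_ops *f_op;
};

/* One table shared by several filesystems, as with simple_dir_operations. */
static const struct sketch_dir_ops sketch_shared_dir_ops = {
	.checkpoint = NULL,
};

static int sketch_dir_supports_checkpoint(const struct sketch_dir *dir)
{
	return dir->f_op->checkpoint != NULL;
}

static void sketch_shared_demo(void)
{
	struct sketch_dir ram = { "ramfs", &sketch_shared_dir_ops };
	struct sketch_dir other = { "another-fs", &sketch_shared_dir_ops };

	/* Same table, so the answer cannot differ per filesystem. */
	assert(ram.f_op == other.f_op);
	assert(sketch_dir_supports_checkpoint(&ram) ==
	       sketch_dir_supports_checkpoint(&other));
}
```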

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/adfs/dir.c               |    1 +
 fs/adfs/file.c              |    1 +
 fs/affs/dir.c               |    1 +
 fs/affs/file.c              |    1 +
 fs/befs/linuxvfs.c          |    1 +
 fs/bfs/dir.c                |    1 +
 fs/bfs/file.c               |    1 +
 fs/btrfs/file.c             |    1 +
 fs/btrfs/inode.c            |    1 +
 fs/btrfs/super.c            |    1 +
 fs/cramfs/inode.c           |    1 +
 fs/ecryptfs/file.c          |    2 ++
 fs/ecryptfs/miscdev.c       |    1 +
 fs/efs/dir.c                |    1 +
 fs/exofs/dir.c              |    1 +
 fs/exofs/file.c             |    1 +
 fs/fat/dir.c                |    1 +
 fs/fat/file.c               |    1 +
 fs/freevxfs/vxfs_lookup.c   |    1 +
 fs/hfs/dir.c                |    1 +
 fs/hfs/inode.c              |    1 +
 fs/hfsplus/dir.c            |    1 +
 fs/hfsplus/inode.c          |    1 +
 fs/hostfs/hostfs_kern.c     |    2 ++
 fs/hpfs/dir.c               |    1 +
 fs/hpfs/file.c              |    1 +
 fs/hppfs/hppfs.c            |    2 ++
 fs/isofs/dir.c              |    1 +
 fs/jffs2/dir.c              |    1 +
 fs/jffs2/file.c             |    1 +
 fs/jfs/file.c               |    1 +
 fs/jfs/namei.c              |    1 +
 fs/minix/dir.c              |    1 +
 fs/minix/file.c             |    1 +
 fs/nfs/dir.c                |    1 +
 fs/nfs/file.c               |    1 +
 fs/nilfs2/dir.c             |    2 +-
 fs/nilfs2/file.c            |    1 +
 fs/ntfs/dir.c               |    1 +
 fs/ntfs/file.c              |    3 ++-
 fs/omfs/dir.c               |    1 +
 fs/omfs/file.c              |    1 +
 fs/openpromfs/inode.c       |    2 ++
 fs/qnx4/dir.c               |    1 +
 fs/qnx4/file.c              |    1 +
 fs/ramfs/file-mmu.c         |    1 +
 fs/ramfs/file-nommu.c       |    1 +
 fs/read_write.c             |    1 +
 fs/reiserfs/dir.c           |    1 +
 fs/reiserfs/file.c          |    1 +
 fs/romfs/mmap-nommu.c       |    1 +
 fs/romfs/super.c            |    1 +
 fs/squashfs/dir.c           |    3 ++-
 fs/sysv/dir.c               |    1 +
 fs/sysv/file.c              |    1 +
 fs/ubifs/debug.c            |    1 +
 fs/ubifs/dir.c              |    1 +
 fs/ubifs/file.c             |    1 +
 fs/udf/dir.c                |    1 +
 fs/udf/file.c               |    1 +
 fs/ufs/dir.c                |    1 +
 fs/ufs/file.c               |    1 +
 fs/xfs/linux-2.6/xfs_file.c |    2 ++
 63 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/fs/adfs/dir.c b/fs/adfs/dir.c
index 23aa52f..7106f32 100644
--- a/fs/adfs/dir.c
+++ b/fs/adfs/dir.c
@@ -198,6 +198,7 @@ const struct file_operations adfs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.readdir	= adfs_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int
diff --git a/fs/adfs/file.c b/fs/adfs/file.c
index 005ea34..97bd298 100644
--- a/fs/adfs/file.c
+++ b/fs/adfs/file.c
@@ -30,6 +30,7 @@ const struct file_operations adfs_file_operations = {
 	.write		= do_sync_write,
 	.aio_write	= generic_file_aio_write,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations adfs_file_inode_operations = {
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index 8ca8f3a..6cc5e43 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -22,6 +22,7 @@ const struct file_operations affs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.readdir	= affs_readdir,
 	.fsync		= affs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 /*
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 184e55c..d580a12 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -36,6 +36,7 @@ const struct file_operations affs_file_operations = {
 	.release	= affs_file_release,
 	.fsync		= affs_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations affs_file_inode_operations = {
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 615d549..6c46cb8 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -67,6 +67,7 @@ static const struct file_operations befs_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= befs_readdir,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations befs_dir_inode_operations = {
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index 1e41aad..d78015e 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -80,6 +80,7 @@ const struct file_operations bfs_dir_operations = {
 	.readdir	= bfs_readdir,
 	.fsync		= simple_fsync,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 extern void dump_imap(const char *, struct super_block *);
diff --git a/fs/bfs/file.c b/fs/bfs/file.c
index 88b9a3f..7f61ed6 100644
--- a/fs/bfs/file.c
+++ b/fs/bfs/file.c
@@ -29,6 +29,7 @@ const struct file_operations bfs_file_operations = {
 	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int bfs_move_block(unsigned long from, unsigned long to,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4b83397..6425f19 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1229,4 +1229,5 @@ struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 59cba18..a13b1b7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5233,6 +5233,7 @@ static struct file_operations btrfs_dir_file_operations = {
 #endif
 	.release        = btrfs_release_file,
 	.fsync		= btrfs_sync_file,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct extent_io_ops btrfs_extent_io_ops = {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 6d6d06c..58569ba 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -694,6 +694,7 @@ static const struct file_operations btrfs_ctl_fops = {
 	.unlocked_ioctl	 = btrfs_control_ioctl,
 	.compat_ioctl = btrfs_control_ioctl,
 	.owner	 = THIS_MODULE,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static struct miscdevice btrfs_misc = {
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index dd3634e..0927503 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -532,6 +532,7 @@ static const struct file_operations cramfs_directory_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= cramfs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations cramfs_dir_inode_operations = {
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index 9e94405..e6d2523 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -306,6 +306,7 @@ const struct file_operations ecryptfs_dir_fops = {
 	.fsync = ecryptfs_fsync,
 	.fasync = ecryptfs_fasync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct file_operations ecryptfs_main_fops = {
@@ -323,6 +324,7 @@ const struct file_operations ecryptfs_main_fops = {
 	.fsync = ecryptfs_fsync,
 	.fasync = ecryptfs_fasync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static int
diff --git a/fs/ecryptfs/miscdev.c b/fs/ecryptfs/miscdev.c
index 4ec8f61..9fd9b39 100644
--- a/fs/ecryptfs/miscdev.c
+++ b/fs/ecryptfs/miscdev.c
@@ -481,6 +481,7 @@ static const struct file_operations ecryptfs_miscdev_fops = {
 	.read    = ecryptfs_miscdev_read,
 	.write   = ecryptfs_miscdev_write,
 	.release = ecryptfs_miscdev_release,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static struct miscdevice ecryptfs_miscdev = {
diff --git a/fs/efs/dir.c b/fs/efs/dir.c
index 7ee6f7e..da344b8 100644
--- a/fs/efs/dir.c
+++ b/fs/efs/dir.c
@@ -13,6 +13,7 @@ const struct file_operations efs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= efs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations efs_dir_inode_operations = {
diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 4cfab1c..f6693d3 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -667,4 +667,5 @@ const struct file_operations exofs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= exofs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
index 839b9dc..257e9da 100644
--- a/fs/exofs/file.c
+++ b/fs/exofs/file.c
@@ -73,6 +73,7 @@ static int exofs_flush(struct file *file, fl_owner_t id)
 
 const struct file_operations exofs_file_operations = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index 530b4ca..e3fa353 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -841,6 +841,7 @@ const struct file_operations fat_dir_operations = {
 	.compat_ioctl	= fat_compat_dir_ioctl,
 #endif
 	.fsync		= fat_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int fat_get_short_entry(struct inode *dir, loff_t *pos,
diff --git a/fs/fat/file.c b/fs/fat/file.c
index f042b96..56351c2 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -162,6 +162,7 @@ const struct file_operations fat_file_operations = {
 	.ioctl		= fat_generic_ioctl,
 	.fsync		= fat_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int fat_cont_expand(struct inode *inode, loff_t size)
diff --git a/fs/freevxfs/vxfs_lookup.c b/fs/freevxfs/vxfs_lookup.c
index aee049c..3a09132 100644
--- a/fs/freevxfs/vxfs_lookup.c
+++ b/fs/freevxfs/vxfs_lookup.c
@@ -58,6 +58,7 @@ const struct inode_operations vxfs_dir_inode_ops = {
 
 const struct file_operations vxfs_dir_operations = {
 	.readdir =		vxfs_readdir,
+	.checkpoint =		generic_file_checkpoint,
 };
 
  
diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c
index 7c69b98..8d90a24 100644
--- a/fs/hfs/dir.c
+++ b/fs/hfs/dir.c
@@ -318,6 +318,7 @@ const struct file_operations hfs_dir_operations = {
 	.readdir	= hfs_readdir,
 	.llseek		= generic_file_llseek,
 	.release	= hfs_dir_release,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations hfs_dir_inode_operations = {
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index a1cbff2..bf8950f 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -607,6 +607,7 @@ static const struct file_operations hfs_file_operations = {
 	.fsync		= file_fsync,
 	.open		= hfs_file_open,
 	.release	= hfs_file_release,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations hfs_file_inode_operations = {
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 5f40236..41fbf2d 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -497,4 +497,5 @@ const struct file_operations hfsplus_dir_operations = {
 	.ioctl          = hfsplus_ioctl,
 	.llseek		= generic_file_llseek,
 	.release	= hfsplus_dir_release,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 1bcf597..19abd7e 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -286,6 +286,7 @@ static const struct file_operations hfsplus_file_operations = {
 	.open		= hfsplus_file_open,
 	.release	= hfsplus_file_release,
 	.ioctl          = hfsplus_ioctl,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 struct inode *hfsplus_new_inode(struct super_block *sb, int mode)
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 032604e..67e2356 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -417,6 +417,7 @@ int hostfs_fsync(struct file *file, struct dentry *dentry, int datasync)
 
 static const struct file_operations hostfs_file_fops = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.splice_read	= generic_file_splice_read,
 	.aio_read	= generic_file_aio_read,
@@ -430,6 +431,7 @@ static const struct file_operations hostfs_file_fops = {
 
 static const struct file_operations hostfs_dir_fops = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.readdir	= hostfs_readdir,
 	.read		= generic_read_dir,
 };
diff --git a/fs/hpfs/dir.c b/fs/hpfs/dir.c
index 8865c94..dcde10f 100644
--- a/fs/hpfs/dir.c
+++ b/fs/hpfs/dir.c
@@ -322,4 +322,5 @@ const struct file_operations hpfs_dir_ops =
 	.readdir	= hpfs_readdir,
 	.release	= hpfs_dir_release,
 	.fsync		= hpfs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 3efabff..f1211f0 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -139,6 +139,7 @@ const struct file_operations hpfs_file_ops =
 	.release	= hpfs_file_release,
 	.fsync		= hpfs_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations hpfs_file_iops =
diff --git a/fs/hppfs/hppfs.c b/fs/hppfs/hppfs.c
index a5089a6..f132fa2 100644
--- a/fs/hppfs/hppfs.c
+++ b/fs/hppfs/hppfs.c
@@ -546,6 +546,7 @@ static const struct file_operations hppfs_file_fops = {
 	.read		= hppfs_read,
 	.write		= hppfs_write,
 	.open		= hppfs_open,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 struct hppfs_dirent {
@@ -597,6 +598,7 @@ static const struct file_operations hppfs_dir_fops = {
 	.readdir	= hppfs_readdir,
 	.open		= hppfs_dir_open,
 	.fsync		= hppfs_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int hppfs_statfs(struct dentry *dentry, struct kstatfs *sf)
diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c
index 8ba5441..848059d 100644
--- a/fs/isofs/dir.c
+++ b/fs/isofs/dir.c
@@ -273,6 +273,7 @@ const struct file_operations isofs_dir_operations =
 {
 	.read = generic_read_dir,
 	.readdir = isofs_readdir,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /*
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 6f60cc9..c2b6487 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -41,6 +41,7 @@ const struct file_operations jffs2_dir_operations =
 	.unlocked_ioctl=jffs2_ioctl,
 	.fsync =	jffs2_fsync,
 	.llseek =	generic_file_llseek,
+	.checkpoint =	generic_file_checkpoint,
 };
 
 
diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c
index 23c9475..7e5d2e2 100644
--- a/fs/jffs2/file.c
+++ b/fs/jffs2/file.c
@@ -50,6 +50,7 @@ const struct file_operations jffs2_file_operations =
 	.mmap =		generic_file_readonly_mmap,
 	.fsync =	jffs2_fsync,
 	.splice_read =	generic_file_splice_read,
+	.checkpoint =	generic_file_checkpoint,
 };
 
 /* jffs2_file_inode_operations */
diff --git a/fs/jfs/file.c b/fs/jfs/file.c
index 7f6063a..90ab090 100644
--- a/fs/jfs/file.c
+++ b/fs/jfs/file.c
@@ -116,4 +116,5 @@ const struct file_operations jfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= jfs_compat_ioctl,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 514ee2e..5486db9 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -1556,6 +1556,7 @@ const struct file_operations jfs_dir_operations = {
 	.compat_ioctl	= jfs_compat_ioctl,
 #endif
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int jfs_ci_hash(struct dentry *dir, struct qstr *this)
diff --git a/fs/minix/dir.c b/fs/minix/dir.c
index d407e7a..9cf04af 100644
--- a/fs/minix/dir.c
+++ b/fs/minix/dir.c
@@ -23,6 +23,7 @@ const struct file_operations minix_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= minix_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static inline void dir_put_page(struct page *page)
diff --git a/fs/minix/file.c b/fs/minix/file.c
index 3eec3e6..2048d09 100644
--- a/fs/minix/file.c
+++ b/fs/minix/file.c
@@ -21,6 +21,7 @@ const struct file_operations minix_file_operations = {
 	.mmap		= generic_file_mmap,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations minix_file_inode_operations = {
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 32062c3..43b9025 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -63,6 +63,7 @@ const struct file_operations nfs_dir_operations = {
 	.open		= nfs_opendir,
 	.release	= nfs_release,
 	.fsync		= nfs_fsync_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations nfs_dir_inode_operations = {
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 0506232..813fd8d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -78,6 +78,7 @@ const struct file_operations nfs_file_operations = {
 	.splice_write	= nfs_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= nfs_setlease,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations nfs_file_inode_operations = {
diff --git a/fs/nilfs2/dir.c b/fs/nilfs2/dir.c
index 1a4fa04..afa0f80 100644
--- a/fs/nilfs2/dir.c
+++ b/fs/nilfs2/dir.c
@@ -706,5 +706,5 @@ struct file_operations nilfs_dir_operations = {
 	.compat_ioctl	= nilfs_ioctl,
 #endif	/* CONFIG_COMPAT */
 	.fsync		= nilfs_sync_file,
-
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 6bd84a0..0f27ab5 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -136,6 +136,7 @@ static int nilfs_file_mmap(struct file *file, struct vm_area_struct *vma)
  */
 struct file_operations nilfs_file_operations = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
diff --git a/fs/ntfs/dir.c b/fs/ntfs/dir.c
index 5a9e344..4fe3759 100644
--- a/fs/ntfs/dir.c
+++ b/fs/ntfs/dir.c
@@ -1572,4 +1572,5 @@ const struct file_operations ntfs_dir_ops = {
 	/*.ioctl	= ,*/			/* Perform function on the
 						   mounted filesystem. */
 	.open		= ntfs_dir_open,	/* Open directory. */
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 3140a44..3f75c5e 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2272,7 +2272,7 @@ const struct file_operations ntfs_file_ops = {
 						    mounted filesystem. */
 	.mmap		= generic_file_mmap,	 /* Mmap file. */
 	.open		= ntfs_file_open,	 /* Open file. */
-	.splice_read	= generic_file_splice_read /* Zero-copy data send with
+	.splice_read	= generic_file_splice_read, /* Zero-copy data send with
 						    the data source being on
 						    the ntfs partition.  We do
 						    not need to care about the
@@ -2282,6 +2282,7 @@ const struct file_operations ntfs_file_ops = {
 						    on the ntfs partition.  We
 						    do not need to care about
 						    the data source. */
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ntfs_file_inode_ops = {
diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c
index c7275cf..5ed9cb3 100644
--- a/fs/omfs/dir.c
+++ b/fs/omfs/dir.c
@@ -502,4 +502,5 @@ struct file_operations omfs_dir_operations = {
 	.read = generic_read_dir,
 	.readdir = omfs_readdir,
 	.llseek = generic_file_llseek,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/omfs/file.c b/fs/omfs/file.c
index d17e774..c85fb31 100644
--- a/fs/omfs/file.c
+++ b/fs/omfs/file.c
@@ -331,6 +331,7 @@ struct file_operations omfs_file_operations = {
 	.mmap = generic_file_mmap,
 	.fsync = simple_fsync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 struct inode_operations omfs_file_inops = {
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index ffcd04f..d1f0677 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -160,6 +160,7 @@ static const struct file_operations openpromfs_prop_ops = {
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= seq_release,
+	.checkpoint	= NULL,
 };
 
 static int openpromfs_readdir(struct file *, void *, filldir_t);
@@ -168,6 +169,7 @@ static const struct file_operations openprom_operations = {
 	.read		= generic_read_dir,
 	.readdir	= openpromfs_readdir,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct dentry *openpromfs_lookup(struct inode *, struct dentry *, struct nameidata *);
diff --git a/fs/qnx4/dir.c b/fs/qnx4/dir.c
index 003c68f..ca99e01 100644
--- a/fs/qnx4/dir.c
+++ b/fs/qnx4/dir.c
@@ -80,6 +80,7 @@ const struct file_operations qnx4_dir_operations =
 	.read		= generic_read_dir,
 	.readdir	= qnx4_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations qnx4_dir_inode_operations =
diff --git a/fs/qnx4/file.c b/fs/qnx4/file.c
index 09b170a..8aaa882 100644
--- a/fs/qnx4/file.c
+++ b/fs/qnx4/file.c
@@ -30,6 +30,7 @@ const struct file_operations qnx4_file_operations =
 	.aio_write	= generic_file_aio_write,
 	.fsync		= simple_fsync,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations qnx4_file_inode_operations =
diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 78f613c..4430239 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -47,6 +47,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index 11f0c06..d645bb3 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -45,6 +45,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read		= generic_file_splice_read,
 	.splice_write		= generic_file_splice_write,
 	.llseek			= generic_file_llseek,
+	.checkpoint		= generic_file_checkpoint,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/read_write.c b/fs/read_write.c
index d331975..d314234 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -27,6 +27,7 @@ const struct file_operations generic_ro_fops = {
 	.aio_read	= generic_file_aio_read,
 	.mmap		= generic_file_readonly_mmap,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 EXPORT_SYMBOL(generic_ro_fops);
diff --git a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c
index 6d2668f..40ce3fa 100644
--- a/fs/reiserfs/dir.c
+++ b/fs/reiserfs/dir.c
@@ -24,6 +24,7 @@ const struct file_operations reiserfs_dir_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = reiserfs_compat_ioctl,
 #endif
+	.checkpoint = generic_file_checkpoint,
 };
 
 static int reiserfs_dir_fsync(struct file *filp, struct dentry *dentry,
diff --git a/fs/reiserfs/file.c b/fs/reiserfs/file.c
index 9f43666..a5dcc02 100644
--- a/fs/reiserfs/file.c
+++ b/fs/reiserfs/file.c
@@ -297,6 +297,7 @@ const struct file_operations reiserfs_file_operations = {
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
 	.llseek = generic_file_llseek,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations reiserfs_file_inode_operations = {
diff --git a/fs/romfs/mmap-nommu.c b/fs/romfs/mmap-nommu.c
index f0511e8..03c24d9 100644
--- a/fs/romfs/mmap-nommu.c
+++ b/fs/romfs/mmap-nommu.c
@@ -72,4 +72,5 @@ const struct file_operations romfs_ro_fops = {
 	.splice_read		= generic_file_splice_read,
 	.mmap			= romfs_mmap,
 	.get_unmapped_area	= romfs_get_unmapped_area,
+	.checkpoint		= generic_file_checkpoint,
 };
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 4ab3c03..7b6e951 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -282,6 +282,7 @@ error:
 static const struct file_operations romfs_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= romfs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct inode_operations romfs_dir_inode_operations = {
diff --git a/fs/squashfs/dir.c b/fs/squashfs/dir.c
index 566b0ea..b0c5336 100644
--- a/fs/squashfs/dir.c
+++ b/fs/squashfs/dir.c
@@ -231,5 +231,6 @@ failed_read:
 
 const struct file_operations squashfs_dir_ops = {
 	.read = generic_read_dir,
-	.readdir = squashfs_readdir
+	.readdir = squashfs_readdir,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c
index 4e50286..53acd29 100644
--- a/fs/sysv/dir.c
+++ b/fs/sysv/dir.c
@@ -25,6 +25,7 @@ const struct file_operations sysv_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= sysv_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static inline void dir_put_page(struct page *page)
diff --git a/fs/sysv/file.c b/fs/sysv/file.c
index 96340c0..aee556d 100644
--- a/fs/sysv/file.c
+++ b/fs/sysv/file.c
@@ -28,6 +28,7 @@ const struct file_operations sysv_file_operations = {
 	.mmap		= generic_file_mmap,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations sysv_file_inode_operations = {
diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c
index ce2cd83..69a8892 100644
--- a/fs/ubifs/debug.c
+++ b/fs/ubifs/debug.c
@@ -2530,6 +2530,7 @@ static ssize_t write_debugfs_file(struct file *file, const char __user *buf,
 static const struct file_operations dfs_fops = {
 	.open = open_debugfs_file,
 	.write = write_debugfs_file,
+	.checkpoint = generic_file_checkpoint,
 	.owner = THIS_MODULE,
 };
 
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 552fb01..89ab2aa 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -1228,4 +1228,5 @@ const struct file_operations ubifs_dir_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ubifs_compat_ioctl,
 #endif
+	.checkpoint     = generic_file_checkpoint,
 };
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 6d34dc7..2d94676 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1595,4 +1595,5 @@ const struct file_operations ubifs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ubifs_compat_ioctl,
 #endif
+	.checkpoint     = generic_file_checkpoint,
 };
diff --git a/fs/udf/dir.c b/fs/udf/dir.c
index 61d9a76..6586dbe 100644
--- a/fs/udf/dir.c
+++ b/fs/udf/dir.c
@@ -211,4 +211,5 @@ const struct file_operations udf_dir_operations = {
 	.readdir		= udf_readdir,
 	.ioctl			= udf_ioctl,
 	.fsync			= simple_fsync,
+	.checkpoint		= generic_file_checkpoint,
 };
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 7464305..33c63a8 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -212,6 +212,7 @@ const struct file_operations udf_file_operations = {
 	.fsync			= simple_fsync,
 	.splice_read		= generic_file_splice_read,
 	.llseek			= generic_file_llseek,
+	.checkpoint		= generic_file_checkpoint,
 };
 
 const struct inode_operations udf_file_inode_operations = {
diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index 6f671f1..9379010 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -668,4 +668,5 @@ const struct file_operations ufs_dir_operations = {
 	.readdir	= ufs_readdir,
 	.fsync		= simple_fsync,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ufs/file.c b/fs/ufs/file.c
index 73655c6..15c8616 100644
--- a/fs/ufs/file.c
+++ b/fs/ufs/file.c
@@ -43,4 +43,5 @@ const struct file_operations ufs_file_operations = {
 	.open           = generic_file_open,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index 0542fd5..2b4bdb6 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -257,6 +257,7 @@ const struct file_operations xfs_file_operations = {
 #ifdef HAVE_FOP_OPEN_EXEC
 	.open_exec	= xfs_file_open_exec,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct file_operations xfs_dir_file_operations = {
@@ -269,6 +270,7 @@ const struct file_operations xfs_dir_file_operations = {
 	.compat_ioctl	= xfs_file_compat_ioctl,
 #endif
 	.fsync		= xfs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct vm_operations_struct xfs_file_vm_ops = {
-- 
1.6.0.4



* [PATCH v18 41/80] Add the checkpoint operation for opened files of generic filesystems
@ 2009-09-23 23:51   ` Oren Laadan
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Matt Helsley,
	linux-fsdevel

From: Matt Helsley <matthltc@us.ibm.com>

This patch extends the generic file checkpoint operation to non-extX
filesystems whose lseek operations ensure that open files can be saved
and restored for later use. It does not cover things like FUSE, network
filesystems, or pseudo-filesystem kernel interfaces.

Compile- and boot-tested only (on x86-32).

[Oren Laadan] Folded patch series into a single patch; original post
included 36 separate patches for individual filesystems:

  [PATCH 01/36] Add the checkpoint operation for affs files and directories.
  [PATCH 02/36] Add the checkpoint operation for befs directories.
  [PATCH 03/36] Add the checkpoint operation for bfs files and directories.
  [PATCH 04/36] Add the checkpoint operation for btrfs files and directories.
  [PATCH 05/36] Add the checkpoint operation for cramfs directories.
  [PATCH 06/36] Add the checkpoint operation for ecryptfs files and directories.
  [PATCH 07/36] Add the checkpoint operation for fat files and directories.
  [PATCH 08/36] Add the checkpoint operation for freevxfs directories.
  [PATCH 09/36] Add the checkpoint operation for hfs files and directories.
  [PATCH 10/36] Add the checkpoint operation for hfsplus files and directories.
  [PATCH 11/36] Add the checkpoint operation for hpfs files and directories.
  [PATCH 12/36] Add the checkpoint operation for hppfs files and directories.
  [PATCH 13/36] Add the checkpoint operation for iso directories.
  [PATCH 14/36] Add the checkpoint operation for jffs2 files and directories.
  [PATCH 15/36] Add the checkpoint operation for jfs files and directories.
  [PATCH 16/36] Add the checkpoint operation for regular nfs files and directories. Skip the various /proc files for now.
  [PATCH 17/36] Add the checkpoint operation for ntfs directories.
  [PATCH 18/36] Add the checkpoint operation for openpromfs directories. Explicitly skip the properties for now.
  [PATCH 19/36] Add the checkpoint operation for qnx4 files and directories.
  [PATCH 20/36] Add the checkpoint operation for reiserfs files and directories.
  [PATCH 21/36] Add the checkpoint operation for romfs directories.
  [PATCH 22/36] Add the checkpoint operation for squashfs directories.
  [PATCH 23/36] Add the checkpoint operation for sysv filesystem files and directories.
  [PATCH 24/36] Add the checkpoint operation for ubifs files and directories.
  [PATCH 25/36] Add the checkpoint operation for udf filesystem files and directories.
  [PATCH 26/36] Add the checkpoint operation for xfs files and directories.
  [PATCH 27/36] Add checkpoint operation for efs directories.
  [PATCH 28/36] Add the checkpoint operation for generic, read-only files. At present, some/all files of the following filesystems use this generic definition:
  [PATCH 29/36] Add checkpoint operation for minix filesystem files and directories.
  [PATCH 30/36] Add checkpoint operations for omfs files and directories.
  [PATCH 31/36] Add checkpoint operations for ufs files and directories.
  [PATCH 32/36] Add checkpoint operations for ramfs files. NOTE: since simple_dir_operations are shared between multiple filesystems including ramfs, it's not currently possible to checkpoint open ramfs directories.
  [PATCH 33/36] Add the checkpoint operation for adfs files and directories.
  [PATCH 34/36] Add the checkpoint operation to exofs files and directories.
  [PATCH 35/36] Add the checkpoint operation to nilfs2 files and directories.
  [PATCH 36/36] Add checkpoint operations for UML host filesystem files and directories.
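
The note on [PATCH 32/36] above turns on one operations table being
shared by several filesystems. The following is a small userspace
sketch of that constraint, not kernel code: the model_* types and
functions are hypothetical, and the shared table is left non-const
only so that the effect of setting the hook can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a file_operations table with one optional hook. */
struct model_dir_ops {
	int (*checkpoint)(void);
};

/* Hypothetical stand-in for a filesystem that points at a dir ops table. */
struct model_fs {
	const char *name;
	const struct model_dir_ops *dir_ops;
};

/*
 * One table used by several filesystems, in the role that
 * simple_dir_operations plays in the note above.
 */
static struct model_dir_ops model_shared_dir_ops = { .checkpoint = NULL };

static const struct model_fs model_ramfs_like = {
	"ramfs-like", &model_shared_dir_ops
};
static const struct model_fs model_other_like = {
	"other-like", &model_shared_dir_ops
};

/* Placeholder hook; a real one would record directory state. */
static int model_dir_checkpoint(void)
{
	return 0;
}

/* A filesystem "supports" directory checkpoint iff its table has the hook. */
static int model_supports_dir_checkpoint(const struct model_fs *fs)
{
	assert(fs != NULL && fs->dir_ops != NULL);
	return fs->dir_ops->checkpoint != NULL;
}

/* Setting the hook edits the shared table, so every user of it changes. */
static void model_enable_shared_dir_checkpoint(void)
{
	model_shared_dir_ops.checkpoint = model_dir_checkpoint;
}
```

In this model there is no way to turn the hook on for the ramfs-like
filesystem alone: after model_enable_shared_dir_checkpoint() both
filesystems report support, which is the situation the note avoids by
leaving the shared table untouched.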

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/adfs/dir.c               |    1 +
 fs/adfs/file.c              |    1 +
 fs/affs/dir.c               |    1 +
 fs/affs/file.c              |    1 +
 fs/befs/linuxvfs.c          |    1 +
 fs/bfs/dir.c                |    1 +
 fs/bfs/file.c               |    1 +
 fs/btrfs/file.c             |    1 +
 fs/btrfs/inode.c            |    1 +
 fs/btrfs/super.c            |    1 +
 fs/cramfs/inode.c           |    1 +
 fs/ecryptfs/file.c          |    2 ++
 fs/ecryptfs/miscdev.c       |    1 +
 fs/efs/dir.c                |    1 +
 fs/exofs/dir.c              |    1 +
 fs/exofs/file.c             |    1 +
 fs/fat/dir.c                |    1 +
 fs/fat/file.c               |    1 +
 fs/freevxfs/vxfs_lookup.c   |    1 +
 fs/hfs/dir.c                |    1 +
 fs/hfs/inode.c              |    1 +
 fs/hfsplus/dir.c            |    1 +
 fs/hfsplus/inode.c          |    1 +
 fs/hostfs/hostfs_kern.c     |    2 ++
 fs/hpfs/dir.c               |    1 +
 fs/hpfs/file.c              |    1 +
 fs/hppfs/hppfs.c            |    2 ++
 fs/isofs/dir.c              |    1 +
 fs/jffs2/dir.c              |    1 +
 fs/jffs2/file.c             |    1 +
 fs/jfs/file.c               |    1 +
 fs/jfs/namei.c              |    1 +
 fs/minix/dir.c              |    1 +
 fs/minix/file.c             |    1 +
 fs/nfs/dir.c                |    1 +
 fs/nfs/file.c               |    1 +
 fs/nilfs2/dir.c             |    2 +-
 fs/nilfs2/file.c            |    1 +
 fs/ntfs/dir.c               |    1 +
 fs/ntfs/file.c              |    3 ++-
 fs/omfs/dir.c               |    1 +
 fs/omfs/file.c              |    1 +
 fs/openpromfs/inode.c       |    2 ++
 fs/qnx4/dir.c               |    1 +
 fs/qnx4/file.c              |    1 +
 fs/ramfs/file-mmu.c         |    1 +
 fs/ramfs/file-nommu.c       |    1 +
 fs/read_write.c             |    1 +
 fs/reiserfs/dir.c           |    1 +
 fs/reiserfs/file.c          |    1 +
 fs/romfs/mmap-nommu.c       |    1 +
 fs/romfs/super.c            |    1 +
 fs/squashfs/dir.c           |    3 ++-
 fs/sysv/dir.c               |    1 +
 fs/sysv/file.c              |    1 +
 fs/ubifs/debug.c            |    1 +
 fs/ubifs/dir.c              |    1 +
 fs/ubifs/file.c             |    1 +
 fs/udf/dir.c                |    1 +
 fs/udf/file.c               |    1 +
 fs/ufs/dir.c                |    1 +
 fs/ufs/file.c               |    1 +
 fs/xfs/linux-2.6/xfs_file.c |    2 ++
 63 files changed, 70 insertions(+), 3 deletions(-)
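
A minimal userspace sketch of the mechanism the description above
relies on, not the kernel implementation. It assumes that a seekable
file can be described by its path, open flags, and current offset, and
that a NULL hook marks a file type as not checkpointable (as the
explicit ".checkpoint = NULL" for the openpromfs properties suggests).
All model_* names, the image layout, and the -EBADF return value are
hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

struct model_file;

/* Hypothetical image: what is needed to reopen a seekable file. */
struct model_image {
	const char *path;	/* borrowed from the model_file */
	int flags;
	off_t pos;
};

/* Hypothetical ops table; NULL hook means "cannot be checkpointed". */
struct model_file_ops {
	int (*checkpoint)(const struct model_file *f, struct model_image *img);
};

struct model_file {
	int fd;
	int flags;
	const char *path;
	const struct model_file_ops *ops;
};

/* Generic hook: only meaningful when the file position can be queried. */
static int model_generic_checkpoint(const struct model_file *f,
				    struct model_image *img)
{
	off_t pos = lseek(f->fd, 0, SEEK_CUR);

	if (pos == (off_t)-1)
		return -errno;
	img->path = f->path;
	img->flags = f->flags;
	img->pos = pos;
	return 0;
}

/* Dispatch through the table; refuse files whose table has no hook. */
static int model_checkpoint(const struct model_file *f, struct model_image *img)
{
	assert(f != NULL && f->ops != NULL && img != NULL);
	if (!f->ops->checkpoint)
		return -EBADF;	/* assumed error code */
	return f->ops->checkpoint(f, img);
}

/* Restore: reopen by path with the saved flags, then seek to the offset. */
static int model_restore(const struct model_image *img, struct model_file *f,
			 const struct model_file_ops *ops)
{
	int fd = open(img->path, img->flags);

	if (fd < 0)
		return -errno;
	if (lseek(fd, img->pos, SEEK_SET) == (off_t)-1) {
		int err = -errno;

		close(fd);
		return err;
	}
	f->fd = fd;
	f->flags = img->flags;
	f->path = img->path;
	f->ops = ops;
	return 0;
}

static const struct model_file_ops model_seekable_ops = {
	.checkpoint = model_generic_checkpoint,
};

static const struct model_file_ops model_opaque_ops = {
	.checkpoint = NULL,
};
```

A caller would open a file, read part of it, call model_checkpoint(),
close the descriptor, and later call model_restore() to continue from
the same offset. A file whose table is model_opaque_ops is rejected
up front, which mirrors how filesystems left out of this patch keep
their open files from being checkpointed.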

diff --git a/fs/adfs/dir.c b/fs/adfs/dir.c
index 23aa52f..7106f32 100644
--- a/fs/adfs/dir.c
+++ b/fs/adfs/dir.c
@@ -198,6 +198,7 @@ const struct file_operations adfs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.readdir	= adfs_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int
diff --git a/fs/adfs/file.c b/fs/adfs/file.c
index 005ea34..97bd298 100644
--- a/fs/adfs/file.c
+++ b/fs/adfs/file.c
@@ -30,6 +30,7 @@ const struct file_operations adfs_file_operations = {
 	.write		= do_sync_write,
 	.aio_write	= generic_file_aio_write,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations adfs_file_inode_operations = {
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index 8ca8f3a..6cc5e43 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -22,6 +22,7 @@ const struct file_operations affs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.readdir	= affs_readdir,
 	.fsync		= affs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 /*
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 184e55c..d580a12 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -36,6 +36,7 @@ const struct file_operations affs_file_operations = {
 	.release	= affs_file_release,
 	.fsync		= affs_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations affs_file_inode_operations = {
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 615d549..6c46cb8 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -67,6 +67,7 @@ static const struct file_operations befs_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= befs_readdir,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations befs_dir_inode_operations = {
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index 1e41aad..d78015e 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -80,6 +80,7 @@ const struct file_operations bfs_dir_operations = {
 	.readdir	= bfs_readdir,
 	.fsync		= simple_fsync,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 extern void dump_imap(const char *, struct super_block *);
diff --git a/fs/bfs/file.c b/fs/bfs/file.c
index 88b9a3f..7f61ed6 100644
--- a/fs/bfs/file.c
+++ b/fs/bfs/file.c
@@ -29,6 +29,7 @@ const struct file_operations bfs_file_operations = {
 	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int bfs_move_block(unsigned long from, unsigned long to,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4b83397..6425f19 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1229,4 +1229,5 @@ struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 59cba18..a13b1b7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5233,6 +5233,7 @@ static struct file_operations btrfs_dir_file_operations = {
 #endif
 	.release        = btrfs_release_file,
 	.fsync		= btrfs_sync_file,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct extent_io_ops btrfs_extent_io_ops = {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 6d6d06c..58569ba 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -694,6 +694,7 @@ static const struct file_operations btrfs_ctl_fops = {
 	.unlocked_ioctl	 = btrfs_control_ioctl,
 	.compat_ioctl = btrfs_control_ioctl,
 	.owner	 = THIS_MODULE,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static struct miscdevice btrfs_misc = {
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index dd3634e..0927503 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -532,6 +532,7 @@ static const struct file_operations cramfs_directory_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= cramfs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations cramfs_dir_inode_operations = {
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index 9e94405..e6d2523 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -306,6 +306,7 @@ const struct file_operations ecryptfs_dir_fops = {
 	.fsync = ecryptfs_fsync,
 	.fasync = ecryptfs_fasync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct file_operations ecryptfs_main_fops = {
@@ -323,6 +324,7 @@ const struct file_operations ecryptfs_main_fops = {
 	.fsync = ecryptfs_fsync,
 	.fasync = ecryptfs_fasync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static int
diff --git a/fs/ecryptfs/miscdev.c b/fs/ecryptfs/miscdev.c
index 4ec8f61..9fd9b39 100644
--- a/fs/ecryptfs/miscdev.c
+++ b/fs/ecryptfs/miscdev.c
@@ -481,6 +481,7 @@ static const struct file_operations ecryptfs_miscdev_fops = {
 	.read    = ecryptfs_miscdev_read,
 	.write   = ecryptfs_miscdev_write,
 	.release = ecryptfs_miscdev_release,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static struct miscdevice ecryptfs_miscdev = {
diff --git a/fs/efs/dir.c b/fs/efs/dir.c
index 7ee6f7e..da344b8 100644
--- a/fs/efs/dir.c
+++ b/fs/efs/dir.c
@@ -13,6 +13,7 @@ const struct file_operations efs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= efs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations efs_dir_inode_operations = {
diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 4cfab1c..f6693d3 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -667,4 +667,5 @@ const struct file_operations exofs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= exofs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
index 839b9dc..257e9da 100644
--- a/fs/exofs/file.c
+++ b/fs/exofs/file.c
@@ -73,6 +73,7 @@ static int exofs_flush(struct file *file, fl_owner_t id)
 
 const struct file_operations exofs_file_operations = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index 530b4ca..e3fa353 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -841,6 +841,7 @@ const struct file_operations fat_dir_operations = {
 	.compat_ioctl	= fat_compat_dir_ioctl,
 #endif
 	.fsync		= fat_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int fat_get_short_entry(struct inode *dir, loff_t *pos,
diff --git a/fs/fat/file.c b/fs/fat/file.c
index f042b96..56351c2 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -162,6 +162,7 @@ const struct file_operations fat_file_operations = {
 	.ioctl		= fat_generic_ioctl,
 	.fsync		= fat_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int fat_cont_expand(struct inode *inode, loff_t size)
diff --git a/fs/freevxfs/vxfs_lookup.c b/fs/freevxfs/vxfs_lookup.c
index aee049c..3a09132 100644
--- a/fs/freevxfs/vxfs_lookup.c
+++ b/fs/freevxfs/vxfs_lookup.c
@@ -58,6 +58,7 @@ const struct inode_operations vxfs_dir_inode_ops = {
 
 const struct file_operations vxfs_dir_operations = {
 	.readdir =		vxfs_readdir,
+	.checkpoint =		generic_file_checkpoint,
 };
 
  
diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c
index 7c69b98..8d90a24 100644
--- a/fs/hfs/dir.c
+++ b/fs/hfs/dir.c
@@ -318,6 +318,7 @@ const struct file_operations hfs_dir_operations = {
 	.readdir	= hfs_readdir,
 	.llseek		= generic_file_llseek,
 	.release	= hfs_dir_release,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations hfs_dir_inode_operations = {
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index a1cbff2..bf8950f 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -607,6 +607,7 @@ static const struct file_operations hfs_file_operations = {
 	.fsync		= file_fsync,
 	.open		= hfs_file_open,
 	.release	= hfs_file_release,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations hfs_file_inode_operations = {
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 5f40236..41fbf2d 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -497,4 +497,5 @@ const struct file_operations hfsplus_dir_operations = {
 	.ioctl          = hfsplus_ioctl,
 	.llseek		= generic_file_llseek,
 	.release	= hfsplus_dir_release,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 1bcf597..19abd7e 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -286,6 +286,7 @@ static const struct file_operations hfsplus_file_operations = {
 	.open		= hfsplus_file_open,
 	.release	= hfsplus_file_release,
 	.ioctl          = hfsplus_ioctl,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 struct inode *hfsplus_new_inode(struct super_block *sb, int mode)
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 032604e..67e2356 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -417,6 +417,7 @@ int hostfs_fsync(struct file *file, struct dentry *dentry, int datasync)
 
 static const struct file_operations hostfs_file_fops = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.splice_read	= generic_file_splice_read,
 	.aio_read	= generic_file_aio_read,
@@ -430,6 +431,7 @@ static const struct file_operations hostfs_file_fops = {
 
 static const struct file_operations hostfs_dir_fops = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.readdir	= hostfs_readdir,
 	.read		= generic_read_dir,
 };
diff --git a/fs/hpfs/dir.c b/fs/hpfs/dir.c
index 8865c94..dcde10f 100644
--- a/fs/hpfs/dir.c
+++ b/fs/hpfs/dir.c
@@ -322,4 +322,5 @@ const struct file_operations hpfs_dir_ops =
 	.readdir	= hpfs_readdir,
 	.release	= hpfs_dir_release,
 	.fsync		= hpfs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 3efabff..f1211f0 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -139,6 +139,7 @@ const struct file_operations hpfs_file_ops =
 	.release	= hpfs_file_release,
 	.fsync		= hpfs_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations hpfs_file_iops =
diff --git a/fs/hppfs/hppfs.c b/fs/hppfs/hppfs.c
index a5089a6..f132fa2 100644
--- a/fs/hppfs/hppfs.c
+++ b/fs/hppfs/hppfs.c
@@ -546,6 +546,7 @@ static const struct file_operations hppfs_file_fops = {
 	.read		= hppfs_read,
 	.write		= hppfs_write,
 	.open		= hppfs_open,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 struct hppfs_dirent {
@@ -597,6 +598,7 @@ static const struct file_operations hppfs_dir_fops = {
 	.readdir	= hppfs_readdir,
 	.open		= hppfs_dir_open,
 	.fsync		= hppfs_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int hppfs_statfs(struct dentry *dentry, struct kstatfs *sf)
diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c
index 8ba5441..848059d 100644
--- a/fs/isofs/dir.c
+++ b/fs/isofs/dir.c
@@ -273,6 +273,7 @@ const struct file_operations isofs_dir_operations =
 {
 	.read = generic_read_dir,
 	.readdir = isofs_readdir,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /*
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 6f60cc9..c2b6487 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -41,6 +41,7 @@ const struct file_operations jffs2_dir_operations =
 	.unlocked_ioctl=jffs2_ioctl,
 	.fsync =	jffs2_fsync,
 	.llseek =	generic_file_llseek,
+	.checkpoint =	generic_file_checkpoint,
 };
 
 
diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c
index 23c9475..7e5d2e2 100644
--- a/fs/jffs2/file.c
+++ b/fs/jffs2/file.c
@@ -50,6 +50,7 @@ const struct file_operations jffs2_file_operations =
 	.mmap =		generic_file_readonly_mmap,
 	.fsync =	jffs2_fsync,
 	.splice_read =	generic_file_splice_read,
+	.checkpoint =	generic_file_checkpoint,
 };
 
 /* jffs2_file_inode_operations */
diff --git a/fs/jfs/file.c b/fs/jfs/file.c
index 7f6063a..90ab090 100644
--- a/fs/jfs/file.c
+++ b/fs/jfs/file.c
@@ -116,4 +116,5 @@ const struct file_operations jfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= jfs_compat_ioctl,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 514ee2e..5486db9 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -1556,6 +1556,7 @@ const struct file_operations jfs_dir_operations = {
 	.compat_ioctl	= jfs_compat_ioctl,
 #endif
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int jfs_ci_hash(struct dentry *dir, struct qstr *this)
diff --git a/fs/minix/dir.c b/fs/minix/dir.c
index d407e7a..9cf04af 100644
--- a/fs/minix/dir.c
+++ b/fs/minix/dir.c
@@ -23,6 +23,7 @@ const struct file_operations minix_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= minix_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static inline void dir_put_page(struct page *page)
diff --git a/fs/minix/file.c b/fs/minix/file.c
index 3eec3e6..2048d09 100644
--- a/fs/minix/file.c
+++ b/fs/minix/file.c
@@ -21,6 +21,7 @@ const struct file_operations minix_file_operations = {
 	.mmap		= generic_file_mmap,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations minix_file_inode_operations = {
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 32062c3..43b9025 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -63,6 +63,7 @@ const struct file_operations nfs_dir_operations = {
 	.open		= nfs_opendir,
 	.release	= nfs_release,
 	.fsync		= nfs_fsync_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations nfs_dir_inode_operations = {
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 0506232..813fd8d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -78,6 +78,7 @@ const struct file_operations nfs_file_operations = {
 	.splice_write	= nfs_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= nfs_setlease,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations nfs_file_inode_operations = {
diff --git a/fs/nilfs2/dir.c b/fs/nilfs2/dir.c
index 1a4fa04..afa0f80 100644
--- a/fs/nilfs2/dir.c
+++ b/fs/nilfs2/dir.c
@@ -706,5 +706,5 @@ struct file_operations nilfs_dir_operations = {
 	.compat_ioctl	= nilfs_ioctl,
 #endif	/* CONFIG_COMPAT */
 	.fsync		= nilfs_sync_file,
-
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 6bd84a0..0f27ab5 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -136,6 +136,7 @@ static int nilfs_file_mmap(struct file *file, struct vm_area_struct *vma)
  */
 struct file_operations nilfs_file_operations = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
diff --git a/fs/ntfs/dir.c b/fs/ntfs/dir.c
index 5a9e344..4fe3759 100644
--- a/fs/ntfs/dir.c
+++ b/fs/ntfs/dir.c
@@ -1572,4 +1572,5 @@ const struct file_operations ntfs_dir_ops = {
 	/*.ioctl	= ,*/			/* Perform function on the
 						   mounted filesystem. */
 	.open		= ntfs_dir_open,	/* Open directory. */
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 3140a44..3f75c5e 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2272,7 +2272,7 @@ const struct file_operations ntfs_file_ops = {
 						    mounted filesystem. */
 	.mmap		= generic_file_mmap,	 /* Mmap file. */
 	.open		= ntfs_file_open,	 /* Open file. */
-	.splice_read	= generic_file_splice_read /* Zero-copy data send with
+	.splice_read	= generic_file_splice_read, /* Zero-copy data send with
 						    the data source being on
 						    the ntfs partition.  We do
 						    not need to care about the
@@ -2282,6 +2282,7 @@ const struct file_operations ntfs_file_ops = {
 						    on the ntfs partition.  We
 						    do not need to care about
 						    the data source. */
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ntfs_file_inode_ops = {
diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c
index c7275cf..5ed9cb3 100644
--- a/fs/omfs/dir.c
+++ b/fs/omfs/dir.c
@@ -502,4 +502,5 @@ struct file_operations omfs_dir_operations = {
 	.read = generic_read_dir,
 	.readdir = omfs_readdir,
 	.llseek = generic_file_llseek,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/omfs/file.c b/fs/omfs/file.c
index d17e774..c85fb31 100644
--- a/fs/omfs/file.c
+++ b/fs/omfs/file.c
@@ -331,6 +331,7 @@ struct file_operations omfs_file_operations = {
 	.mmap = generic_file_mmap,
 	.fsync = simple_fsync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 struct inode_operations omfs_file_inops = {
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index ffcd04f..d1f0677 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -160,6 +160,7 @@ static const struct file_operations openpromfs_prop_ops = {
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= seq_release,
+	.checkpoint	= NULL,
 };
 
 static int openpromfs_readdir(struct file *, void *, filldir_t);
@@ -168,6 +169,7 @@ static const struct file_operations openprom_operations = {
 	.read		= generic_read_dir,
 	.readdir	= openpromfs_readdir,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct dentry *openpromfs_lookup(struct inode *, struct dentry *, struct nameidata *);
diff --git a/fs/qnx4/dir.c b/fs/qnx4/dir.c
index 003c68f..ca99e01 100644
--- a/fs/qnx4/dir.c
+++ b/fs/qnx4/dir.c
@@ -80,6 +80,7 @@ const struct file_operations qnx4_dir_operations =
 	.read		= generic_read_dir,
 	.readdir	= qnx4_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations qnx4_dir_inode_operations =
diff --git a/fs/qnx4/file.c b/fs/qnx4/file.c
index 09b170a..8aaa882 100644
--- a/fs/qnx4/file.c
+++ b/fs/qnx4/file.c
@@ -30,6 +30,7 @@ const struct file_operations qnx4_file_operations =
 	.aio_write	= generic_file_aio_write,
 	.fsync		= simple_fsync,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations qnx4_file_inode_operations =
diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 78f613c..4430239 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -47,6 +47,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index 11f0c06..d645bb3 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -45,6 +45,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read		= generic_file_splice_read,
 	.splice_write		= generic_file_splice_write,
 	.llseek			= generic_file_llseek,
+	.checkpoint		= generic_file_checkpoint,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/read_write.c b/fs/read_write.c
index d331975..d314234 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -27,6 +27,7 @@ const struct file_operations generic_ro_fops = {
 	.aio_read	= generic_file_aio_read,
 	.mmap		= generic_file_readonly_mmap,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 EXPORT_SYMBOL(generic_ro_fops);
diff --git a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c
index 6d2668f..40ce3fa 100644
--- a/fs/reiserfs/dir.c
+++ b/fs/reiserfs/dir.c
@@ -24,6 +24,7 @@ const struct file_operations reiserfs_dir_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = reiserfs_compat_ioctl,
 #endif
+	.checkpoint = generic_file_checkpoint,
 };
 
 static int reiserfs_dir_fsync(struct file *filp, struct dentry *dentry,
diff --git a/fs/reiserfs/file.c b/fs/reiserfs/file.c
index 9f43666..a5dcc02 100644
--- a/fs/reiserfs/file.c
+++ b/fs/reiserfs/file.c
@@ -297,6 +297,7 @@ const struct file_operations reiserfs_file_operations = {
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
 	.llseek = generic_file_llseek,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations reiserfs_file_inode_operations = {
diff --git a/fs/romfs/mmap-nommu.c b/fs/romfs/mmap-nommu.c
index f0511e8..03c24d9 100644
--- a/fs/romfs/mmap-nommu.c
+++ b/fs/romfs/mmap-nommu.c
@@ -72,4 +72,5 @@ const struct file_operations romfs_ro_fops = {
 	.splice_read		= generic_file_splice_read,
 	.mmap			= romfs_mmap,
 	.get_unmapped_area	= romfs_get_unmapped_area,
+	.checkpoint		= generic_file_checkpoint,
 };
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 4ab3c03..7b6e951 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -282,6 +282,7 @@ error:
 static const struct file_operations romfs_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= romfs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct inode_operations romfs_dir_inode_operations = {
diff --git a/fs/squashfs/dir.c b/fs/squashfs/dir.c
index 566b0ea..b0c5336 100644
--- a/fs/squashfs/dir.c
+++ b/fs/squashfs/dir.c
@@ -231,5 +231,6 @@ failed_read:
 
 const struct file_operations squashfs_dir_ops = {
 	.read = generic_read_dir,
-	.readdir = squashfs_readdir
+	.readdir = squashfs_readdir,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c
index 4e50286..53acd29 100644
--- a/fs/sysv/dir.c
+++ b/fs/sysv/dir.c
@@ -25,6 +25,7 @@ const struct file_operations sysv_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= sysv_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static inline void dir_put_page(struct page *page)
diff --git a/fs/sysv/file.c b/fs/sysv/file.c
index 96340c0..aee556d 100644
--- a/fs/sysv/file.c
+++ b/fs/sysv/file.c
@@ -28,6 +28,7 @@ const struct file_operations sysv_file_operations = {
 	.mmap		= generic_file_mmap,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations sysv_file_inode_operations = {
diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c
index ce2cd83..69a8892 100644
--- a/fs/ubifs/debug.c
+++ b/fs/ubifs/debug.c
@@ -2530,6 +2530,7 @@ static ssize_t write_debugfs_file(struct file *file, const char __user *buf,
 static const struct file_operations dfs_fops = {
 	.open = open_debugfs_file,
 	.write = write_debugfs_file,
+	.checkpoint = generic_file_checkpoint,
 	.owner = THIS_MODULE,
 };
 
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 552fb01..89ab2aa 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -1228,4 +1228,5 @@ const struct file_operations ubifs_dir_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ubifs_compat_ioctl,
 #endif
+	.checkpoint     = generic_file_checkpoint,
 };
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 6d34dc7..2d94676 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1595,4 +1595,5 @@ const struct file_operations ubifs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ubifs_compat_ioctl,
 #endif
+	.checkpoint     = generic_file_checkpoint,
 };
diff --git a/fs/udf/dir.c b/fs/udf/dir.c
index 61d9a76..6586dbe 100644
--- a/fs/udf/dir.c
+++ b/fs/udf/dir.c
@@ -211,4 +211,5 @@ const struct file_operations udf_dir_operations = {
 	.readdir		= udf_readdir,
 	.ioctl			= udf_ioctl,
 	.fsync			= simple_fsync,
+	.checkpoint		= generic_file_checkpoint,
 };
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 7464305..33c63a8 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -212,6 +212,7 @@ const struct file_operations udf_file_operations = {
 	.fsync			= simple_fsync,
 	.splice_read		= generic_file_splice_read,
 	.llseek			= generic_file_llseek,
+	.checkpoint		= generic_file_checkpoint,
 };
 
 const struct inode_operations udf_file_inode_operations = {
diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index 6f671f1..9379010 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -668,4 +668,5 @@ const struct file_operations ufs_dir_operations = {
 	.readdir	= ufs_readdir,
 	.fsync		= simple_fsync,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ufs/file.c b/fs/ufs/file.c
index 73655c6..15c8616 100644
--- a/fs/ufs/file.c
+++ b/fs/ufs/file.c
@@ -43,4 +43,5 @@ const struct file_operations ufs_file_operations = {
 	.open           = generic_file_open,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index 0542fd5..2b4bdb6 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -257,6 +257,7 @@ const struct file_operations xfs_file_operations = {
 #ifdef HAVE_FOP_OPEN_EXEC
 	.open_exec	= xfs_file_open_exec,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct file_operations xfs_dir_file_operations = {
@@ -269,6 +270,7 @@ const struct file_operations xfs_dir_file_operations = {
 	.compat_ioctl	= xfs_file_compat_ioctl,
 #endif
 	.fsync		= xfs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct vm_operations_struct xfs_file_vm_ops = {
-- 
1.6.0.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 42/80] c/r: export shmem_getpage() to support shared memory
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (40 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 41/80] Add the checkpoint operation for opened files of generic filesystems Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 43/80] c/r: dump anonymous- and file-mapped- " Oren Laadan
                     ` (39 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Export functionality to retrieve specific pages from shared memory
given an inode in shmem-fs; this will be used in the next two patches
to provide support for c/r of shared memory.

mm/shmem.c:
- shmem_getpage() and 'enum sgp_type' moved to linux/mm.h

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/mm.h |   11 +++++++++++
 mm/shmem.c         |   15 ++-------------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0d0c46b..b565a82 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -327,6 +327,17 @@ void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
 
+/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
+enum sgp_type {
+	SGP_READ,	/* don't exceed i_size, don't allocate page */
+	SGP_CACHE,	/* don't exceed i_size, may allocate page */
+	SGP_DIRTY,	/* like SGP_CACHE, but set new page dirty */
+	SGP_WRITE,	/* may exceed i_size, may allocate page */
+};
+
+extern int shmem_getpage(struct inode *inode, unsigned long idx,
+			 struct page **pagep, enum sgp_type sgp, int *type);
+
 /*
  * Compound pages have a destructor function.  Provide a
  * prototype for that function and accessor functions.
diff --git a/mm/shmem.c b/mm/shmem.c
index d713239..d80532b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -99,14 +99,6 @@ static struct vfsmount *shm_mnt;
 /* Pretend that each entry is of this size in directory's i_size */
 #define BOGO_DIRENT_SIZE 20
 
-/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
-enum sgp_type {
-	SGP_READ,	/* don't exceed i_size, don't allocate page */
-	SGP_CACHE,	/* don't exceed i_size, may allocate page */
-	SGP_DIRTY,	/* like SGP_CACHE, but set new page dirty */
-	SGP_WRITE,	/* may exceed i_size, may allocate page */
-};
-
 #ifdef CONFIG_TMPFS
 static unsigned long shmem_default_max_blocks(void)
 {
@@ -119,9 +111,6 @@ static unsigned long shmem_default_max_inodes(void)
 }
 #endif
 
-static int shmem_getpage(struct inode *inode, unsigned long idx,
-			 struct page **pagep, enum sgp_type sgp, int *type);
-
 static inline struct page *shmem_dir_alloc(gfp_t gfp_mask)
 {
 	/*
@@ -1202,8 +1191,8 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
  * vm. If we swap it in we mark it dirty since we also free the swap
  * entry since a page cannot live in both the swap and page cache
  */
-static int shmem_getpage(struct inode *inode, unsigned long idx,
-			struct page **pagep, enum sgp_type sgp, int *type)
+int shmem_getpage(struct inode *inode, unsigned long idx,
+		  struct page **pagep, enum sgp_type sgp, int *type)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 43/80] c/r: dump anonymous- and file-mapped- shared memory
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (41 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 42/80] c/r: export shmem_getpage() to support shared memory Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 44/80] c/r: restore " Oren Laadan
                     ` (38 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

We now handle anonymous and file-mapped shared memory. Support for IPC
shared memory requires support for IPC first. We extend ckpt_write_vma()
to detect shared memory VMAs and handle them separately from private
memory.

There is not much to do for file-mapped shared memory, except to force
msync() on the region to ensure that the file system is consistent
with the checkpoint image. Use our internal type CKPT_VMA_SHM_FILE.
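
As a userspace illustration of the flush described above (not part of
the patch), here is a minimal sketch assuming a POSIX system. The helper
name write_shared_and_sync() is hypothetical. It writes through a
MAP_SHARED mapping and then calls msync(MS_SYNC), which is the step that
makes the file contents match what a checkpoint taken at that moment
would expect:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `msg` through a MAP_SHARED mapping of
 * `path`, flush it with msync(MS_SYNC), and return 0 on success or
 * -1 on error.
 */
int write_shared_and_sync(const char *path, const char *msg)
{
	size_t len = strlen(msg);
	int fd, ret = -1;
	char *map;

	assert(len > 0);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out_close;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	memcpy(map, msg, len);
	/* Without this, write-back is not guaranteed before munmap(). */
	if (msync(map, len, MS_SYNC) == 0)
		ret = 0;

	munmap(map, len);
out_close:
	close(fd);
	return ret;
}
```

The kernel side has no user mapping of its own to flush, so it works on
the struct file instead; the sketch only shows the userspace analogue.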

Anonymous shared memory is always backed by an inode in the shmem
filesystem. We look up that inode in the objhash and register it if it
is not found (on first encounter). In this case, the type of the VMA is
CKPT_VMA_SHM_ANON, and we dump the contents. On the other hand, if it is
found there, we must have already saved it before, so we change the
type to CKPT_VMA_SHM_ANON_SKIP and skip it.
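
The first-encounter logic can be sketched in userspace as follows. This
is a toy model under stated assumptions, not the objhash code: the
toy_* names are hypothetical, and a small linear array stands in for the
real hash table. The point is only that the lookup returns the same
reference for the same inode and reports whether it was newly added:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_OBJS 16

/* Hypothetical stand-ins for CKPT_VMA_SHM_ANON and ..._SKIP. */
enum toy_vma_type { TOY_SHM_ANON, TOY_SHM_ANON_SKIP };

/* Toy registry: a linear array, not a real hash table. */
struct toy_objhash {
	const void *objs[TOY_MAX_OBJS];
	int count;
};

/*
 * Return the 1-based objref for ptr, registering it when absent.
 * *first is set to 1 on first encounter and 0 otherwise.
 * Returns -1 if the registry is full.
 */
int toy_lookup_add(struct toy_objhash *h, const void *ptr, int *first)
{
	int i;

	assert(h && ptr && first);
	for (i = 0; i < h->count; i++) {
		if (h->objs[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (h->count == TOY_MAX_OBJS)
		return -1;
	h->objs[h->count++] = ptr;
	*first = 1;
	return h->count;
}

/* Dump contents only the first time a backing inode is seen. */
enum toy_vma_type toy_shm_type(int first)
{
	return first ? TOY_SHM_ANON : TOY_SHM_ANON_SKIP;
}
```

Two VMAs that map the same shmem inode therefore get the same reference,
and only the first one carries the page contents.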

To dump the contents of a shmem VMA, we loop through the pages of the
inode in the shmem filesystem, and dump the contents of each dirty
(allocated) page - unallocated pages must be clean.

Note that we save the original size of a shmem VMA because it may have
been re-mapped partially. The format itself remains the same as for
private VMAs, except that instead of addresses we record _indices_
(page numbers) into the backing inode.
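
To make the index-based walk concrete, here is a small userspace sketch.
It assumes a toy inode whose pages are slots in an array (NULL meaning
"never allocated") and a 4 KiB page size; all toy_* names are
hypothetical. The index range is the object size rounded up to whole
pages, and only allocated pages are recorded:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Toy shmem object: a NULL slot models a page that was never allocated. */
struct toy_inode {
	unsigned long size;	/* size in bytes */
	const char **pages;	/* one slot per page index */
};

/* Number of page indices covering `size` bytes (rounds up). */
unsigned long toy_nr_pages(unsigned long size)
{
	return (size + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;
}

/*
 * Record the index of every allocated page in idx_out (at most max
 * entries) and return how many were recorded.
 */
size_t toy_collect_indices(const struct toy_inode *ino,
			   unsigned long *idx_out, size_t max)
{
	unsigned long idx, end = toy_nr_pages(ino->size);
	size_t cnt = 0;

	assert(ino->pages || end == 0);
	for (idx = 0; idx < end && cnt < max; idx++) {
		if (ino->pages[idx])	/* skip holes: nothing to dump */
			idx_out[cnt++] = idx;
	}
	return cnt;
}
```

Because the recorded values are indices into the object rather than
virtual addresses, they stay meaningful no matter where, or how
partially, each task has the segment mapped.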

Changelog[v18]:
  - Mark the backing file as visited at checkpoint

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/memory.c            |  143 +++++++++++++++++++++++++++++++++++----
 checkpoint/objhash.c           |   17 +++++
 include/linux/checkpoint.h     |   15 +++--
 include/linux/checkpoint_hdr.h |    8 ++
 mm/filemap.c                   |   39 +++++++++++-
 mm/mmap.c                      |    2 +-
 mm/shmem.c                     |   35 ++++++++++
 7 files changed, 236 insertions(+), 23 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 6a0dfaf..697896f 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -21,6 +21,7 @@
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
 #include <linux/proc_fs.h>
+#include <linux/swap.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -281,6 +282,54 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
 }
 
 /**
+ * consider_shared_page - return page pointer for dirty pages
+ * @ino - inode of shmem object
+ * @idx - page index in shmem object
+ *
+ * Looks up the page that corresponds to the index in the shmem object,
+ * and returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ */
+static struct page *consider_shared_page(struct inode *ino, unsigned long idx)
+{
+	struct page *page = NULL;
+	int ret;
+
+	/*
+	 * Inspired by do_shmem_file_read(): very simplified version.
+	 *
+	 * FIXME: consolidate with do_shmem_file_read()
+	 */
+
+	ret = shmem_getpage(ino, idx, &page, SGP_READ, NULL);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	/*
+	 * Only care about dirty pages; shmem_getpage() only returns
+	 * pages that have been allocated, so they must be dirty. The
+	 * pages returned are locked and referenced.
+	 */
+
+	if (page) {
+		unlock_page(page);
+		/*
+		 * If users can be writing to this page using arbitrary
+		 * virtual addresses, take care about potential aliasing
+		 * before reading the page on the kernel side.
+		 */
+		if (mapping_writably_mapped(ino->i_mapping))
+			flush_dcache_page(page);
+		/*
+		 * Mark the page accessed if we read the beginning.
+		 */
+		mark_page_accessed(page);
+	}
+
+	return page;
+}
+
+/**
  * vma_fill_pgarr - fill a page-array with addr/page tuples
  * @ctx - checkpoint context
  * @vma - vma to scan
@@ -289,17 +338,16 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
  * Returns the number of pages collected
  */
 static int vma_fill_pgarr(struct ckpt_ctx *ctx,
-			  struct vm_area_struct *vma,
-			  unsigned long *start)
+			  struct vm_area_struct *vma, struct inode *inode,
+			  unsigned long *start, unsigned long end)
 {
-	unsigned long end = vma->vm_end;
 	unsigned long addr = *start;
 	struct ckpt_pgarr *pgarr;
 	int nr_used;
 	int cnt = 0;
 
 	/* this function is only for private memory (anon or file-mapped) */
-	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+	BUG_ON(inode && vma);
 
 	do {
 		pgarr = pgarr_current(ctx);
@@ -311,7 +359,11 @@ static int vma_fill_pgarr(struct ckpt_ctx *ctx,
 		while (addr < end) {
 			struct page *page;
 
-			page = consider_private_page(vma, addr);
+			if (vma)
+				page = consider_private_page(vma, addr);
+			else
+				page = consider_shared_page(inode, addr);
+
 			if (IS_ERR(page))
 				return PTR_ERR(page);
 
@@ -323,7 +375,10 @@ static int vma_fill_pgarr(struct ckpt_ctx *ctx,
 				pgarr->nr_used++;
 			}
 
-			addr += PAGE_SIZE;
+			if (vma)
+				addr += PAGE_SIZE;
+			else
+				addr++;
 
 			if (pgarr_is_full(pgarr))
 				break;
@@ -395,23 +450,32 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
 }
 
 /**
- * checkpoint_memory_contents - dump contents of a VMA with private memory
+ * checkpoint_memory_contents - dump contents of a memory region
  * @ctx - checkpoint context
- * @vma - vma to scan
+ * @vma - vma to scan (--or--)
+ * @inode - inode to scan
  *
  * Collect lists of pages that needs to be dumped, and corresponding
  * virtual addresses into ctx->pgarr_list page-array chain. Then dump
  * the addresses, followed by the page contents.
  */
 static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
-				      struct vm_area_struct *vma)
+				      struct vm_area_struct *vma,
+				      struct inode *inode)
 {
 	struct ckpt_hdr_pgarr *h;
 	unsigned long addr, end;
 	int cnt, ret;
 
-	addr = vma->vm_start;
-	end = vma->vm_end;
+	BUG_ON(vma && inode);
+
+	if (vma) {
+		addr = vma->vm_start;
+		end = vma->vm_end;
+	} else {
+		addr = 0;
+		end = PAGE_ALIGN(i_size_read(inode)) >> PAGE_CACHE_SHIFT;
+	}
 
 	/*
 	 * Work iteratively, collecting and dumping at most CKPT_PGARR_BATCH
@@ -437,7 +501,7 @@ static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
 	 */
 
 	while (addr < end) {
-		cnt = vma_fill_pgarr(ctx, vma, &addr);
+		cnt = vma_fill_pgarr(ctx, vma, inode, &addr, end);
 		if (cnt == 0)
 			break;
 		else if (cnt < 0)
@@ -481,7 +545,7 @@ static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
  * @vma_objref: vma objref
  */
 int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
-			   enum vma_type type, int vma_objref)
+			   enum vma_type type, int vma_objref, int ino_objref)
 {
 	struct ckpt_hdr_vma *h;
 	int ret;
@@ -495,6 +559,13 @@ int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
 
 	h->vma_type = type;
 	h->vma_objref = vma_objref;
+	h->ino_objref = ino_objref;
+
+	if (vma->vm_file)
+		h->ino_size = i_size_read(vma->vm_file->f_dentry->d_inode);
+	else
+		h->ino_size = 0;
+
 	h->vm_start = vma->vm_start;
 	h->vm_end = vma->vm_end;
 	h->vm_page_prot = pgprot_val(vma->vm_page_prot);
@@ -522,10 +593,37 @@ int private_vma_checkpoint(struct ckpt_ctx *ctx,
 
 	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
 
-	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref);
+	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref, 0);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, vma, NULL);
+ out:
+	return ret;
+}
+
+/**
+ * shmem_vma_checkpoint - dump contents of shared memory (anon) vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @ino_objref: inode object id
+ */
+int shmem_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
+			 enum vma_type type, int ino_objref)
+{
+	struct file *file = vma->vm_file;
+	int ret;
+
+	ckpt_debug("type %d, ino_ref %d\n", type, ino_objref);
+	BUG_ON(!(vma->vm_flags & (VM_SHARED | VM_MAYSHARE)));
+	BUG_ON(!file);
+
+	ret = generic_vma_checkpoint(ctx, vma, type, 0, ino_objref);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_memory_contents(ctx, vma);
+	if (type == CKPT_VMA_SHM_ANON_SKIP)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, NULL, file->f_dentry->d_inode);
  out:
 	return ret;
 }
@@ -991,6 +1089,21 @@ static struct restore_vma_ops restore_vma_ops[] = {
 		.vma_type = CKPT_VMA_FILE,
 		.restore = filemap_restore,
 	},
+	/* anonymous shared */
+	{
+		.vma_name = "ANON SHARED",
+		.vma_type = CKPT_VMA_SHM_ANON,
+	},
+	/* anonymous shared (skipped) */
+	{
+		.vma_name = "ANON SHARED (skip)",
+		.vma_type = CKPT_VMA_SHM_ANON_SKIP,
+	},
+	/* file-mapped shared */
+	{
+		.vma_name = "FILE SHARED",
+		.vma_type = CKPT_VMA_SHM_FILE,
+	},
 };
 
 /**
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 734be3a..be9411f 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -64,6 +64,16 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_inode_grab(void *ptr)
+{
+	return igrab((struct inode *) ptr) ? 0 : -EBADF;
+}
+
+static void obj_inode_drop(void *ptr, int lastref)
+{
+	iput((struct inode *) ptr);
+}
+
 static int obj_file_table_grab(void *ptr)
 {
 	atomic_inc(&((struct files_struct *) ptr)->count);
@@ -120,6 +130,13 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* inode object */
+	{
+		.obj_name = "INODE",
+		.obj_type = CKPT_OBJ_INODE,
+		.ref_drop = obj_inode_drop,
+		.ref_grab = obj_inode_grab,
+	},
 	/* files_struct object */
 	{
 		.obj_name = "FILE_TABLE",
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index cb13e8f..5a130d7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -170,11 +170,15 @@ extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
 extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma,
 				  enum vma_type type,
-				  int vma_objref);
+				  int vma_objref, int ino_objref);
 extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma,
 				  enum vma_type type,
 				  int vma_objref);
+extern int shmem_vma_checkpoint(struct ckpt_ctx *ctx,
+				struct vm_area_struct *vma,
+				enum vma_type type,
+				int ino_objref);
 
 extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref);
@@ -187,11 +191,10 @@ extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
 
 
-#define CKPT_VMA_NOT_SUPPORTED					\
-	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
-	 VM_NONLINEAR | VM_PFNMAP | VM_RESERVED | VM_NORESERVE	\
-	 | VM_HUGETLB | VM_NONLINEAR | VM_MAPPED_COPY |		\
-	 VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
+#define CKPT_VMA_NOT_SUPPORTED						\
+	(VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP |		\
+	 VM_RESERVED | VM_NORESERVE | VM_HUGETLB | VM_NONLINEAR |	\
+	 VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
 
 static inline int ckpt_validate_errno(int errno)
 {
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 097d2e2..23095f2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -91,6 +91,7 @@ struct ckpt_hdr_objref {
 /* shared objects types */
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
+	CKPT_OBJ_INODE,
 	CKPT_OBJ_FILE_TABLE,
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
@@ -176,6 +177,7 @@ struct ckpt_hdr_task {
 /* task's shared resources */
 struct ckpt_hdr_task_objs {
 	struct ckpt_hdr h;
+
 	__s32 files_objref;
 	__s32 mm_objref;
 } __attribute__((aligned(8)));
@@ -256,6 +258,9 @@ enum vma_type {
 	CKPT_VMA_VDSO,		/* special vdso vma */
 	CKPT_VMA_ANON,		/* private anonymous */
 	CKPT_VMA_FILE,		/* private mapped file */
+	CKPT_VMA_SHM_ANON,	/* shared anonymous */
+	CKPT_VMA_SHM_ANON_SKIP,	/* shared anonymous (skip contents) */
+	CKPT_VMA_SHM_FILE,	/* shared mapped file, only msync */
 	CKPT_VMA_MAX
 };
 
@@ -264,6 +269,9 @@ struct ckpt_hdr_vma {
 	struct ckpt_hdr h;
 	__u32 vma_type;
 	__s32 vma_objref;	/* objref of backing file */
+	__s32 ino_objref;	/* objref of shared segment */
+	__u32 _padding;
+	__u64 ino_size;		/* size of shared segment */
 
 	__u64 vm_start;
 	__u64 vm_end;
diff --git a/mm/filemap.c b/mm/filemap.c
index ef7205b..055f126 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1654,6 +1654,8 @@ int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 {
 	struct file *file = vma->vm_file;
 	int vma_objref;
+	int ino_objref;
+	int first, ret;
 
 	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
 		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
@@ -1666,7 +1668,42 @@ int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 	if (vma_objref < 0)
 		return vma_objref;
 
-	return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
+	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+		/*
+		 * Citing mmap(2): "Updates to the mapping are visible
+		 * to other processes that map this file, and are
+		 * carried through to the underlying file. The file
+		 * may not actually be updated until msync(2) or
+		 * munmap(2) is called"
+		 *
+		 * Citing msync(2): "Without use of this call there is
+		 * no guarantee that changes are written back before
+		 * munmap(2) is called."
+		 *
+		 * Force msync for region of shared mapped files, to
+		 * ensure that the file system is consistent with
+		 * the checkpoint image.  (inspired by sys_msync).
+		 */
+
+		ino_objref = ckpt_obj_lookup_add(ctx, file->f_dentry->d_inode,
+					       CKPT_OBJ_INODE, &first);
+		if (ino_objref < 0)
+			return ino_objref;
+
+		if (first) {
+			ret = vfs_fsync(file, file->f_path.dentry, 0);
+			if (ret < 0)
+				return ret;
+		}
+
+		ret = generic_vma_checkpoint(ctx, vma, CKPT_VMA_SHM_FILE,
+					     vma_objref, ino_objref);
+	} else {
+		ret = private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE,
+					     vma_objref);
+	}
+
+	return ret;
 }
 
 int filemap_restore(struct ckpt_ctx *ctx,
diff --git a/mm/mmap.c b/mm/mmap.c
index 3be4678..aa262eb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2296,7 +2296,7 @@ static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 	if (!name || strcmp(name, "[vdso]"))
 		return -ENOSYS;
 
-	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0, 0);
 }
 
 int special_mapping_restore(struct ckpt_ctx *ctx,
diff --git a/mm/shmem.c b/mm/shmem.c
index d80532b..3e50bd1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -30,6 +30,7 @@
 #include <linux/module.h>
 #include <linux/swap.h>
 #include <linux/ima.h>
+#include <linux/checkpoint.h>
 
 static struct vfsmount *shm_mnt;
 
@@ -2381,6 +2382,37 @@ static void shmem_destroy_inode(struct inode *inode)
 	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	enum vma_type vma_type;
+	int ino_objref;
+	int first, ret;
+
+	/* should be shared anonymous ... verify that this is the case */
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ENOSYS;
+	}
+
+	BUG_ON(!vma->vm_file);
+
+	/* we collected the file but we don't checkpoint it per-se */
+	ret = ckpt_obj_visit(ctx, vma->vm_file, CKPT_OBJ_FILE);
+	if (ret < 0)
+		return ret;
+
+	ino_objref = ckpt_obj_lookup_add(ctx, vma->vm_file->f_dentry->d_inode,
+					 CKPT_OBJ_INODE, &first);
+	if (ino_objref < 0)
+		return ino_objref;
+
+	vma_type = (first ? CKPT_VMA_SHM_ANON : CKPT_VMA_SHM_ANON_SKIP);
+
+	return shmem_vma_checkpoint(ctx, vma, vma_type, ino_objref);
+}
+#endif /* CONFIG_CHECKPOINT */
+
 static void init_once(void *foo)
 {
 	struct shmem_inode_info *p = (struct shmem_inode_info *) foo;
@@ -2492,6 +2524,9 @@ static struct vm_operations_struct shmem_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= shmem_checkpoint,
+#endif
 };
 
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 43/80] c/r: dump anonymous- and file-mapped- shared memory
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

We now handle anonymous and file-mapped shared memory. Support for IPC
shared memory requires support for IPC first. We extend ckpt_write_vma()
to detect shared memory VMAs and handle them separately from private
memory.

There is not much to do for file-mapped shared memory, except to force
msync() on the region to ensure that the file system is consistent
with the checkpoint image. Use our internal type CKPT_VMA_SHM_FILE.
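
For illustration only, the user-space analogue of this write-back step
looks roughly as follows. This is a minimal sketch, not code from the
patch: the helper name write_shared_and_sync() is hypothetical, and the
kernel side uses vfs_fsync() rather than msync(2).

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `msg` through a MAP_SHARED mapping of
 * `path`, then force it out to the file with msync(MS_SYNC).
 * Returns 0 on success, -1 on error.
 */
int write_shared_and_sync(const char *path, const char *msg)
{
	size_t len = strlen(msg);
	int fd, ret = -1;
	char *map;

	if (len == 0)
		return -1;	/* mmap() rejects zero-length mappings */

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out_close;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	memcpy(map, msg, len);
	/* without this call, write-back timing is left to the kernel */
	ret = msync(map, len, MS_SYNC);
	munmap(map, len);
out_close:
	close(fd);
	return ret;
}
```

Once msync() returns successfully, the file contents match what was
written through the mapping, which is the consistency the checkpoint
image relies on.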

Anonymous shared memory is always backed by an inode in the shmem
filesystem. We look up that inode in the objhash and register it if
not found (on first encounter). In this case, the type of the VMA is
CKPT_VMA_SHM_ANON, and we dump the contents. On the other hand, if it is
found there, we must have already saved it before, so we change the
type to CKPT_VMA_SHM_ANON_SKIP and skip it.
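
The first-encounter logic can be sketched in plain C as below. All
demo_* names are hypothetical stand-ins; the real objhash is a hash
table that also takes references on the objects it holds, which this
flat array does not model.

```c
#include <stddef.h>

#define DEMO_MAX_OBJS 16

/* Hypothetical stand-in for the objhash: a flat table of seen pointers. */
struct demo_objhash {
	const void *objs[DEMO_MAX_OBJS];
	int count;
};

/*
 * Return a 1-based reference for `ptr`, adding it if it is new.
 * *first is set to 1 on first encounter and to 0 otherwise.
 * Returns -1 if the table is full.
 */
int demo_lookup_add(struct demo_objhash *h, const void *ptr, int *first)
{
	int i;

	for (i = 0; i < h->count; i++) {
		if (h->objs[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (h->count == DEMO_MAX_OBJS)
		return -1;
	h->objs[h->count++] = ptr;
	*first = 1;
	return h->count;
}

enum demo_vma_type { DEMO_SHM_ANON, DEMO_SHM_ANON_SKIP };

/* Only the first VMA that maps a given inode dumps its contents. */
enum demo_vma_type demo_classify(int first)
{
	return first ? DEMO_SHM_ANON : DEMO_SHM_ANON_SKIP;
}
```

Two VMAs that map the same inode get the same reference, and only the
first one is classified as carrying contents.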

To dump the contents of a shmem VMA, we loop through the pages of the
inode in the shmem filesystem, and dump the contents of each dirty
(allocated) page - unallocated pages must be clean.
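
As a rough model of that loop (a sketch only, with a hypothetical
demo_collect_allocated() helper; a NULL slot stands in for a page that
shmem never allocated):

```c
#include <stddef.h>

/*
 * Scan `nr` page slots and store the indices of the allocated ones in
 * `out`, which has room for `cap` entries. A NULL slot models an
 * unallocated (hence clean) page and is skipped.
 * Returns how many indices were stored.
 */
size_t demo_collect_allocated(void *const *pages, size_t nr,
			      size_t *out, size_t cap)
{
	size_t idx, cnt = 0;

	for (idx = 0; idx < nr && cnt < cap; idx++) {
		if (pages[idx])
			out[cnt++] = idx;
	}
	return cnt;
}
```

The patch does the equivalent in batches: it fills a page-array until
it is full, dumps it, and resumes from the index where it stopped.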

Note that we save the original size of a shmem VMA because it may have
been re-mapped partially. The format itself remains the same as for
private VMAs, except that instead of addresses we record _indices_
(page nr) into the backing inode.
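
The index range follows from the inode size; the patch computes it as
PAGE_ALIGN(i_size_read(inode)) >> PAGE_CACHE_SHIFT. A minimal sketch of
that arithmetic, assuming 4 KiB pages and using hypothetical demo_*
names:

```c
#define DEMO_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

/* Number of page indices needed to cover `size` bytes (rounds up). */
unsigned long demo_nr_pages(unsigned long size)
{
	return (size + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
}

/* Byte offset into the backing object of page index `idx`. */
unsigned long demo_index_to_offset(unsigned long idx)
{
	return idx << DEMO_PAGE_SHIFT;
}
```

Recording indices rather than virtual addresses keeps the image valid
for every VMA that maps the segment, whatever address or sub-range
each one uses.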

Changelog[v18]:
  - Mark the backing file as visited at checkpoint

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/memory.c            |  143 +++++++++++++++++++++++++++++++++++----
 checkpoint/objhash.c           |   17 +++++
 include/linux/checkpoint.h     |   15 +++--
 include/linux/checkpoint_hdr.h |    8 ++
 mm/filemap.c                   |   39 +++++++++++-
 mm/mmap.c                      |    2 +-
 mm/shmem.c                     |   35 ++++++++++
 7 files changed, 236 insertions(+), 23 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 6a0dfaf..697896f 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -21,6 +21,7 @@
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
 #include <linux/proc_fs.h>
+#include <linux/swap.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -281,6 +282,54 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
 }
 
 /**
+ * consider_shared_page - return page pointer for dirty pages
+ * @ino - inode of shmem object
+ * @idx - page index in shmem object
+ *
+ * Looks up the page that corresponds to the index in the shmem object,
+ * and returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ */
+static struct page *consider_shared_page(struct inode *ino, unsigned long idx)
+{
+	struct page *page = NULL;
+	int ret;
+
+	/*
+	 * Inspired by do_shmem_file_read(): very simplified version.
+	 *
+	 * FIXME: consolidate with do_shmem_file_read()
+	 */
+
+	ret = shmem_getpage(ino, idx, &page, SGP_READ, NULL);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	/*
+	 * Only care about dirty pages; shmem_getpage() only returns
+	 * pages that have been allocated, so they must be dirty. The
+	 * pages returned are locked and referenced.
+	 */
+
+	if (page) {
+		unlock_page(page);
+		/*
+		 * If users can be writing to this page using arbitrary
+		 * virtual addresses, take care about potential aliasing
+		 * before reading the page on the kernel side.
+		 */
+		if (mapping_writably_mapped(ino->i_mapping))
+			flush_dcache_page(page);
+		/*
+		 * Mark the page accessed if we read the beginning.
+		 */
+		mark_page_accessed(page);
+	}
+
+	return page;
+}
+
+/**
  * vma_fill_pgarr - fill a page-array with addr/page tuples
  * @ctx - checkpoint context
  * @vma - vma to scan
@@ -289,17 +338,16 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
  * Returns the number of pages collected
  */
 static int vma_fill_pgarr(struct ckpt_ctx *ctx,
-			  struct vm_area_struct *vma,
-			  unsigned long *start)
+			  struct vm_area_struct *vma, struct inode *inode,
+			  unsigned long *start, unsigned long end)
 {
-	unsigned long end = vma->vm_end;
 	unsigned long addr = *start;
 	struct ckpt_pgarr *pgarr;
 	int nr_used;
 	int cnt = 0;
 
 	/* this function is only for private memory (anon or file-mapped) */
-	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+	BUG_ON(inode && vma);
 
 	do {
 		pgarr = pgarr_current(ctx);
@@ -311,7 +359,11 @@ static int vma_fill_pgarr(struct ckpt_ctx *ctx,
 		while (addr < end) {
 			struct page *page;
 
-			page = consider_private_page(vma, addr);
+			if (vma)
+				page = consider_private_page(vma, addr);
+			else
+				page = consider_shared_page(inode, addr);
+
 			if (IS_ERR(page))
 				return PTR_ERR(page);
 
@@ -323,7 +375,10 @@ static int vma_fill_pgarr(struct ckpt_ctx *ctx,
 				pgarr->nr_used++;
 			}
 
-			addr += PAGE_SIZE;
+			if (vma)
+				addr += PAGE_SIZE;
+			else
+				addr++;
 
 			if (pgarr_is_full(pgarr))
 				break;
@@ -395,23 +450,32 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
 }
 
 /**
- * checkpoint_memory_contents - dump contents of a VMA with private memory
+ * checkpoint_memory_contents - dump contents of a memory region
  * @ctx - checkpoint context
- * @vma - vma to scan
+ * @vma - vma to scan (--or--)
+ * @inode - inode to scan
  *
  * Collect lists of pages that needs to be dumped, and corresponding
  * virtual addresses into ctx->pgarr_list page-array chain. Then dump
  * the addresses, followed by the page contents.
  */
 static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
-				      struct vm_area_struct *vma)
+				      struct vm_area_struct *vma,
+				      struct inode *inode)
 {
 	struct ckpt_hdr_pgarr *h;
 	unsigned long addr, end;
 	int cnt, ret;
 
-	addr = vma->vm_start;
-	end = vma->vm_end;
+	BUG_ON(vma && inode);
+
+	if (vma) {
+		addr = vma->vm_start;
+		end = vma->vm_end;
+	} else {
+		addr = 0;
+		end = PAGE_ALIGN(i_size_read(inode)) >> PAGE_CACHE_SHIFT;
+	}
 
 	/*
 	 * Work iteratively, collecting and dumping at most CKPT_PGARR_BATCH
@@ -437,7 +501,7 @@ static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
 	 */
 
 	while (addr < end) {
-		cnt = vma_fill_pgarr(ctx, vma, &addr);
+		cnt = vma_fill_pgarr(ctx, vma, inode, &addr, end);
 		if (cnt == 0)
 			break;
 		else if (cnt < 0)
@@ -481,7 +545,7 @@ static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
  * @vma_objref: vma objref
  */
 int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
-			   enum vma_type type, int vma_objref)
+			   enum vma_type type, int vma_objref, int ino_objref)
 {
 	struct ckpt_hdr_vma *h;
 	int ret;
@@ -495,6 +559,13 @@ int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
 
 	h->vma_type = type;
 	h->vma_objref = vma_objref;
+	h->ino_objref = ino_objref;
+
+	if (vma->vm_file)
+		h->ino_size = i_size_read(vma->vm_file->f_dentry->d_inode);
+	else
+		h->ino_size = 0;
+
 	h->vm_start = vma->vm_start;
 	h->vm_end = vma->vm_end;
 	h->vm_page_prot = pgprot_val(vma->vm_page_prot);
@@ -522,10 +593,37 @@ int private_vma_checkpoint(struct ckpt_ctx *ctx,
 
 	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
 
-	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref);
+	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref, 0);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, vma, NULL);
+ out:
+	return ret;
+}
+
+/**
+ * shmem_vma_checkpoint - dump contents of shared anonymous (shmem) vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @ino_objref: objref of the backing inode
+ */
+int shmem_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
+			 enum vma_type type, int ino_objref)
+{
+	struct file *file = vma->vm_file;
+	int ret;
+
+	ckpt_debug("type %d, ino_ref %d\n", type, ino_objref);
+	BUG_ON(!(vma->vm_flags & (VM_SHARED | VM_MAYSHARE)));
+	BUG_ON(!file);
+
+	ret = generic_vma_checkpoint(ctx, vma, type, 0, ino_objref);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_memory_contents(ctx, vma);
+	if (type == CKPT_VMA_SHM_ANON_SKIP)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, NULL, file->f_dentry->d_inode);
  out:
 	return ret;
 }
@@ -991,6 +1089,21 @@ static struct restore_vma_ops restore_vma_ops[] = {
 		.vma_type = CKPT_VMA_FILE,
 		.restore = filemap_restore,
 	},
+	/* anonymous shared */
+	{
+		.vma_name = "ANON SHARED",
+		.vma_type = CKPT_VMA_SHM_ANON,
+	},
+	/* anonymous shared (skipped) */
+	{
+		.vma_name = "ANON SHARED (skip)",
+		.vma_type = CKPT_VMA_SHM_ANON_SKIP,
+	},
+	/* file-mapped shared */
+	{
+		.vma_name = "FILE SHARED",
+		.vma_type = CKPT_VMA_SHM_FILE,
+	},
 };
 
 /**
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 734be3a..be9411f 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -64,6 +64,16 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_inode_grab(void *ptr)
+{
+	return igrab((struct inode *) ptr) ? 0 : -EBADF;
+}
+
+static void obj_inode_drop(void *ptr, int lastref)
+{
+	iput((struct inode *) ptr);
+}
+
 static int obj_file_table_grab(void *ptr)
 {
 	atomic_inc(&((struct files_struct *) ptr)->count);
@@ -120,6 +130,13 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* inode object */
+	{
+		.obj_name = "INODE",
+		.obj_type = CKPT_OBJ_INODE,
+		.ref_drop = obj_inode_drop,
+		.ref_grab = obj_inode_grab,
+	},
 	/* files_struct object */
 	{
 		.obj_name = "FILE_TABLE",
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index cb13e8f..5a130d7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -170,11 +170,15 @@ extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
 extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma,
 				  enum vma_type type,
-				  int vma_objref);
+				  int vma_objref, int ino_objref);
 extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma,
 				  enum vma_type type,
 				  int vma_objref);
+extern int shmem_vma_checkpoint(struct ckpt_ctx *ctx,
+				struct vm_area_struct *vma,
+				enum vma_type type,
+				int ino_objref);
 
 extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref);
@@ -187,11 +191,10 @@ extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
 
 
-#define CKPT_VMA_NOT_SUPPORTED					\
-	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
-	 VM_NONLINEAR | VM_PFNMAP | VM_RESERVED | VM_NORESERVE	\
-	 | VM_HUGETLB | VM_NONLINEAR | VM_MAPPED_COPY |		\
-	 VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
+#define CKPT_VMA_NOT_SUPPORTED						\
+	(VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP |		\
+	 VM_RESERVED | VM_NORESERVE | VM_HUGETLB | VM_NONLINEAR |	\
+	 VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
 
 static inline int ckpt_validate_errno(int errno)
 {
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 097d2e2..23095f2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -91,6 +91,7 @@ struct ckpt_hdr_objref {
 /* shared objects types */
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
+	CKPT_OBJ_INODE,
 	CKPT_OBJ_FILE_TABLE,
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
@@ -176,6 +177,7 @@ struct ckpt_hdr_task {
 /* task's shared resources */
 struct ckpt_hdr_task_objs {
 	struct ckpt_hdr h;
+
 	__s32 files_objref;
 	__s32 mm_objref;
 } __attribute__((aligned(8)));
@@ -256,6 +258,9 @@ enum vma_type {
 	CKPT_VMA_VDSO,		/* special vdso vma */
 	CKPT_VMA_ANON,		/* private anonymous */
 	CKPT_VMA_FILE,		/* private mapped file */
+	CKPT_VMA_SHM_ANON,	/* shared anonymous */
+	CKPT_VMA_SHM_ANON_SKIP,	/* shared anonymous (skip contents) */
+	CKPT_VMA_SHM_FILE,	/* shared mapped file, only msync */
 	CKPT_VMA_MAX
 };
 
@@ -264,6 +269,9 @@ struct ckpt_hdr_vma {
 	struct ckpt_hdr h;
 	__u32 vma_type;
 	__s32 vma_objref;	/* objref of backing file */
+	__s32 ino_objref;	/* objref of shared segment */
+	__u32 _padding;
+	__u64 ino_size;		/* size of shared segment */
 
 	__u64 vm_start;
 	__u64 vm_end;
diff --git a/mm/filemap.c b/mm/filemap.c
index ef7205b..055f126 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1654,6 +1654,8 @@ int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 {
 	struct file *file = vma->vm_file;
 	int vma_objref;
+	int ino_objref;
+	int first, ret;
 
 	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
 		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
@@ -1666,7 +1668,42 @@ int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 	if (vma_objref < 0)
 		return vma_objref;
 
-	return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
+	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+		/*
+		 * Citing mmap(2): "Updates to the mapping are visible
+		 * to other processes that map this file, and are
+		 * carried through to the underlying file. The file
+		 * may not actually be updated until msync(2) or
+		 * munmap(2) is called"
+		 *
+		 * Citing msync(2): "Without use of this call there is
+		 * no guarantee that changes are written back before
+		 * munmap(2) is called."
+		 *
+		 * Force msync for region of shared mapped files, to
+		 * ensure that the file system is consistent with
+		 * the checkpoint image.  (inspired by sys_msync).
+		 */
+
+		ino_objref = ckpt_obj_lookup_add(ctx, file->f_dentry->d_inode,
+					       CKPT_OBJ_INODE, &first);
+		if (ino_objref < 0)
+			return ino_objref;
+
+		if (first) {
+			ret = vfs_fsync(file, file->f_path.dentry, 0);
+			if (ret < 0)
+				return ret;
+		}
+
+		ret = generic_vma_checkpoint(ctx, vma, CKPT_VMA_SHM_FILE,
+					     vma_objref, ino_objref);
+	} else {
+		ret = private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE,
+					     vma_objref);
+	}
+
+	return ret;
 }
 
 int filemap_restore(struct ckpt_ctx *ctx,
diff --git a/mm/mmap.c b/mm/mmap.c
index 3be4678..aa262eb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2296,7 +2296,7 @@ static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 	if (!name || strcmp(name, "[vdso]"))
 		return -ENOSYS;
 
-	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0, 0);
 }
 
 int special_mapping_restore(struct ckpt_ctx *ctx,
diff --git a/mm/shmem.c b/mm/shmem.c
index d80532b..3e50bd1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -30,6 +30,7 @@
 #include <linux/module.h>
 #include <linux/swap.h>
 #include <linux/ima.h>
+#include <linux/checkpoint.h>
 
 static struct vfsmount *shm_mnt;
 
@@ -2381,6 +2382,37 @@ static void shmem_destroy_inode(struct inode *inode)
 	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	enum vma_type vma_type;
+	int ino_objref;
+	int first, ret;
+
+	/* should be shared anonymous ... verify that this is the case */
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ENOSYS;
+	}
+
+	BUG_ON(!vma->vm_file);
+
+	/* we collected the file but we don't checkpoint it per-se */
+	ret = ckpt_obj_visit(ctx, vma->vm_file, CKPT_OBJ_FILE);
+	if (ret < 0)
+		return ret;
+
+	ino_objref = ckpt_obj_lookup_add(ctx, vma->vm_file->f_dentry->d_inode,
+					 CKPT_OBJ_INODE, &first);
+	if (ino_objref < 0)
+		return ino_objref;
+
+	vma_type = (first ? CKPT_VMA_SHM_ANON : CKPT_VMA_SHM_ANON_SKIP);
+
+	return shmem_vma_checkpoint(ctx, vma, vma_type, ino_objref);
+}
+#endif /* CONFIG_CHECKPOINT */
+
 static void init_once(void *foo)
 {
 	struct shmem_inode_info *p = (struct shmem_inode_info *) foo;
@@ -2492,6 +2524,9 @@ static struct vm_operations_struct shmem_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= shmem_checkpoint,
+#endif
 };
 
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 43/80] c/r: dump anonymous- and file-mapped- shared memory
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

We now handle anonymous and file-mapped shared memory. Support for IPC
shared memory requires support for IPC first. We extend ckpt_write_vma()
to detect shared memory VMAs and handle them separately from private
memory.

There is not much to do for file-mapped shared memory, except to force
msync() on the region to ensure that the file system is consistent
with the checkpoint image. Use our internal type CKPT_VMA_SHM_FILE.

Anonymous shared memory is always backed by an inode in the shmem
filesystem. We look up that inode in the objhash and register it if
not found (on first encounter). In this case, the type of the VMA is
CKPT_VMA_SHM_ANON, and we dump the contents. On the other hand, if it is
found there, we must have already saved it before, so we change the
type to CKPT_VMA_SHM_ANON_SKIP and skip it.

To dump the contents of a shmem VMA, we loop through the pages of the
inode in the shmem filesystem, and dump the contents of each dirty
(allocated) page - unallocated pages must be clean.

Note that we save the original size of a shmem VMA because it may have
been re-mapped partially. The format itself remains the same as for
private VMAs, except that instead of addresses we record _indices_
(page nr) into the backing inode.

Changelog[v18]:
  - Mark the backing file as visited at checkpoint

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/memory.c            |  143 +++++++++++++++++++++++++++++++++++----
 checkpoint/objhash.c           |   17 +++++
 include/linux/checkpoint.h     |   15 +++--
 include/linux/checkpoint_hdr.h |    8 ++
 mm/filemap.c                   |   39 +++++++++++-
 mm/mmap.c                      |    2 +-
 mm/shmem.c                     |   35 ++++++++++
 7 files changed, 236 insertions(+), 23 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 6a0dfaf..697896f 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -21,6 +21,7 @@
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
 #include <linux/proc_fs.h>
+#include <linux/swap.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -281,6 +282,54 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
 }
 
 /**
+ * consider_shared_page - return page pointer for dirty pages
+ * @ino - inode of shmem object
+ * @idx - page index in shmem object
+ *
+ * Looks up the page that corresponds to the index in the shmem object,
+ * and returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ */
+static struct page *consider_shared_page(struct inode *ino, unsigned long idx)
+{
+	struct page *page = NULL;
+	int ret;
+
+	/*
+	 * Inspired by do_shmem_file_read(): very simplified version.
+	 *
+	 * FIXME: consolidate with do_shmem_file_read()
+	 */
+
+	ret = shmem_getpage(ino, idx, &page, SGP_READ, NULL);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	/*
+	 * Only care about dirty pages; shmem_getpage() only returns
+	 * pages that have been allocated, so they must be dirty. The
+	 * pages returned are locked and referenced.
+	 */
+
+	if (page) {
+		unlock_page(page);
+		/*
+		 * If users can be writing to this page using arbitrary
+		 * virtual addresses, take care about potential aliasing
+		 * before reading the page on the kernel side.
+		 */
+		if (mapping_writably_mapped(ino->i_mapping))
+			flush_dcache_page(page);
+		/*
+		 * Mark the page accessed if we read the beginning.
+		 */
+		mark_page_accessed(page);
+	}
+
+	return page;
+}
+
+/**
  * vma_fill_pgarr - fill a page-array with addr/page tuples
  * @ctx - checkpoint context
  * @vma - vma to scan
@@ -289,17 +338,16 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
  * Returns the number of pages collected
  */
 static int vma_fill_pgarr(struct ckpt_ctx *ctx,
-			  struct vm_area_struct *vma,
-			  unsigned long *start)
+			  struct vm_area_struct *vma, struct inode *inode,
+			  unsigned long *start, unsigned long end)
 {
-	unsigned long end = vma->vm_end;
 	unsigned long addr = *start;
 	struct ckpt_pgarr *pgarr;
 	int nr_used;
 	int cnt = 0;
 
 	/* this function is only for private memory (anon or file-mapped) */
-	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+	BUG_ON(inode && vma);
 
 	do {
 		pgarr = pgarr_current(ctx);
@@ -311,7 +359,11 @@ static int vma_fill_pgarr(struct ckpt_ctx *ctx,
 		while (addr < end) {
 			struct page *page;
 
-			page = consider_private_page(vma, addr);
+			if (vma)
+				page = consider_private_page(vma, addr);
+			else
+				page = consider_shared_page(inode, addr);
+
 			if (IS_ERR(page))
 				return PTR_ERR(page);
 
@@ -323,7 +375,10 @@ static int vma_fill_pgarr(struct ckpt_ctx *ctx,
 				pgarr->nr_used++;
 			}
 
-			addr += PAGE_SIZE;
+			if (vma)
+				addr += PAGE_SIZE;
+			else
+				addr++;
 
 			if (pgarr_is_full(pgarr))
 				break;
@@ -395,23 +450,32 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
 }
 
 /**
- * checkpoint_memory_contents - dump contents of a VMA with private memory
+ * checkpoint_memory_contents - dump contents of a memory region
  * @ctx - checkpoint context
- * @vma - vma to scan
+ * @vma - vma to scan (--or--)
+ * @inode - inode to scan
  *
  * Collect lists of pages that needs to be dumped, and corresponding
  * virtual addresses into ctx->pgarr_list page-array chain. Then dump
  * the addresses, followed by the page contents.
  */
 static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
-				      struct vm_area_struct *vma)
+				      struct vm_area_struct *vma,
+				      struct inode *inode)
 {
 	struct ckpt_hdr_pgarr *h;
 	unsigned long addr, end;
 	int cnt, ret;
 
-	addr = vma->vm_start;
-	end = vma->vm_end;
+	BUG_ON(vma && inode);
+
+	if (vma) {
+		addr = vma->vm_start;
+		end = vma->vm_end;
+	} else {
+		addr = 0;
+		end = PAGE_ALIGN(i_size_read(inode)) >> PAGE_CACHE_SHIFT;
+	}
 
 	/*
 	 * Work iteratively, collecting and dumping at most CKPT_PGARR_BATCH
@@ -437,7 +501,7 @@ static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
 	 */
 
 	while (addr < end) {
-		cnt = vma_fill_pgarr(ctx, vma, &addr);
+		cnt = vma_fill_pgarr(ctx, vma, inode, &addr, end);
 		if (cnt == 0)
 			break;
 		else if (cnt < 0)
@@ -481,7 +545,7 @@ static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
  * @vma_objref: vma objref
  */
 int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
-			   enum vma_type type, int vma_objref)
+			   enum vma_type type, int vma_objref, int ino_objref)
 {
 	struct ckpt_hdr_vma *h;
 	int ret;
@@ -495,6 +559,13 @@ int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
 
 	h->vma_type = type;
 	h->vma_objref = vma_objref;
+	h->ino_objref = ino_objref;
+
+	if (vma->vm_file)
+		h->ino_size = i_size_read(vma->vm_file->f_dentry->d_inode);
+	else
+		h->ino_size = 0;
+
 	h->vm_start = vma->vm_start;
 	h->vm_end = vma->vm_end;
 	h->vm_page_prot = pgprot_val(vma->vm_page_prot);
@@ -522,10 +593,37 @@ int private_vma_checkpoint(struct ckpt_ctx *ctx,
 
 	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
 
-	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref);
+	ret = generic_vma_checkpoint(ctx, vma, type, vma_objref, 0);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, vma, NULL);
+ out:
+	return ret;
+}
+
+/**
+ * shmem_vma_checkpoint - dump contents of shared anonymous (shmem) vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @ino_objref: objref of the backing inode
+ */
+int shmem_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
+			 enum vma_type type, int ino_objref)
+{
+	struct file *file = vma->vm_file;
+	int ret;
+
+	ckpt_debug("type %d, ino_ref %d\n", type, ino_objref);
+	BUG_ON(!(vma->vm_flags & (VM_SHARED | VM_MAYSHARE)));
+	BUG_ON(!file);
+
+	ret = generic_vma_checkpoint(ctx, vma, type, 0, ino_objref);
 	if (ret < 0)
 		goto out;
-	ret = checkpoint_memory_contents(ctx, vma);
+	if (type == CKPT_VMA_SHM_ANON_SKIP)
+		goto out;
+	ret = checkpoint_memory_contents(ctx, NULL, file->f_dentry->d_inode);
  out:
 	return ret;
 }
@@ -991,6 +1089,21 @@ static struct restore_vma_ops restore_vma_ops[] = {
 		.vma_type = CKPT_VMA_FILE,
 		.restore = filemap_restore,
 	},
+	/* anonymous shared */
+	{
+		.vma_name = "ANON SHARED",
+		.vma_type = CKPT_VMA_SHM_ANON,
+	},
+	/* anonymous shared (skipped) */
+	{
+		.vma_name = "ANON SHARED (skip)",
+		.vma_type = CKPT_VMA_SHM_ANON_SKIP,
+	},
+	/* file-mapped shared */
+	{
+		.vma_name = "FILE SHARED",
+		.vma_type = CKPT_VMA_SHM_FILE,
+	},
 };
 
 /**
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 734be3a..be9411f 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -64,6 +64,16 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_inode_grab(void *ptr)
+{
+	return igrab((struct inode *) ptr) ? 0 : -EBADF;
+}
+
+static void obj_inode_drop(void *ptr, int lastref)
+{
+	iput((struct inode *) ptr);
+}
+
 static int obj_file_table_grab(void *ptr)
 {
 	atomic_inc(&((struct files_struct *) ptr)->count);
@@ -120,6 +130,13 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* inode object */
+	{
+		.obj_name = "INODE",
+		.obj_type = CKPT_OBJ_INODE,
+		.ref_drop = obj_inode_drop,
+		.ref_grab = obj_inode_grab,
+	},
 	/* files_struct object */
 	{
 		.obj_name = "FILE_TABLE",
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index cb13e8f..5a130d7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -170,11 +170,15 @@ extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
 extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma,
 				  enum vma_type type,
-				  int vma_objref);
+				  int vma_objref, int ino_objref);
 extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
 				  struct vm_area_struct *vma,
 				  enum vma_type type,
 				  int vma_objref);
+extern int shmem_vma_checkpoint(struct ckpt_ctx *ctx,
+				struct vm_area_struct *vma,
+				enum vma_type type,
+				int ino_objref);
 
 extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref);
@@ -187,11 +191,10 @@ extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
 
 
-#define CKPT_VMA_NOT_SUPPORTED					\
-	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB |		\
-	 VM_NONLINEAR | VM_PFNMAP | VM_RESERVED | VM_NORESERVE	\
-	 | VM_HUGETLB | VM_NONLINEAR | VM_MAPPED_COPY |		\
-	 VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
+#define CKPT_VMA_NOT_SUPPORTED						\
+	(VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP |		\
+	 VM_RESERVED | VM_NORESERVE | VM_HUGETLB | VM_NONLINEAR |	\
+	 VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
 
 static inline int ckpt_validate_errno(int errno)
 {
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 097d2e2..23095f2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -91,6 +91,7 @@ struct ckpt_hdr_objref {
 /* shared objects types */
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
+	CKPT_OBJ_INODE,
 	CKPT_OBJ_FILE_TABLE,
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
@@ -176,6 +177,7 @@ struct ckpt_hdr_task {
 /* task's shared resources */
 struct ckpt_hdr_task_objs {
 	struct ckpt_hdr h;
+
 	__s32 files_objref;
 	__s32 mm_objref;
 } __attribute__((aligned(8)));
@@ -256,6 +258,9 @@ enum vma_type {
 	CKPT_VMA_VDSO,		/* special vdso vma */
 	CKPT_VMA_ANON,		/* private anonymous */
 	CKPT_VMA_FILE,		/* private mapped file */
+	CKPT_VMA_SHM_ANON,	/* shared anonymous */
+	CKPT_VMA_SHM_ANON_SKIP,	/* shared anonymous (skip contents) */
+	CKPT_VMA_SHM_FILE,	/* shared mapped file, only msync */
 	CKPT_VMA_MAX
 };
 
@@ -264,6 +269,9 @@ struct ckpt_hdr_vma {
 	struct ckpt_hdr h;
 	__u32 vma_type;
 	__s32 vma_objref;	/* objref of backing file */
+	__s32 ino_objref;	/* objref of shared segment */
+	__u32 _padding;
+	__u64 ino_size;		/* size of shared segment */
 
 	__u64 vm_start;
 	__u64 vm_end;
diff --git a/mm/filemap.c b/mm/filemap.c
index ef7205b..055f126 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1654,6 +1654,8 @@ int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 {
 	struct file *file = vma->vm_file;
 	int vma_objref;
+	int ino_objref;
+	int first, ret;
 
 	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
 		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
@@ -1666,7 +1668,42 @@ int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 	if (vma_objref < 0)
 		return vma_objref;
 
-	return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
+	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+		/*
+		 * Citing mmap(2): "Updates to the mapping are visible
+		 * to other processes that map this file, and are
+		 * carried through to the underlying file. The file
+		 * may not actually be updated until msync(2) or
+		 * munmap(2) is called"
+		 *
+		 * Citing msync(2): "Without use of this call there is
+		 * no guarantee that changes are written back before
+		 * munmap(2) is called."
+		 *
+		 * Force msync for region of shared mapped files, to
+		 * ensure that the file system is consistent with
+		 * the checkpoint image.  (inspired by sys_msync).
+		 */
+
+		ino_objref = ckpt_obj_lookup_add(ctx, file->f_dentry->d_inode,
+					       CKPT_OBJ_INODE, &first);
+		if (ino_objref < 0)
+			return ino_objref;
+
+		if (first) {
+			ret = vfs_fsync(file, file->f_path.dentry, 0);
+			if (ret < 0)
+				return ret;
+		}
+
+		ret = generic_vma_checkpoint(ctx, vma, CKPT_VMA_SHM_FILE,
+					     vma_objref, ino_objref);
+	} else {
+		ret = private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE,
+					     vma_objref);
+	}
+
+	return ret;
 }
 
 int filemap_restore(struct ckpt_ctx *ctx,
diff --git a/mm/mmap.c b/mm/mmap.c
index 3be4678..aa262eb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2296,7 +2296,7 @@ static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
 	if (!name || strcmp(name, "[vdso]"))
 		return -ENOSYS;
 
-	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0, 0);
 }
 
 int special_mapping_restore(struct ckpt_ctx *ctx,
diff --git a/mm/shmem.c b/mm/shmem.c
index d80532b..3e50bd1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -30,6 +30,7 @@
 #include <linux/module.h>
 #include <linux/swap.h>
 #include <linux/ima.h>
+#include <linux/checkpoint.h>
 
 static struct vfsmount *shm_mnt;
 
@@ -2381,6 +2382,37 @@ static void shmem_destroy_inode(struct inode *inode)
 	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	enum vma_type vma_type;
+	int ino_objref;
+	int first, ret;
+
+	/* should be shared anonymous ... verify that this is the case */
+	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		return -ENOSYS;
+	}
+
+	BUG_ON(!vma->vm_file);
+
+	/* we collected the file but we don't checkpoint it per-se */
+	ret = ckpt_obj_visit(ctx, vma->vm_file, CKPT_OBJ_FILE);
+	if (ret < 0)
+		return ret;
+
+	ino_objref = ckpt_obj_lookup_add(ctx, vma->vm_file->f_dentry->d_inode,
+					 CKPT_OBJ_INODE, &first);
+	if (ino_objref < 0)
+		return ino_objref;
+
+	vma_type = (first ? CKPT_VMA_SHM_ANON : CKPT_VMA_SHM_ANON_SKIP);
+
+	return shmem_vma_checkpoint(ctx, vma, vma_type, ino_objref);
+}
+#endif /* CONFIG_CHECKPOINT */
+
 static void init_once(void *foo)
 {
 	struct shmem_inode_info *p = (struct shmem_inode_info *) foo;
@@ -2492,6 +2524,9 @@ static struct vm_operations_struct shmem_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= shmem_checkpoint,
+#endif
 };
 
 
-- 
1.6.0.4
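
The comment in filemap_checkpoint() cites mmap(2) and msync(2) to justify
flushing shared file mappings before the checkpoint image is written. A
rough userspace analogue of that idea is sketched below. It is only an
illustration under stated assumptions: the helper names
write_shared_and_sync() and read_back() are hypothetical and not part of
the patch, and the kernel code uses vfs_fsync() on the file rather than
msync() on the mapping.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: store msg into the file at path through a
 * MAP_SHARED mapping, then force write-back with msync(MS_SYNC).
 * msg must be non-empty. Returns 0 on success, -1 on failure.
 */
int write_shared_and_sync(const char *path, const char *msg)
{
	size_t len = strlen(msg);
	char *map;
	int fd, ret;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the fd is closed */
	if (map == MAP_FAILED)
		return -1;

	memcpy(map, msg, len);		/* update through the shared mapping */
	ret = msync(map, len, MS_SYNC);	/* do not rely on munmap() to flush */
	munmap(map, len);
	return ret;
}

/*
 * Hypothetical helper: read up to n - 1 bytes of the file into buf and
 * NUL-terminate it. Returns the number of bytes read, or -1 on failure.
 */
long read_back(const char *path, char *buf, size_t n)
{
	ssize_t got;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	got = read(fd, buf, n - 1);
	close(fd);
	if (got < 0)
		return -1;
	buf[got] = '\0';
	return (long)got;
}
```

Calling write_shared_and_sync() and then read_back() on the same path
should return the stored string. The read-back only confirms that the
data is visible through the file; it does not by itself show that the
msync() call was required.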


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 44/80] c/r: restore anonymous- and file-mapped- shared memory
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (42 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 43/80] c/r: dump anonymous- and file-mapped- " Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 45/80] splice: export pipe/file-to-pipe/file functionality Oren Laadan
                     ` (37 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

The bulk of the work is in ckpt_read_vma(), which has been refactored:
the part that creates the suitable 'struct file *' for the mapping is
now larger and moved to a separate function. What's left is to read
the VMA description, get the file pointer, create the mapping, and
proceed to read the contents in.
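
The flow described above (validate the VMA header, then hand it to a
per-type restore handler) can be sketched in plain userspace C. This is
a simplified illustration, not the patch's code: every sketch_* and sk_*
name is a hypothetical stand-in, and the handlers return a tag instead
of creating a mapping.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the patch's VMA types. */
enum sketch_vma_type {
	SK_VMA_FILE,		/* private file mapping */
	SK_VMA_SHM_ANON,	/* shared anonymous, first time seen */
	SK_VMA_SHM_ANON_SKIP,	/* shared anonymous, seen before */
	SK_VMA_SHM_FILE,	/* shared file mapping */
	SK_VMA_MAX
};

/* Hypothetical, trimmed-down VMA header. */
struct sketch_vma_hdr {
	uint32_t vma_type;
	int32_t vma_objref;
	int32_t ino_objref;
	uint64_t vm_start;
	uint64_t vm_end;
};

typedef int (*sketch_restore_fn)(const struct sketch_vma_hdr *h);

/* Toy handler for both shared-anonymous types; returns tag 2. */
int sk_restore_shmem(const struct sketch_vma_hdr *h)
{
	(void)h;
	return 2;
}

/* Toy handler for file mappings: tag 1 if private, tag 3 if shared. */
int sk_restore_filemap(const struct sketch_vma_hdr *h)
{
	return h->vma_type == SK_VMA_FILE ? 1 : 3;
}

/* Dispatch table indexed by VMA type. */
static const sketch_restore_fn sketch_restore_ops[SK_VMA_MAX] = {
	[SK_VMA_FILE]		= sk_restore_filemap,
	[SK_VMA_SHM_ANON]	= sk_restore_shmem,
	[SK_VMA_SHM_ANON_SKIP]	= sk_restore_shmem,
	[SK_VMA_SHM_FILE]	= sk_restore_filemap,
};

/* Sanity-check the header, then call the handler for its type. */
int sketch_restore_vma(const struct sketch_vma_hdr *h)
{
	if (h->vm_end < h->vm_start)
		return -EINVAL;
	if (h->vma_objref < 0 || h->ino_objref < 0)
		return -EINVAL;
	if (h->vma_type >= SK_VMA_MAX)
		return -EINVAL;
	if (!sketch_restore_ops[h->vma_type])
		return -ENOSYS;
	return sketch_restore_ops[h->vma_type](h);
}
```

A header with a negative object reference, an inverted address range, or
an out-of-range type is rejected with -EINVAL before any handler runs.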

Anonymous shared VMAs whose contents were already read earlier (as
indicated by a lookup in the objhash) and file-mapped shared VMAs only
have their mapping re-created; reading contents is skipped for them.
Anonymous shared VMAs seen for the first time have their contents read
directly into the backing inode, indexed by page number (as opposed to
virtual address).
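
A minimal userspace sketch of the two ideas in this paragraph follows:
deciding "first time" versus "skip" with a lookup-or-add on an object
table, and addressing shared contents by page index rather than by
virtual address. The table, the function names, and the 4 KiB page size
are assumptions made for illustration; the real objhash and the patch's
helpers differ.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_MAX_OBJS 16

enum sketch_shm_kind { SKETCH_SHM_ANON, SKETCH_SHM_ANON_SKIP };

/* Hypothetical stand-in for the objhash: a tiny array of seen objects. */
struct sketch_obj_table {
	const void *objs[SKETCH_MAX_OBJS];
	size_t nr;
};

/*
 * Look obj up; add it if missing. Returns a positive object reference
 * and sets *first to 1 only when obj was added by this call. Returns
 * -1 if the table is full.
 */
int sketch_obj_lookup_add(struct sketch_obj_table *t, const void *obj,
			  int *first)
{
	size_t i;

	for (i = 0; i < t->nr; i++) {
		if (t->objs[i] == obj) {
			*first = 0;
			return (int)i + 1;
		}
	}
	if (t->nr == SKETCH_MAX_OBJS)
		return -1;
	t->objs[t->nr++] = obj;
	*first = 1;
	return (int)t->nr;
}

/*
 * Classify a shared anonymous mapping by its backing inode: the first
 * mapping of an inode carries the contents, later ones are skipped.
 * Returns the kind, or -1 on error; *objref receives the reference.
 */
int sketch_classify(struct sketch_obj_table *t, const void *inode,
		    int *objref)
{
	int first;

	*objref = sketch_obj_lookup_add(t, inode, &first);
	if (*objref < 0)
		return -1;
	return first ? SKETCH_SHM_ANON : SKETCH_SHM_ANON_SKIP;
}

/*
 * Page index inside the backing object for a virtual address that lies
 * in a mapping starting at vm_start with page offset vm_pgoff.
 */
unsigned long sketch_page_index(unsigned long vm_start,
				unsigned long vm_pgoff,
				unsigned long addr)
{
	return vm_pgoff + ((addr - vm_start) >> SKETCH_PAGE_SHIFT);
}
```

Because the index is relative to the backing object, two processes that
map the same shared segment at different virtual addresses still name
the same page with the same number.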

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/memory.c        |   66 ++++++++++++++++++++++++++++++++-----------
 include/linux/checkpoint.h |    6 ++++
 include/linux/mm.h         |    2 +
 mm/filemap.c               |   13 ++++++++-
 mm/shmem.c                 |   49 ++++++++++++++++++++++++++++++++
 5 files changed, 118 insertions(+), 18 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 697896f..f765993 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -849,13 +849,36 @@ static int restore_read_page(struct ckpt_ctx *ctx, struct page *page, void *p)
 	return 0;
 }
 
+static struct page *bring_private_page(unsigned long addr)
+{
+	struct page *page;
+	int ret;
+
+	ret = get_user_pages(current, current->mm, addr, 1, 1, 1, &page, NULL);
+	if (ret < 0)
+		page = ERR_PTR(ret);
+	return page;
+}
+
+static struct page *bring_shared_page(unsigned long idx, struct inode *ino)
+{
+	struct page *page = NULL;
+	int ret;
+
+	ret = shmem_getpage(ino, idx, &page, SGP_WRITE, NULL);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (page)
+		unlock_page(page);
+	return page;
+}
+
 /**
  * read_pages_contents - read in data of pages in page-array chain
  * @ctx - restart context
  */
-static int read_pages_contents(struct ckpt_ctx *ctx)
+static int read_pages_contents(struct ckpt_ctx *ctx, struct inode *inode)
 {
-	struct mm_struct *mm = current->mm;
 	struct ckpt_pgarr *pgarr;
 	unsigned long *vaddrs;
 	char *buf;
@@ -865,17 +888,22 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
 	if (!buf)
 		return -ENOMEM;
 
-	down_read(&mm->mmap_sem);
+	down_read(&current->mm->mmap_sem);
 	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
 		vaddrs = pgarr->vaddrs;
 		for (i = 0; i < pgarr->nr_used; i++) {
 			struct page *page;
 
 			_ckpt_debug(CKPT_DPAGE, "got page %#lx\n", vaddrs[i]);
-			ret = get_user_pages(current, mm, vaddrs[i],
-					     1, 1, 1, &page, NULL);
-			if (ret < 0)
+			if (inode)
+				page = bring_shared_page(vaddrs[i], inode);
+			else
+				page = bring_private_page(vaddrs[i]);
+
+			if (IS_ERR(page)) {
+				ret = PTR_ERR(page);
 				goto out;
+			}
 
 			ret = restore_read_page(ctx, page, buf);
 			page_cache_release(page);
@@ -886,14 +914,15 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
 	}
 
  out:
-	up_read(&mm->mmap_sem);
+	up_read(&current->mm->mmap_sem);
 	kfree(buf);
 	return 0;
 }
 
 /**
- * restore_memory_contents - restore contents of a VMA with private memory
+ * restore_memory_contents - restore contents of a memory region
  * @ctx - restart context
+ * @inode - backing inode
  *
  * Reads a header that specifies how many pages will follow, then reads
  * a list of virtual addresses into ctx->pgarr_list page-array chain,
@@ -901,7 +930,7 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
  * these steps until reaching a header specifying "0" pages, which marks
  * the end of the contents.
  */
-static int restore_memory_contents(struct ckpt_ctx *ctx)
+int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode)
 {
 	struct ckpt_hdr_pgarr *h;
 	unsigned long nr_pages;
@@ -928,7 +957,7 @@ static int restore_memory_contents(struct ckpt_ctx *ctx)
 		ret = read_pages_vaddrs(ctx, nr_pages);
 		if (ret < 0)
 			break;
-		ret = read_pages_contents(ctx);
+		ret = read_pages_contents(ctx, inode);
 		if (ret < 0)
 			break;
 		pgarr_reset_all(ctx);
@@ -986,9 +1015,9 @@ static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
  * @file - file to map (NULL for anonymous)
  * @h - vma header data
  */
-static unsigned long generic_vma_restore(struct mm_struct *mm,
-					 struct file *file,
-					 struct ckpt_hdr_vma *h)
+unsigned long generic_vma_restore(struct mm_struct *mm,
+				  struct file *file,
+				  struct ckpt_hdr_vma *h)
 {
 	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
 	unsigned long addr;
@@ -1033,7 +1062,7 @@ int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 	if (IS_ERR((void *) addr))
 		return PTR_ERR((void *) addr);
 
-	return restore_memory_contents(ctx);
+	return restore_memory_contents(ctx, NULL);
 }
 
 /**
@@ -1093,16 +1122,19 @@ static struct restore_vma_ops restore_vma_ops[] = {
 	{
 		.vma_name = "ANON SHARED",
 		.vma_type = CKPT_VMA_SHM_ANON,
+		.restore = shmem_restore,
 	},
 	/* anonymous shared (skipped) */
 	{
 		.vma_name = "ANON SHARED (skip)",
 		.vma_type = CKPT_VMA_SHM_ANON_SKIP,
+		.restore = shmem_restore,
 	},
 	/* file-mapped shared */
 	{
 		.vma_name = "FILE SHARED",
 		.vma_type = CKPT_VMA_SHM_FILE,
+		.restore = filemap_restore,
 	},
 };
 
@@ -1121,15 +1153,15 @@ static int restore_vma(struct ckpt_ctx *ctx, struct mm_struct *mm)
 	if (IS_ERR(h))
 		return PTR_ERR(h);
 
-	ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d\n",
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d inoref %d\n",
 		   (unsigned long) h->vm_start, (unsigned long) h->vm_end,
 		   (unsigned long) h->vm_flags, (int) h->vma_type,
-		   (int) h->vma_objref);
+		   (int) h->vma_objref, (int) h->ino_objref);
 
 	ret = -EINVAL;
 	if (h->vm_end < h->vm_start)
 		goto out;
-	if (h->vma_objref < 0)
+	if (h->vma_objref < 0 || h->ino_objref < 0)
 		goto out;
 	if (h->vma_type >= CKPT_VMA_MAX)
 		goto out;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 5a130d7..2770fc2 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -187,9 +187,15 @@ extern int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_mm(struct ckpt_ctx *ctx);
 
+extern unsigned long generic_vma_restore(struct mm_struct *mm,
+					 struct file *file,
+					 struct ckpt_hdr_vma *h);
+
 extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
 
+extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
+
 
 #define CKPT_VMA_NOT_SUPPORTED						\
 	(VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP |		\
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b565a82..3632e66 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1188,6 +1188,8 @@ extern int filemap_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			   struct ckpt_hdr_vma *hh);
 extern int special_mapping_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 				   struct ckpt_hdr_vma *hh);
+extern int shmem_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			 struct ckpt_hdr_vma *hh);
 #endif
 
 /* readahead.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 055f126..eb7653d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1711,17 +1711,28 @@ int filemap_restore(struct ckpt_ctx *ctx,
 		    struct ckpt_hdr_vma *h)
 {
 	struct file *file;
+	unsigned long addr;
 	int ret;
 
 	if (h->vma_type == CKPT_VMA_FILE &&
 	    (h->vm_flags & (VM_SHARED | VM_MAYSHARE)))
 		return -EINVAL;
+	if (h->vma_type == CKPT_VMA_SHM_FILE &&
+	    !(h->vm_flags & (VM_SHARED | VM_MAYSHARE)))
+		return -EINVAL;
 
 	file = ckpt_obj_fetch(ctx, h->vma_objref, CKPT_OBJ_FILE);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
-	ret = private_vma_restore(ctx, mm, file, h);
+	if (h->vma_type == CKPT_VMA_FILE) {
+		/* private mapped file */
+		ret = private_vma_restore(ctx, mm, file, h);
+	} else {
+		/* shared mapped file */
+		addr = generic_vma_restore(mm, file, h);
+		ret = (IS_ERR((void *) addr) ? PTR_ERR((void *) addr) : 0);
+	}
 	return ret;
 }
 #endif /* CONFIG_CHECKPOINT */
diff --git a/mm/shmem.c b/mm/shmem.c
index 3e50bd1..d1e348f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2411,6 +2411,55 @@ static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 
 	return shmem_vma_checkpoint(ctx, vma, vma_type, ino_objref);
 }
+
+int shmem_restore(struct ckpt_ctx *ctx,
+		  struct mm_struct *mm, struct ckpt_hdr_vma *h)
+{
+	unsigned long addr;
+	struct file *file;
+	int ret = 0;
+
+	file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
+	if (PTR_ERR(file) == -EINVAL)
+		file = NULL;
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	/* if file is NULL, this is the premiere - create and insert */
+	if (!file) {
+		if (h->vma_type != CKPT_VMA_SHM_ANON)
+			return -EINVAL;
+		/*
+		 * in theory could pass NULL to mmap and let it create
+		 * the file. But, if 'shm_size != vm_end - vm_start',
+		 * or if 'vm_pgoff != 0', then the vma reflects only a
+		 * portion of the shm object and we need to "manually"
+		 * create the full shm object.
+		 */
+		file = shmem_file_setup("/dev/zero", h->ino_size, h->vm_flags);
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+		ret = ckpt_obj_insert(ctx, file, h->ino_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+	} else {
+		if (h->vma_type != CKPT_VMA_SHM_ANON_SKIP)
+			return -EINVAL;
+		/* Already need fput() for the file above; keep path simple */
+		get_file(file);
+	}
+
+	addr = generic_vma_restore(mm, file, h);
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	if (h->vma_type == CKPT_VMA_SHM_ANON)
+		ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
+ out:
+	fput(file);
+	return ret;
+}
+
 #endif /* CONFIG_CHECKPOINT */
 
 static void init_once(void *foo)
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 44/80] c/r: restore anonymous- and file-mapped- shared memory
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

The bulk of the work is in ckpt_read_vma(), which has been refactored:
the part that create the suitable 'struct file *' for the mapping is
now larger and moved to a separate function. What's left is to read
the VMA description, get the file pointer, create the mapping, and
proceed to read the contents in.

Both anonymous shared VMAs that have been read earlier (as indicated
by a look up to objhash) and file-mapped shared VMAs are skipped.
Anonymous shared VMAs seen for the first time have their contents
read in directly to the backing inode, as indexed by the page numbers
(as opposed to virtual addresses).

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/memory.c        |   66 ++++++++++++++++++++++++++++++++-----------
 include/linux/checkpoint.h |    6 ++++
 include/linux/mm.h         |    2 +
 mm/filemap.c               |   13 ++++++++-
 mm/shmem.c                 |   49 ++++++++++++++++++++++++++++++++
 5 files changed, 118 insertions(+), 18 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 697896f..f765993 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -849,13 +849,36 @@ static int restore_read_page(struct ckpt_ctx *ctx, struct page *page, void *p)
 	return 0;
 }
 
+static struct page *bring_private_page(unsigned long addr)
+{
+	struct page *page;
+	int ret;
+
+	ret = get_user_pages(current, current->mm, addr, 1, 1, 1, &page, NULL);
+	if (ret < 0)
+		page = ERR_PTR(ret);
+	return page;
+}
+
+static struct page *bring_shared_page(unsigned long idx, struct inode *ino)
+{
+	struct page *page = NULL;
+	int ret;
+
+	ret = shmem_getpage(ino, idx, &page, SGP_WRITE, NULL);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (page)
+		unlock_page(page);
+	return page;
+}
+
 /**
  * read_pages_contents - read in data of pages in page-array chain
  * @ctx - restart context
  */
-static int read_pages_contents(struct ckpt_ctx *ctx)
+static int read_pages_contents(struct ckpt_ctx *ctx, struct inode *inode)
 {
-	struct mm_struct *mm = current->mm;
 	struct ckpt_pgarr *pgarr;
 	unsigned long *vaddrs;
 	char *buf;
@@ -865,17 +888,22 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
 	if (!buf)
 		return -ENOMEM;
 
-	down_read(&mm->mmap_sem);
+	down_read(&current->mm->mmap_sem);
 	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
 		vaddrs = pgarr->vaddrs;
 		for (i = 0; i < pgarr->nr_used; i++) {
 			struct page *page;
 
 			_ckpt_debug(CKPT_DPAGE, "got page %#lx\n", vaddrs[i]);
-			ret = get_user_pages(current, mm, vaddrs[i],
-					     1, 1, 1, &page, NULL);
-			if (ret < 0)
+			if (inode)
+				page = bring_shared_page(vaddrs[i], inode);
+			else
+				page = bring_private_page(vaddrs[i]);
+
+			if (IS_ERR(page)) {
+				ret = PTR_ERR(page);
 				goto out;
+			}
 
 			ret = restore_read_page(ctx, page, buf);
 			page_cache_release(page);
@@ -886,14 +914,15 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
 	}
 
  out:
-	up_read(&mm->mmap_sem);
+	up_read(&current->mm->mmap_sem);
 	kfree(buf);
 	return 0;
 }
 
 /**
- * restore_memory_contents - restore contents of a VMA with private memory
+ * restore_memory_contents - restore contents of a memory region
  * @ctx - restart context
+ * @inode - backing inode
  *
  * Reads a header that specifies how many pages will follow, then reads
  * a list of virtual addresses into ctx->pgarr_list page-array chain,
@@ -901,7 +930,7 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
  * these steps until reaching a header specifying "0" pages, which marks
  * the end of the contents.
  */
-static int restore_memory_contents(struct ckpt_ctx *ctx)
+int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode)
 {
 	struct ckpt_hdr_pgarr *h;
 	unsigned long nr_pages;
@@ -928,7 +957,7 @@ static int restore_memory_contents(struct ckpt_ctx *ctx)
 		ret = read_pages_vaddrs(ctx, nr_pages);
 		if (ret < 0)
 			break;
-		ret = read_pages_contents(ctx);
+		ret = read_pages_contents(ctx, inode);
 		if (ret < 0)
 			break;
 		pgarr_reset_all(ctx);
@@ -986,9 +1015,9 @@ static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
  * @file - file to map (NULL for anonymous)
  * @h - vma header data
  */
-static unsigned long generic_vma_restore(struct mm_struct *mm,
-					 struct file *file,
-					 struct ckpt_hdr_vma *h)
+unsigned long generic_vma_restore(struct mm_struct *mm,
+				  struct file *file,
+				  struct ckpt_hdr_vma *h)
 {
 	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
 	unsigned long addr;
@@ -1033,7 +1062,7 @@ int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 	if (IS_ERR((void *) addr))
 		return PTR_ERR((void *) addr);
 
-	return restore_memory_contents(ctx);
+	return restore_memory_contents(ctx, NULL);
 }
 
 /**
@@ -1093,16 +1122,19 @@ static struct restore_vma_ops restore_vma_ops[] = {
 	{
 		.vma_name = "ANON SHARED",
 		.vma_type = CKPT_VMA_SHM_ANON,
+		.restore = shmem_restore,
 	},
 	/* anonymous shared (skipped) */
 	{
 		.vma_name = "ANON SHARED (skip)",
 		.vma_type = CKPT_VMA_SHM_ANON_SKIP,
+		.restore = shmem_restore,
 	},
 	/* file-mapped shared */
 	{
 		.vma_name = "FILE SHARED",
 		.vma_type = CKPT_VMA_SHM_FILE,
+		.restore = filemap_restore,
 	},
 };
 
@@ -1121,15 +1153,15 @@ static int restore_vma(struct ckpt_ctx *ctx, struct mm_struct *mm)
 	if (IS_ERR(h))
 		return PTR_ERR(h);
 
-	ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d\n",
+	ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d inoref %d\n",
 		   (unsigned long) h->vm_start, (unsigned long) h->vm_end,
 		   (unsigned long) h->vm_flags, (int) h->vma_type,
-		   (int) h->vma_objref);
+		   (int) h->vma_objref, (int) h->ino_objref);
 
 	ret = -EINVAL;
 	if (h->vm_end < h->vm_start)
 		goto out;
-	if (h->vma_objref < 0)
+	if (h->vma_objref < 0 || h->ino_objref < 0)
 		goto out;
 	if (h->vma_type >= CKPT_VMA_MAX)
 		goto out;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 5a130d7..2770fc2 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -187,9 +187,15 @@ extern int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_mm(struct ckpt_ctx *ctx);
 
+extern unsigned long generic_vma_restore(struct mm_struct *mm,
+					 struct file *file,
+					 struct ckpt_hdr_vma *h);
+
 extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
 
+extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
+
 
 #define CKPT_VMA_NOT_SUPPORTED						\
 	(VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP |		\
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b565a82..3632e66 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1188,6 +1188,8 @@ extern int filemap_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			   struct ckpt_hdr_vma *hh);
 extern int special_mapping_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 				   struct ckpt_hdr_vma *hh);
+extern int shmem_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			 struct ckpt_hdr_vma *hh);
 #endif
 
 /* readahead.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 055f126..eb7653d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1711,17 +1711,28 @@ int filemap_restore(struct ckpt_ctx *ctx,
 		    struct ckpt_hdr_vma *h)
 {
 	struct file *file;
+	unsigned long addr;
 	int ret;
 
 	if (h->vma_type == CKPT_VMA_FILE &&
 	    (h->vm_flags & (VM_SHARED | VM_MAYSHARE)))
 		return -EINVAL;
+	if (h->vma_type == CKPT_VMA_SHM_FILE &&
+	    !(h->vm_flags & (VM_SHARED | VM_MAYSHARE)))
+		return -EINVAL;
 
 	file = ckpt_obj_fetch(ctx, h->vma_objref, CKPT_OBJ_FILE);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
-	ret = private_vma_restore(ctx, mm, file, h);
+	if (h->vma_type == CKPT_VMA_FILE) {
+		/* private mapped file */
+		ret = private_vma_restore(ctx, mm, file, h);
+	} else {
+		/* shared mapped file */
+		addr = generic_vma_restore(mm, file, h);
+		ret = (IS_ERR((void *) addr) ? PTR_ERR((void *) addr) : 0);
+	}
 	return ret;
 }
 #endif /* CONFIG_CHECKPOINT */
diff --git a/mm/shmem.c b/mm/shmem.c
index 3e50bd1..d1e348f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2411,6 +2411,55 @@ static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 
 	return shmem_vma_checkpoint(ctx, vma, vma_type, ino_objref);
 }
+
+int shmem_restore(struct ckpt_ctx *ctx,
+		  struct mm_struct *mm, struct ckpt_hdr_vma *h)
+{
+	unsigned long addr;
+	struct file *file;
+	int ret = 0;
+
+	file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
+	if (PTR_ERR(file) == -EINVAL)
+		file = NULL;
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	/* if file is NULL, this is the premiere - create and insert */
+	if (!file) {
+		if (h->vma_type != CKPT_VMA_SHM_ANON)
+			return -EINVAL;
+		/*
+		 * in theory could pass NULL to mmap and let it create
+		 * the file. But, if 'shm_size != vm_end - vm_start',
+		 * or if 'vm_pgoff != 0', then the vma reflects only a
+		 * portion of the shm object and we need to "manually"
+		 * create the full shm object.
+		 */
+		file = shmem_file_setup("/dev/zero", h->ino_size, h->vm_flags);
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+		ret = ckpt_obj_insert(ctx, file, h->ino_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+	} else {
+		if (h->vma_type != CKPT_VMA_SHM_ANON_SKIP)
+			return -EINVAL;
+		/* Already need fput() for the file above; keep path simple */
+		get_file(file);
+	}
+
+	addr = generic_vma_restore(mm, file, h);
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	if (h->vma_type == CKPT_VMA_SHM_ANON)
+		ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
+ out:
+	fput(file);
+	return ret;
+}
+
 #endif /* CONFIG_CHECKPOINT */
 
 static void init_once(void *foo)
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 45/80] splice: export pipe/file-to-pipe/file functionality
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (43 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 44/80] c/r: restore " Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 46/80] c/r: support for open pipes Oren Laadan
                     ` (36 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

During c/r of pipes we need to save and restore pipe buffers. But
do_splice() requires two file descriptors, so we can't use it: we
always have one file descriptor (the checkpoint image) and one
pipe_inode_info.

This patch exports interfaces that work at the pipe_inode_info level,
namely link_pipe(), do_splice_to() and do_splice_from(). They are used
in the following patch to save and restore pipe buffers without
unnecessary data copies.

It slightly modifies both do_splice_to() and do_splice_from() to
detect the case of pipe-to-pipe transfer, in which case they invoke
splice_pipe_to_pipe() directly.
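
As a rough user-space analogue of the pipe_info() check that this
patch moves up in fs/splice.c, the sketch below decides whether a
descriptor refers to a pipe by testing the file type with S_ISFIFO().
It is only an illustration, not kernel code; the helper name
fd_is_fifo() is hypothetical and does not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: return true if 'fd' refers to a pipe or FIFO.
 * Mirrors the idea of pipe_info(): look at the file type, not at a
 * pointer field that may be shared with other object types.
 */
static bool fd_is_fifo(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return false;	/* invalid descriptor: not a pipe */

	return S_ISFIFO(st.st_mode);
}
```

A caller would use such a test to pick a pipe-to-pipe path before
falling back to the generic file path, which is the decision that
do_splice_to() and do_splice_from() make after this patch.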

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 fs/splice.c            |   61 ++++++++++++++++++++++++++++++++---------------
 include/linux/splice.h |    9 +++++++
 2 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 73766d2..f251b4c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1055,18 +1055,43 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,
 EXPORT_SYMBOL(generic_splice_sendpage);
 
 /*
+ * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
+ * location, so checking ->i_pipe is not enough to verify that this is a
+ * pipe.
+ */
+static inline struct pipe_inode_info *pipe_info(struct inode *inode)
+{
+	if (S_ISFIFO(inode->i_mode))
+		return inode->i_pipe;
+
+	return NULL;
+}
+
+static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
+			       struct pipe_inode_info *opipe,
+			       size_t len, unsigned int flags);
+
+/*
  * Attempt to initiate a splice from pipe to file.
  */
-static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
-			   loff_t *ppos, size_t len, unsigned int flags)
+long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
+		    loff_t *ppos, size_t len, unsigned int flags)
 {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *,
 				loff_t *, size_t, unsigned int);
+	struct pipe_inode_info *opipe;
 	int ret;
 
 	if (unlikely(!(out->f_mode & FMODE_WRITE)))
 		return -EBADF;
 
+	/* When called directly (e.g. from c/r) output may be a pipe */
+	opipe = pipe_info(out->f_path.dentry->d_inode);
+	if (opipe) {
+		BUG_ON(opipe == pipe);
+		return splice_pipe_to_pipe(pipe, opipe, len, flags);
+	}
+
 	if (unlikely(out->f_flags & O_APPEND))
 		return -EINVAL;
 
@@ -1084,17 +1109,25 @@ static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
 /*
  * Attempt to initiate a splice from a file to a pipe.
  */
-static long do_splice_to(struct file *in, loff_t *ppos,
-			 struct pipe_inode_info *pipe, size_t len,
-			 unsigned int flags)
+long do_splice_to(struct file *in, loff_t *ppos,
+		  struct pipe_inode_info *pipe, size_t len,
+		  unsigned int flags)
 {
 	ssize_t (*splice_read)(struct file *, loff_t *,
 			       struct pipe_inode_info *, size_t, unsigned int);
+	struct pipe_inode_info *ipipe;
 	int ret;
 
 	if (unlikely(!(in->f_mode & FMODE_READ)))
 		return -EBADF;
 
+	/* When called directly (e.g. from c/r) input may be a pipe */
+	ipipe = pipe_info(in->f_path.dentry->d_inode);
+	if (ipipe) {
+		BUG_ON(ipipe == pipe);
+		return splice_pipe_to_pipe(ipipe, pipe, len, flags);
+	}
+
 	ret = rw_verify_area(READ, in, ppos, len);
 	if (unlikely(ret < 0))
 		return ret;
@@ -1273,18 +1306,6 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
-/*
- * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
- * location, so checking ->i_pipe is not enough to verify that this is a
- * pipe.
- */
-static inline struct pipe_inode_info *pipe_info(struct inode *inode)
-{
-	if (S_ISFIFO(inode->i_mode))
-		return inode->i_pipe;
-
-	return NULL;
-}
 
 /*
  * Determine where to splice to/from.
@@ -1887,9 +1908,9 @@ retry:
 /*
  * Link contents of ipipe to opipe.
  */
-static int link_pipe(struct pipe_inode_info *ipipe,
-		     struct pipe_inode_info *opipe,
-		     size_t len, unsigned int flags)
+int link_pipe(struct pipe_inode_info *ipipe,
+	      struct pipe_inode_info *opipe,
+	      size_t len, unsigned int flags)
 {
 	struct pipe_buffer *ibuf, *obuf;
 	int ret = 0, i = 0, nbuf;
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 18e7c7c..431662c 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -82,4 +82,13 @@ extern ssize_t splice_to_pipe(struct pipe_inode_info *,
 extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *,
 				      splice_direct_actor *);
 
+extern int link_pipe(struct pipe_inode_info *ipipe,
+		     struct pipe_inode_info *opipe,
+		     size_t len, unsigned int flags);
+extern long do_splice_to(struct file *in, loff_t *ppos,
+			 struct pipe_inode_info *pipe, size_t len,
+			 unsigned int flags);
+extern long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
+			   loff_t *ppos, size_t len, unsigned int flags);
+
 #endif
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 46/80] c/r: support for open pipes
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (44 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 45/80] splice: export pipe/file-to-pipe/file functionality Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 47/80] c/r: checkpoint and restore FIFOs Oren Laadan
                     ` (35 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

A pipe is a double-headed inode with a buffer attached to it. We
checkpoint the pipe buffer only once, as soon as we hit one side of
the pipe, regardless of whether it is the read end or the write end.

To checkpoint a file descriptor that refers to a pipe (either end), we
first look up the inode in the hash table. If it is not found, this is
the first encounter of this pipe: besides the file descriptor, we also
(a) save the pipe data, and (b) register the pipe inode in the hash. If
it is found, this is the second encounter of this pipe, namely, we have
hit the other end of the same pipe. In both cases we write the
pipe_objref of the inode.
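
The first/second encounter logic can be sketched in plain C as
follows. This is a simplified illustration only: the names obj_table
and obj_lookup_add() are hypothetical, and a linear array stands in
for the hash table used by the real ckpt_obj_lookup_add().

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_TABLE_MAX 64	/* arbitrary capacity for this sketch */

struct obj_table {
	const void *ptr[OBJ_TABLE_MAX];
	int count;
};

/*
 * Return the objref (1-based, so always > 0) for 'ptr', adding it to
 * the table if it is not there yet. '*first' is set to 1 on the first
 * encounter and to 0 on later ones. Returns -1 if the table is full.
 */
static int obj_lookup_add(struct obj_table *t, const void *ptr, int *first)
{
	int i;

	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*first = 0;	/* seen before: other end of the pipe */
			return i + 1;
		}
	}

	if (t->count == OBJ_TABLE_MAX)
		return -1;

	t->ptr[t->count++] = ptr;
	*first = 1;			/* caller should also save the data */
	return t->count;
}
```

Only when 'first' is set does the caller need to save the pipe data;
in both cases the returned objref is what gets written to the image.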

To restore, create a new pipe and thus have two file pointers (read
and write ends). We only use one of them, depending on which side was
checkpointed first. We register the file pointer of the other end in
the hash table, with the pipe_objref given for this pipe from the
checkpoint, to be used later when the other end arrives. At this point
we also restore the contents of the pipe buffers.

To save the pipe buffer, given a source pipe, use link_pipe() (the
helper behind do_tee()) to clone its contents into a temporary
'struct pipe_inode_info', and then use do_splice_from() to transfer it
directly to the checkpoint image file.

To restore the pipe buffer, with a freshly allocated target pipe, use
do_splice_to() to splice the data directly from the checkpoint image
file into the pipe.
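
For readers who want to try the same idea from user space, here is a
minimal sketch using the tee(2) and splice(2) system calls, which are
the user-visible counterparts of the in-kernel helpers. It is not the
code in this patch: the helper names save_pipe_contents() and
restore_pipe_contents() are hypothetical, and it assumes a Linux
system where the image file supports splice.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy whatever is buffered in the pipe whose
 * read end is 'src_rd' into 'out_fd' without consuming it.
 * Returns the number of bytes saved, or -1 on error.
 */
static ssize_t save_pipe_contents(int src_rd, int out_fd)
{
	int tmp[2];
	ssize_t len, done = 0;

	if (pipe(tmp) < 0)
		return -1;

	/* Clone (do not consume) the source buffers into a scratch pipe. */
	len = tee(src_rd, tmp[1], INT_MAX, SPLICE_F_NONBLOCK);
	if (len < 0 && errno == EAGAIN)
		len = 0;	/* empty pipe: nothing to save */
	if (len < 0) {
		done = -1;
		goto out;
	}

	/* Move the cloned data from the scratch pipe into the file. */
	while (done < len) {
		ssize_t n = splice(tmp[0], NULL, out_fd, NULL,
				   (size_t)(len - done), 0);
		if (n <= 0) {
			done = -1;
			break;
		}
		done += n;
	}
out:
	close(tmp[0]);
	close(tmp[1]);
	return done;
}

/*
 * Hypothetical helper: splice 'len' bytes from 'in_fd' into the write
 * end 'dst_wr' of a fresh pipe. Returns 'len', or -1 on error or if
 * the image is shorter than expected.
 */
static ssize_t restore_pipe_contents(int in_fd, int dst_wr, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = splice(in_fd, NULL, dst_wr, NULL, len - done, 0);

		if (n <= 0)
			return -1;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

As in the patch, the length is recorded separately (here it is the
return value of the save helper) so that the restore side knows how
many bytes to splice back.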

Changelog[v18]:
  - Adjust format of pipe buffer to include the mandatory pre-header
Changelog[v17]:
  - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/files.c             |    7 ++
 fs/pipe.c                      |  157 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    7 ++
 include/linux/pipe_fs_i.h      |    8 ++
 4 files changed, 179 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 3cdfdb3..042f620 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -17,6 +17,7 @@
 #include <linux/file.h>
 #include <linux/fdtable.h>
 #include <linux/fsnotify.h>
+#include <linux/pipe_fs_i.h>
 #include <linux/syscalls.h>
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
@@ -564,6 +565,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_GENERIC,
 		.restore = generic_file_restore,
 	},
+	/* pipes */
+	{
+		.file_name = "PIPE",
+		.file_type = CKPT_FILE_PIPE,
+		.restore = pipe_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 52c4151..30b34a2 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -13,11 +13,13 @@
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/splice.h>
 #include <linux/uio.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/audit.h>
 #include <linux/syscalls.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -809,6 +811,158 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
+{
+	struct pipe_inode_info *pipe;
+	int len, ret = -ENOMEM;
+
+	pipe = alloc_pipe_info(NULL);
+	if (!pipe)
+		return ret;
+
+	pipe->readers = 1;	/* bluff link_pipe() below */
+	len = link_pipe(inode->i_pipe, pipe, INT_MAX, SPLICE_F_NONBLOCK);
+	if (len == -EAGAIN)
+		len = 0;
+	if (len < 0) {
+		ret = len;
+		goto out;
+	}
+
+	ret = ckpt_write_obj_type(ctx, NULL, len, CKPT_HDR_PIPE_BUF);
+	if (ret < 0)
+		goto out;
+
+	ret = do_splice_from(pipe, ctx->file, &ctx->file->f_pos, len, 0);
+	if (ret < 0)
+		goto out;
+	if (ret != len)
+		ret = -EPIPE;  /* can occur due to an error in target file */
+ out:
+	__free_pipe_info(pipe);
+	return ret;
+}
+
+static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_pipe *h;
+	struct inode *inode = file->f_dentry->d_inode;
+	int objref, first, ret;
+
+	objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+	if (objref < 0)
+		return objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_PIPE;
+	h->pipe_objref = objref;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+
+	if (first)
+		ret = checkpoint_pipe(ctx, inode);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int restore_pipe(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct pipe_inode_info *pipe;
+	int len, ret;
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_PIPE_BUF);
+	if (len < 0)
+		return len;
+
+	pipe = file->f_dentry->d_inode->i_pipe;
+	ret = do_splice_to(ctx->file, &ctx->file->f_pos, pipe, len, 0);
+
+	if (ret >= 0 && ret != len)
+		ret = -EPIPE;  /* can occur due to an error in source file */
+
+	return ret;
+}
+
+struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr;
+	struct file *file;
+	int fds[2], which, ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_PIPE)
+		return ERR_PTR(-EINVAL);
+
+	if (h->pipe_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	file = ckpt_obj_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE);
+	/*
+	 * If ckpt_obj_fetch() returned ERR_PTR(-EINVAL), then this is
+	 * the first time we see this pipe so need to restore the
+	 * contents.  Otherwise, use the file pointer and skip forward.
+	 */
+	if (!IS_ERR(file)) {
+		get_file(file);
+	} else if (PTR_ERR(file) == -EINVAL) {
+		/* first encounter of this pipe: create it */
+		ret = do_pipe_flags(fds, 0);
+		if (ret < 0)
+			return ERR_PTR(ret);
+
+		which = (ptr->f_flags & O_WRONLY ? 1 : 0);
+		/*
+		 * Below we return the file corresponding to one side
+		 * of the pipe for our caller to use. Now insert the
+		 * other side of the pipe to the hash, to be picked up
+		 * when that side is restored.
+		 */
+		file = fget(fds[1-which]);	/* the 'other' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+		ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+
+		ret = restore_pipe(ctx, file);
+		fput(file);
+		if (ret < 0)
+			return ERR_PTR(ret);
+
+		file = fget(fds[which]);	/* 'this' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+
+		/* get rid of the file descriptors (caller sets that) */
+		sys_close(fds[which]);
+		sys_close(fds[1-which]);
+	} else {
+		return file;
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+ out:
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
+#else
+#define pipe_file_checkpoint  NULL
+#endif /* CONFIG_CHECKPOINT */
+
 /*
  * The file_operations structs are not static because they
  * are also used in linux/fs/fifo.c to do operations on FIFOs.
@@ -825,6 +979,7 @@ const struct file_operations read_pipefifo_fops = {
 	.open		= pipe_read_open,
 	.release	= pipe_read_release,
 	.fasync		= pipe_read_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations write_pipefifo_fops = {
@@ -837,6 +992,7 @@ const struct file_operations write_pipefifo_fops = {
 	.open		= pipe_write_open,
 	.release	= pipe_write_release,
 	.fasync		= pipe_write_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations rdwr_pipefifo_fops = {
@@ -850,6 +1006,7 @@ const struct file_operations rdwr_pipefifo_fops = {
 	.open		= pipe_rdwr_open,
 	.release	= pipe_rdwr_release,
 	.fasync		= pipe_rdwr_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 struct pipe_inode_info * alloc_pipe_info(struct inode *inode)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 23095f2..68f64ae 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -65,6 +65,7 @@ enum {
 	CKPT_HDR_FILE_DESC,
 	CKPT_HDR_FILE_NAME,
 	CKPT_HDR_FILE,
+	CKPT_HDR_PIPE_BUF,
 
 	CKPT_HDR_MM = 401,
 	CKPT_HDR_VMA,
@@ -220,6 +221,7 @@ struct ckpt_hdr_file_desc {
 enum file_type {
 	CKPT_FILE_IGNORE = 0,
 	CKPT_FILE_GENERIC,
+	CKPT_FILE_PIPE,
 	CKPT_FILE_MAX
 };
 
@@ -238,6 +240,11 @@ struct ckpt_hdr_file_generic {
 	struct ckpt_hdr_file common;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_file_pipe {
+	struct ckpt_hdr_file common;
+	__s32 pipe_objref;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index b43a9e0..e526a12 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -154,4 +154,12 @@ int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);
 void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
 
+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern struct file *pipe_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *ptr);
+#endif
+
 #endif
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 46/80] c/r: support for open pipes
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

A pipe is a double-headed inode with a buffer attached to it. We
checkpoint the pipe buffer only once, as soon as we hit one side of
the pipe, regardless of whether it is the read end or the write end.

To checkpoint a file descriptor that refers to a pipe (either end), we
first look up the inode in the hash table. If it is not found, this is
the first encounter of this pipe: besides the file descriptor, we also
(a) save the pipe data, and (b) register the pipe inode in the hash. If
it is found, this is the second encounter of this pipe, namely, we have
hit the other end of the same pipe. In both cases we write the
pipe_objref of the inode.

To restore, create a new pipe and thus have two file pointers (read
and write ends). We only use one of them, depending on which side was
checkpointed first. We register the file pointer of the other end in
the hash table, with the pipe_objref given for this pipe from the
checkpoint, to be used later when the other end arrives. At this point
we also restore the contents of the pipe buffers.

To save the pipe buffer, given a source pipe, use link_pipe() (the
helper behind do_tee()) to clone its contents into a temporary
'struct pipe_inode_info', and then use do_splice_from() to transfer it
directly to the checkpoint image file.

To restore the pipe buffer, with a freshly allocated target pipe, use
do_splice_to() to splice the data directly from the checkpoint image
file into the pipe.

Changelog[v18]:
  - Adjust format of pipe buffer to include the mandatory pre-header
Changelog[v17]:
  - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/files.c             |    7 ++
 fs/pipe.c                      |  157 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    7 ++
 include/linux/pipe_fs_i.h      |    8 ++
 4 files changed, 179 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 3cdfdb3..042f620 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -17,6 +17,7 @@
 #include <linux/file.h>
 #include <linux/fdtable.h>
 #include <linux/fsnotify.h>
+#include <linux/pipe_fs_i.h>
 #include <linux/syscalls.h>
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
@@ -564,6 +565,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_GENERIC,
 		.restore = generic_file_restore,
 	},
+	/* pipes */
+	{
+		.file_name = "PIPE",
+		.file_type = CKPT_FILE_PIPE,
+		.restore = pipe_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 52c4151..30b34a2 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -13,11 +13,13 @@
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/splice.h>
 #include <linux/uio.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/audit.h>
 #include <linux/syscalls.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -809,6 +811,158 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
+{
+	struct pipe_inode_info *pipe;
+	int len, ret = -ENOMEM;
+
+	pipe = alloc_pipe_info(NULL);
+	if (!pipe)
+		return ret;
+
+	pipe->readers = 1;	/* bluff link_pipe() below */
+	len = link_pipe(inode->i_pipe, pipe, INT_MAX, SPLICE_F_NONBLOCK);
+	if (len == -EAGAIN)
+		len = 0;
+	if (len < 0) {
+		ret = len;
+		goto out;
+	}
+
+	ret = ckpt_write_obj_type(ctx, NULL, len, CKPT_HDR_PIPE_BUF);
+	if (ret < 0)
+		goto out;
+
+	ret = do_splice_from(pipe, ctx->file, &ctx->file->f_pos, len, 0);
+	if (ret < 0)
+		goto out;
+	if (ret != len)
+		ret = -EPIPE;  /* can occur due to an error in target file */
+ out:
+	__free_pipe_info(pipe);
+	return ret;
+}
+
+static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_pipe *h;
+	struct inode *inode = file->f_dentry->d_inode;
+	int objref, first, ret;
+
+	objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+	if (objref < 0)
+		return objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_PIPE;
+	h->pipe_objref = objref;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+
+	if (first)
+		ret = checkpoint_pipe(ctx, inode);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int restore_pipe(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct pipe_inode_info *pipe;
+	int len, ret;
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_PIPE_BUF);
+	if (len < 0)
+		return len;
+
+	pipe = file->f_dentry->d_inode->i_pipe;
+	ret = do_splice_to(ctx->file, &ctx->file->f_pos, pipe, len, 0);
+
+	if (ret >= 0 && ret != len)
+		ret = -EPIPE;  /* can occur due to an error in source file */
+
+	return ret;
+}
+
+struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr;
+	struct file *file;
+	int fds[2], which, ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_PIPE)
+		return ERR_PTR(-EINVAL);
+
+	if (h->pipe_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	file = ckpt_obj_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE);
+	/*
+	 * If ckpt_obj_fetch() returned ERR_PTR(-EINVAL), then this is
+	 * the first time we see this pipe so need to restore the
+	 * contents.  Otherwise, use the file pointer and skip forward.
+	 */
+	if (!IS_ERR(file)) {
+		get_file(file);
+	} else if (PTR_ERR(file) == -EINVAL) {
+		/* first encounter of this pipe: create it */
+		ret = do_pipe_flags(fds, 0);
+		if (ret < 0)
+			return ERR_PTR(ret);
+
+		which = (ptr->f_flags & O_WRONLY ? 1 : 0);
+		/*
+		 * Below we return the file corresponding to one side
+		 * of the pipe for our caller to use. Now insert the
+		 * other side of the pipe to the hash, to be picked up
+		 * when that side is restored.
+		 */
+		file = fget(fds[1-which]);	/* the 'other' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+		ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+
+		ret = restore_pipe(ctx, file);
+		fput(file);
+		if (ret < 0)
+			return ERR_PTR(ret);
+
+		file = fget(fds[which]);	/* 'this' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+
+		/* get rid of the file descriptors (caller sets that) */
+		sys_close(fds[which]);
+		sys_close(fds[1-which]);
+	} else {
+		return file;
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+ out:
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
+#else
+#define pipe_file_checkpoint  NULL
+#endif /* CONFIG_CHECKPOINT */
+
 /*
  * The file_operations structs are not static because they
  * are also used in linux/fs/fifo.c to do operations on FIFOs.
@@ -825,6 +979,7 @@ const struct file_operations read_pipefifo_fops = {
 	.open		= pipe_read_open,
 	.release	= pipe_read_release,
 	.fasync		= pipe_read_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations write_pipefifo_fops = {
@@ -837,6 +992,7 @@ const struct file_operations write_pipefifo_fops = {
 	.open		= pipe_write_open,
 	.release	= pipe_write_release,
 	.fasync		= pipe_write_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations rdwr_pipefifo_fops = {
@@ -850,6 +1006,7 @@ const struct file_operations rdwr_pipefifo_fops = {
 	.open		= pipe_rdwr_open,
 	.release	= pipe_rdwr_release,
 	.fasync		= pipe_rdwr_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 struct pipe_inode_info * alloc_pipe_info(struct inode *inode)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 23095f2..68f64ae 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -65,6 +65,7 @@ enum {
 	CKPT_HDR_FILE_DESC,
 	CKPT_HDR_FILE_NAME,
 	CKPT_HDR_FILE,
+	CKPT_HDR_PIPE_BUF,
 
 	CKPT_HDR_MM = 401,
 	CKPT_HDR_VMA,
@@ -220,6 +221,7 @@ struct ckpt_hdr_file_desc {
 enum file_type {
 	CKPT_FILE_IGNORE = 0,
 	CKPT_FILE_GENERIC,
+	CKPT_FILE_PIPE,
 	CKPT_FILE_MAX
 };
 
@@ -238,6 +240,11 @@ struct ckpt_hdr_file_generic {
 	struct ckpt_hdr_file common;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_file_pipe {
+	struct ckpt_hdr_file common;
+	__s32 pipe_objref;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index b43a9e0..e526a12 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -154,4 +154,12 @@ int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);
 void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
 
+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern struct file *pipe_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *ptr);
+#endif
+
 #endif
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 47/80] c/r: checkpoint and restore FIFOs
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (45 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 46/80] c/r: support for open pipes Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 48/80] c/r: make ckpt_may_checkpoint_task() check each namespace individually Oren Laadan
                     ` (34 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

FIFOs are almost like pipes.

Checkpoint additionally saves the FIFO pathname. The first time a FIFO
is found, it also assigns an @objref and dumps the contents of the
buffers.

To restore, use the @objref only to determine whether a particular
FIFO has already been restored earlier. Note that it ignores the file
pointer that matches that @objref (unlike with pipes, where that file
corresponds to the other end of the pipe). Instead, it creates a new
FIFO using the saved pathname.
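
Reopening a FIFO by pathname has one pitfall: a plain O_RDONLY or
O_WRONLY open waits for a peer. The patch sidesteps this by always
opening with O_RDWR and fixing up the flags afterwards. The userspace
sketch below shows only the non-blocking open; fifo_reopen() is a
hypothetical name, the in-kernel f_flags/f_mode fixup has no userspace
equivalent and is left out, and the O_RDWR behaviour is Linux-specific
(POSIX leaves it undefined for FIFOs).

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: reopen a FIFO by pathname for restore.
 * 'saved_flags' are the open flags recorded at checkpoint; the access
 * mode is forced to O_RDWR so that open() does not wait for a peer.
 * Returns a file descriptor, or -1 with errno set.
 */
static int fifo_reopen(const char *path, int saved_flags)
{
	int acc = saved_flags & O_ACCMODE;

	/* reject access modes that no valid open could have produced */
	if (acc != O_RDONLY && acc != O_WRONLY && acc != O_RDWR) {
		errno = EINVAL;
		return -1;
	}

	return open(path, (saved_flags & ~O_ACCMODE) | O_RDWR);
}
```

With a saved mode of O_WRONLY and no reader present, the call returns
at once instead of sleeping in open().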

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/files.c             |    6 +++
 fs/pipe.c                      |   82 +++++++++++++++++++++++++++++++++++++++-
 include/linux/checkpoint_hdr.h |    1 +
 include/linux/pipe_fs_i.h      |    2 +
 4 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 042f620..190c95b 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -571,6 +571,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_PIPE,
 		.restore = pipe_file_restore,
 	},
+	/* fifo */
+	{
+		.file_name = "FIFO",
+		.file_type = CKPT_FILE_FIFO,
+		.restore = fifo_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 30b34a2..65ad44e 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -811,6 +811,8 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static struct vfsmount *pipe_mnt __read_mostly;
+
 #ifdef CONFIG_CHECKPOINT
 static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
 {
@@ -858,7 +860,11 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 	if (!h)
 		return -ENOMEM;
 
-	h->common.f_type = CKPT_FILE_PIPE;
+	/* fifo and pipe are similar at checkpoint, differ on restore */
+	if (inode->i_sb == pipe_mnt->mnt_sb)
+		h->common.f_type = CKPT_FILE_PIPE;
+	else
+		h->common.f_type = CKPT_FILE_FIFO;
 	h->pipe_objref = objref;
 
 	ret = checkpoint_file_common(ctx, file, &h->common);
@@ -868,6 +874,13 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 	if (ret < 0)
 		goto out;
 
+	/* FIFO also needs a file name */
+	if (h->common.f_type == CKPT_FILE_FIFO) {
+		ret = checkpoint_fname(ctx, &file->f_path, &ctx->fs_mnt);
+		if (ret < 0)
+			goto out;
+	}
+
 	if (first)
 		ret = checkpoint_pipe(ctx, inode);
  out:
@@ -959,8 +972,74 @@ struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
 
 	return file;
 }
+
+struct file *fifo_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr;
+	struct file *file;
+	int first, ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_FIFO)
+		return ERR_PTR(-EINVAL);
+
+	if (h->pipe_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * If ckpt_obj_fetch() returned ERR_PTR(-EINVAL), this is the
+	 * first time for this fifo.
+	 */
+	file = ckpt_obj_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE);
+	if (!IS_ERR(file))
+		first = 0;
+	else if (PTR_ERR(file) == -EINVAL)
+		first = 1;
+	else
+		return file;
+
+	/*
+	 * To avoid blocking, always open the fifo with O_RDWR;
+	 * then fix flags below.
+	 */
+	file = restore_open_fname(ctx, (ptr->f_flags & ~O_ACCMODE) | O_RDWR);
+	if (IS_ERR(file))
+		return file;
+
+	if ((ptr->f_flags & O_ACCMODE) == O_RDONLY) {
+		file->f_flags = (file->f_flags & ~O_ACCMODE) | O_RDONLY;
+		file->f_mode &= ~FMODE_WRITE;
+	} else if ((ptr->f_flags & O_ACCMODE) == O_WRONLY) {
+		file->f_flags = (file->f_flags & ~O_ACCMODE) | O_WRONLY;
+		file->f_mode &= ~FMODE_READ;
+	} else if ((ptr->f_flags & O_ACCMODE) != O_RDWR) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* first time: add to objhash and restore fifo's contents */
+	if (first) {
+		ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+
+		ret = restore_pipe(ctx, file);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+ out:
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
 #else
 #define pipe_file_checkpoint  NULL
+#define fifo_file_checkpoint  NULL
 #endif /* CONFIG_CHECKPOINT */
 
 /*
@@ -1043,7 +1122,6 @@ void free_pipe_info(struct inode *inode)
 	inode->i_pipe = NULL;
 }
 
-static struct vfsmount *pipe_mnt __read_mostly;
 static int pipefs_delete_dentry(struct dentry *dentry)
 {
 	/*
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 68f64ae..7e64b77 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -222,6 +222,7 @@ enum file_type {
 	CKPT_FILE_IGNORE = 0,
 	CKPT_FILE_GENERIC,
 	CKPT_FILE_PIPE,
+	CKPT_FILE_FIFO,
 	CKPT_FILE_MAX
 };
 
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index e526a12..596403e 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -160,6 +160,8 @@ struct ckpt_ctx;
 struct ckpt_hdr_file;
 extern struct file *pipe_file_restore(struct ckpt_ctx *ctx,
 				      struct ckpt_hdr_file *ptr);
+extern struct file *fifo_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *ptr);
 #endif
 
 #endif
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 47/80] c/r: checkpoint and restore FIFOs
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

FIFOs are almost like pipes.

Checkpoint additionally saves the FIFO pathname. The first time a FIFO
is found, it also assigns an @objref and dumps the contents of the
buffers.

To restore, use the @objref only to determine whether a particular
FIFO has already been restored earlier. Note that it ignores the file
pointer that matches that @objref (unlike with pipes, where that file
corresponds to the other end of the pipe). Instead, it creates a new
FIFO using the saved pathname.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/files.c             |    6 +++
 fs/pipe.c                      |   82 +++++++++++++++++++++++++++++++++++++++-
 include/linux/checkpoint_hdr.h |    1 +
 include/linux/pipe_fs_i.h      |    2 +
 4 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 042f620..190c95b 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -571,6 +571,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_PIPE,
 		.restore = pipe_file_restore,
 	},
+	/* fifo */
+	{
+		.file_name = "FIFO",
+		.file_type = CKPT_FILE_FIFO,
+		.restore = fifo_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 30b34a2..65ad44e 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -811,6 +811,8 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static struct vfsmount *pipe_mnt __read_mostly;
+
 #ifdef CONFIG_CHECKPOINT
 static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
 {
@@ -858,7 +860,11 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 	if (!h)
 		return -ENOMEM;
 
-	h->common.f_type = CKPT_FILE_PIPE;
+	/* fifo and pipe are similar at checkpoint, differ on restore */
+	if (inode->i_sb == pipe_mnt->mnt_sb)
+		h->common.f_type = CKPT_FILE_PIPE;
+	else
+		h->common.f_type = CKPT_FILE_FIFO;
 	h->pipe_objref = objref;
 
 	ret = checkpoint_file_common(ctx, file, &h->common);
@@ -868,6 +874,13 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 	if (ret < 0)
 		goto out;
 
+	/* FIFO also needs a file name */
+	if (h->common.f_type == CKPT_FILE_FIFO) {
+		ret = checkpoint_fname(ctx, &file->f_path, &ctx->fs_mnt);
+		if (ret < 0)
+			goto out;
+	}
+
 	if (first)
 		ret = checkpoint_pipe(ctx, inode);
  out:
@@ -959,8 +972,74 @@ struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
 
 	return file;
 }
+
+struct file *fifo_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr;
+	struct file *file;
+	int first, ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_FIFO)
+		return ERR_PTR(-EINVAL);
+
+	if (h->pipe_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * If ckpt_obj_fetch() returned ERR_PTR(-EINVAL), this is the
+	 * first time for this fifo.
+	 */
+	file = ckpt_obj_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE);
+	if (!IS_ERR(file))
+		first = 0;
+	else if (PTR_ERR(file) == -EINVAL)
+		first = 1;
+	else
+		return file;
+
+	/*
+	 * To avoid blocking, always open the fifo with O_RDWR;
+	 * then fix flags below.
+	 */
+	file = restore_open_fname(ctx, (ptr->f_flags & ~O_ACCMODE) | O_RDWR);
+	if (IS_ERR(file))
+		return file;
+
+	if ((ptr->f_flags & O_ACCMODE) == O_RDONLY) {
+		file->f_flags = (file->f_flags & ~O_ACCMODE) | O_RDONLY;
+		file->f_mode &= ~FMODE_WRITE;
+	} else if ((ptr->f_flags & O_ACCMODE) == O_WRONLY) {
+		file->f_flags = (file->f_flags & ~O_ACCMODE) | O_WRONLY;
+		file->f_mode &= ~FMODE_READ;
+	} else if ((ptr->f_flags & O_ACCMODE) != O_RDWR) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* first time: add to objhash and restore fifo's contents */
+	if (first) {
+		ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+
+		ret = restore_pipe(ctx, file);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+ out:
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
 #else
 #define pipe_file_checkpoint  NULL
+#define fifo_file_checkpoint  NULL
 #endif /* CONFIG_CHECKPOINT */
 
 /*
@@ -1043,7 +1122,6 @@ void free_pipe_info(struct inode *inode)
 	inode->i_pipe = NULL;
 }
 
-static struct vfsmount *pipe_mnt __read_mostly;
 static int pipefs_delete_dentry(struct dentry *dentry)
 {
 	/*
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 68f64ae..7e64b77 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -222,6 +222,7 @@ enum file_type {
 	CKPT_FILE_IGNORE = 0,
 	CKPT_FILE_GENERIC,
 	CKPT_FILE_PIPE,
+	CKPT_FILE_FIFO,
 	CKPT_FILE_MAX
 };
 
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index e526a12..596403e 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -160,6 +160,8 @@ struct ckpt_ctx;
 struct ckpt_hdr_file;
 extern struct file *pipe_file_restore(struct ckpt_ctx *ctx,
 				      struct ckpt_hdr_file *ptr);
+extern struct file *fifo_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *ptr);
 #endif
 
 #endif
-- 
1.6.0.4
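
For readers following the restore path: fifo_file_restore() above opens
the FIFO with O_RDWR so that the open cannot block waiting for a peer,
then narrows the access mode back to the checkpointed one. Below is a
minimal userspace sketch of only that flag arithmetic. The sk_* names
and the SK_MODE_* bits are hypothetical stand-ins (for struct file and
FMODE_READ/FMODE_WRITE); this is an illustration, not the kernel code.

```c
#include <assert.h>
#include <fcntl.h>	/* O_ACCMODE, O_RDONLY, O_WRONLY, O_RDWR */
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's FMODE_READ / FMODE_WRITE. */
#define SK_MODE_READ	0x1
#define SK_MODE_WRITE	0x2

/* Hypothetical model of the two struct file fields that get patched. */
struct sk_file {
	int f_flags;
	int f_mode;
};

/* Flags for the temporary open: keep every bit except the access mode. */
int sk_open_flags(int saved_flags)
{
	return (saved_flags & ~O_ACCMODE) | O_RDWR;
}

/*
 * Narrow a file that was opened O_RDWR back to the saved access mode.
 * Returns 0 on success, -1 if the saved access mode is not valid.
 */
int sk_fix_access_mode(struct sk_file *f, int saved_flags)
{
	assert(f != NULL);

	switch (saved_flags & O_ACCMODE) {
	case O_RDONLY:
		f->f_flags = (f->f_flags & ~O_ACCMODE) | O_RDONLY;
		f->f_mode &= ~SK_MODE_WRITE;
		return 0;
	case O_WRONLY:
		f->f_flags = (f->f_flags & ~O_ACCMODE) | O_WRONLY;
		f->f_mode &= ~SK_MODE_READ;
		return 0;
	case O_RDWR:
		return 0;	/* already open read-write, nothing to do */
	default:
		return -1;
	}
}
```

A caller would compute sk_open_flags(saved) for the open itself and
then apply sk_fix_access_mode() to the result, so that bits outside the
access mode (O_NONBLOCK and so on) are carried through unchanged.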


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 48/80] c/r: make ckpt_may_checkpoint_task() check each namespace individually
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (46 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 47/80] c/r: checkpoint and restore FIFOs Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 49/80] c/r: support for UTS namespace Oren Laadan
                     ` (33 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

For a given namespace type, say XXX, if a checkpoint taken on a
CONFIG_XXX_NS system is restarted on a !CONFIG_XXX_NS system, then
ensure that:

1) The global settings of the global (init) namespace do not get
overwritten. Creating new objects in that namespace is ok, as long as
the requested identifier is available.

2) All restarting tasks use a single namespace - because it is
impossible to create additional namespaces to accommodate what had
been checkpointed.
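
As a rough illustration of point 2: in this patch do_restore_ns() hands
back the restarting task's current nsproxy with an extra reference, and
restore_task_ns() switches a task only if it is not already using that
nsproxy. A minimal userspace model follows; the sk_* names are
hypothetical and a plain int stands in for the kernel's atomic
reference count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a reference-counted nsproxy. */
struct sk_nsproxy {
	int count;
};

/* Hypothetical model of a task holding one nsproxy reference. */
struct sk_task {
	struct sk_nsproxy *nsproxy;
};

/*
 * Model of the restore step: no new namespace is created; the
 * caller's nsproxy is reused and one reference is taken on behalf
 * of the object table.
 */
struct sk_nsproxy *sk_restore_ns(struct sk_task *cur)
{
	assert(cur != NULL && cur->nsproxy != NULL);
	cur->nsproxy->count++;
	return cur->nsproxy;
}

/*
 * Model of attaching a restored nsproxy to a task.
 * Returns 1 if the task was switched, 0 if it already used it.
 */
int sk_attach_ns(struct sk_task *t, struct sk_nsproxy *ns)
{
	struct sk_nsproxy *old;

	assert(t != NULL && ns != NULL);
	old = t->nsproxy;
	if (ns == old)
		return 0;

	ns->count++;		/* reference held by the task */
	t->nsproxy = ns;
	old->count--;		/* drop the task's old reference */
	return 1;
}
```

In this model every task that attaches the restored object ends up
pointing at the same nsproxy, which is the single-namespace outcome
described above.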

Original patch introducing nsproxy c/r by Dan Smith <danms@us.ibm.com>

Changelog[v18]:
  - Add a few more ckpt_write_err()s
Changelog[v17]:
  - Only collect sub-objects of struct nsproxy once.
  - Restore namespace pieces directly instead of using sys_unshare()
  - Proper handling of restart from namespace(s) without namespace(s)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c        |   26 +++++++++++--
 checkpoint/objhash.c           |   28 ++++++++++++++
 checkpoint/process.c           |   81 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h     |    5 ++
 include/linux/checkpoint_hdr.h |   13 ++++++
 kernel/nsproxy.c               |   76 +++++++++++++++++++++++++++++++++++++
 6 files changed, 225 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index f907485..52d7a92 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -378,6 +378,8 @@ static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
 static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct task_struct *root = ctx->root_task;
+	struct nsproxy *nsproxy;
+	int ret = 0;
 
 	ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
 
@@ -417,11 +419,27 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -EINVAL;
 	}
 
-	/* FIX: change this when namespaces are added */
-	if (task_nsproxy(t) != ctx->root_nsproxy)
-		return -EPERM;
+	rcu_read_lock();
+	nsproxy = task_nsproxy(t);
+	if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns)
+		ret = -EPERM;
+	if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
+		ret = -EPERM;
+	if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns) {
+		__ckpt_write_err(ctx, "TE", "bad mnt_ns", -EPERM);
+		ret = -EPERM;
+	}
+	if (nsproxy->pid_ns != ctx->root_nsproxy->pid_ns) {
+		__ckpt_write_err(ctx, "TE", "bad pid_ns", -EPERM);
+		ret = -EPERM;
+	}
+	if (nsproxy->net_ns != ctx->root_nsproxy->net_ns) {
+		__ckpt_write_err(ctx, "TE", "bad net_ns", -EPERM);
+		ret = -EPERM;
+	}
+	rcu_read_unlock();
 
-	return 0;
+	return ret;
 }
 
 #define CKPT_HDR_PIDS_CHUNK	256
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index be9411f..a8a99e7 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -122,6 +122,22 @@ static int obj_mm_users(void *ptr)
 	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
 }
 
+static int obj_ns_grab(void *ptr)
+{
+	get_nsproxy((struct nsproxy *) ptr);
+	return 0;
+}
+
+static void obj_ns_drop(void *ptr, int lastref)
+{
+	put_nsproxy((struct nsproxy *) ptr);
+}
+
+static int obj_ns_users(void *ptr)
+{
+	return atomic_read(&((struct nsproxy *) ptr)->count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -167,6 +183,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
+	/* ns object */
+	{
+		.obj_name = "NSPROXY",
+		.obj_type = CKPT_OBJ_NS,
+		.ref_drop = obj_ns_drop,
+		.ref_grab = obj_ns_grab,
+		.ref_users = obj_ns_users,
+		.checkpoint = checkpoint_ns,
+		.restore = restore_ns,
+	},
 };
 
 
@@ -572,6 +598,8 @@ int ckpt_obj_contained(struct ckpt_ctx *ctx)
 
 	/* account for ctx->file reference (if in the table already) */
 	ckpt_obj_users_inc(ctx, ctx->file, 1);
+	/* account for ctx->root_nsproxy reference (if in the table already) */
+	ckpt_obj_users_inc(ctx, ctx->root_nsproxy, 1);
 
 	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
 		if (!obj->ops->ref_users)
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 1788a39..3444aff 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -12,6 +12,7 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/posix-timers.h>
 #include <linux/futex.h>
 #include <linux/compat.h>
@@ -104,6 +105,35 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_ns(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_ns *h;
+	struct nsproxy *nsproxy;
+	int ns_objref;
+	int ret;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(t);
+	get_nsproxy(nsproxy);
+	rcu_read_unlock();
+
+	ns_objref = checkpoint_obj(ctx, nsproxy, CKPT_OBJ_NS);
+	put_nsproxy(nsproxy);
+
+	ckpt_debug("nsproxy: objref %d\n", ns_objref);
+	if (ns_objref < 0)
+		return ns_objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_NS);
+	if (!h)
+		return -ENOMEM;
+	h->ns_objref = ns_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_task_objs *h;
@@ -111,6 +141,19 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	int mm_objref;
 	int ret;
 
+	/*
+	 * Shared objects may have dependencies among them: task->mm
+	 * depends on task->nsproxy (by ipc_ns). Therefore first save
+	 * the namespaces, and then the remaining shared objects.
+	 * During restart a task will already have its namespaces
+	 * restored when it gets to restore, e.g. its memory.
+	 */
+
+	ret = checkpoint_task_ns(ctx, t);
+	ckpt_debug("ns: objref %d\n", ret);
+	if (ret < 0)
+		return ret;
+
 	files_objref = checkpoint_obj_file_table(ctx, t);
 	ckpt_debug("files: objref %d\n", files_objref);
 	if (files_objref < 0) {
@@ -286,6 +329,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
+	ret = ckpt_collect_ns(ctx, t);
+	if (ret < 0)
+		return ret;
 	ret = ckpt_collect_file_table(ctx, t);
 	if (ret < 0)
 		return ret;
@@ -361,11 +407,46 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_ns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task_ns *h;
+	struct nsproxy *nsproxy;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_NS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	nsproxy = ckpt_obj_fetch(ctx, h->ns_objref, CKPT_OBJ_NS);
+	if (IS_ERR(nsproxy)) {
+		ret = PTR_ERR(nsproxy);
+		goto out;
+	}
+
+	if (nsproxy != task_nsproxy(current)) {
+		get_nsproxy(nsproxy);
+		switch_task_namespaces(current, nsproxy);
+	}
+ out:
+	ckpt_debug("nsproxy: ret %d (%p)\n", ret, task_nsproxy(current));
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 static int restore_task_objs(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_task_objs *h;
 	int ret;
 
+	/*
+	 * Namespaces come first, because ->mm depends on ->nsproxy,
+	 * and because shared objects are restored before they are
+	 * referenced. See comment in checkpoint_task_objs.
+	 */
+	ret = restore_task_ns(ctx);
+	if (ret < 0)
+		return ret;
+
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
 	if (IS_ERR(h))
 		return PTR_ERR(h);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2770fc2..e68afab 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -142,6 +142,11 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* namespaces */
+extern int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_ns(struct ckpt_ctx *ctx);
+
 /* file table */
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 7e64b77..5a5916d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -54,10 +54,12 @@ enum {
 
 	CKPT_HDR_TREE = 101,
 	CKPT_HDR_TASK,
+	CKPT_HDR_TASK_NS,
 	CKPT_HDR_TASK_OBJS,
 	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
+	CKPT_HDR_NS,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -96,6 +98,7 @@ enum obj_type {
 	CKPT_OBJ_FILE_TABLE,
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
+	CKPT_OBJ_NS,
 	CKPT_OBJ_MAX
 };
 
@@ -175,6 +178,16 @@ struct ckpt_hdr_task {
 	__u64 robust_futex_list; /* a __user ptr */
 } __attribute__((aligned(8)));
 
+/* namespaces */
+struct ckpt_hdr_task_ns {
+	struct ckpt_hdr h;
+	__s32 ns_objref;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ns {
+	struct ckpt_hdr h;
+} __attribute__((aligned(8)));
+
 /* task's shared resources */
 struct ckpt_hdr_task_objs {
 	struct ckpt_hdr h;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 09b4ff9..54cb987 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -21,6 +21,7 @@
 #include <linux/pid_namespace.h>
 #include <net/net_namespace.h>
 #include <linux/ipc_namespace.h>
+#include <linux/checkpoint.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -221,6 +222,81 @@ void exit_task_namespaces(struct task_struct *p)
 	switch_task_namespaces(p, NULL);
 }
 
+#ifdef CONFIG_CHECKPOINT
+int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct nsproxy *nsproxy;
+	int exists;
+	int ret;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(t);
+	if (nsproxy)
+		get_nsproxy(nsproxy);
+	rcu_read_unlock();
+
+	if (!nsproxy)
+		return 0;
+
+	/* if already exists, don't proceed inside the struct */
+	exists = ckpt_obj_lookup(ctx, nsproxy, CKPT_OBJ_NS);
+
+	ret = ckpt_obj_collect(ctx, nsproxy, CKPT_OBJ_NS);
+	if (ret < 0 || exists)
+		goto out;
+
+	/* TODO: collect other namespaces here */
+ out:
+	put_nsproxy(nsproxy);
+	return ret;
+}
+
+static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
+{
+	struct ckpt_hdr_ns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NS);
+	if (!h)
+		return -ENOMEM;
+
+	/* TODO: Write other namespaces here */
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+
+int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_ns(ctx, (struct nsproxy *) ptr);
+}
+
+static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_ns *h;
+	struct nsproxy *nsproxy = NULL;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
+	if (IS_ERR(h))
+		return (struct nsproxy *) h;
+
+	nsproxy = current->nsproxy;
+	get_nsproxy(nsproxy);
+
+	/* TODO: add more namespaces here */
+
+	ckpt_hdr_put(ctx, h);
+	return nsproxy;
+}
+
+void *restore_ns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_ns(ctx);
+}
+#endif /* CONFIG_CHECKPOINT */
+
 static int __init nsproxy_cache_init(void)
 {
 	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 48/80] c/r: make ckpt_may_checkpoint_task() check each namespace individually
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

For a given namespace type, say XXX, if a checkpoint taken on a
CONFIG_XXX_NS system is restarted on a !CONFIG_XXX_NS system, then
ensure that:

1) The global settings of the global (init) namespace do not get
overwritten. Creating new objects in that namespace is ok, as long as
the requested identifier is available.

2) All restarting tasks use a single namespace - because it is
impossible to create additional namespaces to accommodate what had
been checkpointed.

Original patch introducing nsproxy c/r by Dan Smith <danms@us.ibm.com>

Changelog[v18]:
  - Add a few more ckpt_write_err()s
Changelog[v17]:
  - Only collect sub-objects of struct nsproxy once.
  - Restore namespace pieces directly instead of using sys_unshare()
  - Proper handling of restart from namespace(s) without namespace(s)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c        |   26 +++++++++++--
 checkpoint/objhash.c           |   28 ++++++++++++++
 checkpoint/process.c           |   81 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h     |    5 ++
 include/linux/checkpoint_hdr.h |   13 ++++++
 kernel/nsproxy.c               |   76 +++++++++++++++++++++++++++++++++++++
 6 files changed, 225 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index f907485..52d7a92 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -378,6 +378,8 @@ static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
 static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct task_struct *root = ctx->root_task;
+	struct nsproxy *nsproxy;
+	int ret = 0;
 
 	ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
 
@@ -417,11 +419,27 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -EINVAL;
 	}
 
-	/* FIX: change this when namespaces are added */
-	if (task_nsproxy(t) != ctx->root_nsproxy)
-		return -EPERM;
+	rcu_read_lock();
+	nsproxy = task_nsproxy(t);
+	if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns)
+		ret = -EPERM;
+	if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
+		ret = -EPERM;
+	if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns) {
+		__ckpt_write_err(ctx, "TE", "bad mnt_ns", -EPERM);
+		ret = -EPERM;
+	}
+	if (nsproxy->pid_ns != ctx->root_nsproxy->pid_ns) {
+		__ckpt_write_err(ctx, "TE", "bad pid_ns", -EPERM);
+		ret = -EPERM;
+	}
+	if (nsproxy->net_ns != ctx->root_nsproxy->net_ns) {
+		__ckpt_write_err(ctx, "TE", "bad net_ns", -EPERM);
+		ret = -EPERM;
+	}
+	rcu_read_unlock();
 
-	return 0;
+	return ret;
 }
 
 #define CKPT_HDR_PIDS_CHUNK	256
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index be9411f..a8a99e7 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -122,6 +122,22 @@ static int obj_mm_users(void *ptr)
 	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
 }
 
+static int obj_ns_grab(void *ptr)
+{
+	get_nsproxy((struct nsproxy *) ptr);
+	return 0;
+}
+
+static void obj_ns_drop(void *ptr, int lastref)
+{
+	put_nsproxy((struct nsproxy *) ptr);
+}
+
+static int obj_ns_users(void *ptr)
+{
+	return atomic_read(&((struct nsproxy *) ptr)->count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -167,6 +183,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
+	/* ns object */
+	{
+		.obj_name = "NSPROXY",
+		.obj_type = CKPT_OBJ_NS,
+		.ref_drop = obj_ns_drop,
+		.ref_grab = obj_ns_grab,
+		.ref_users = obj_ns_users,
+		.checkpoint = checkpoint_ns,
+		.restore = restore_ns,
+	},
 };
 
 
@@ -572,6 +598,8 @@ int ckpt_obj_contained(struct ckpt_ctx *ctx)
 
 	/* account for ctx->file reference (if in the table already) */
 	ckpt_obj_users_inc(ctx, ctx->file, 1);
+	/* account for ctx->root_nsproxy reference (if in the table already) */
+	ckpt_obj_users_inc(ctx, ctx->root_nsproxy, 1);
 
 	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
 		if (!obj->ops->ref_users)
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 1788a39..3444aff 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -12,6 +12,7 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/posix-timers.h>
 #include <linux/futex.h>
 #include <linux/compat.h>
@@ -104,6 +105,35 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_ns(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_ns *h;
+	struct nsproxy *nsproxy;
+	int ns_objref;
+	int ret;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(t);
+	get_nsproxy(nsproxy);
+	rcu_read_unlock();
+
+	ns_objref = checkpoint_obj(ctx, nsproxy, CKPT_OBJ_NS);
+	put_nsproxy(nsproxy);
+
+	ckpt_debug("nsproxy: objref %d\n", ns_objref);
+	if (ns_objref < 0)
+		return ns_objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_NS);
+	if (!h)
+		return -ENOMEM;
+	h->ns_objref = ns_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_task_objs *h;
@@ -111,6 +141,19 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	int mm_objref;
 	int ret;
 
+	/*
+	 * Shared objects may have dependencies among them: task->mm
+	 * depends on task->nsproxy (by ipc_ns). Therefore first save
+	 * the namespaces, and then the remaining shared objects.
+	 * By the time a task restores, e.g., its memory during restart,
+	 * its namespaces have already been restored.
+	 */
+
+	ret = checkpoint_task_ns(ctx, t);
+	ckpt_debug("ns: objref %d\n", ret);
+	if (ret < 0)
+		return ret;
+
 	files_objref = checkpoint_obj_file_table(ctx, t);
 	ckpt_debug("files: objref %d\n", files_objref);
 	if (files_objref < 0) {
@@ -286,6 +329,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
+	ret = ckpt_collect_ns(ctx, t);
+	if (ret < 0)
+		return ret;
 	ret = ckpt_collect_file_table(ctx, t);
 	if (ret < 0)
 		return ret;
@@ -361,11 +407,46 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_ns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task_ns *h;
+	struct nsproxy *nsproxy;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_NS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	nsproxy = ckpt_obj_fetch(ctx, h->ns_objref, CKPT_OBJ_NS);
+	if (IS_ERR(nsproxy)) {
+		ret = PTR_ERR(nsproxy);
+		goto out;
+	}
+
+	if (nsproxy != task_nsproxy(current)) {
+		get_nsproxy(nsproxy);
+		switch_task_namespaces(current, nsproxy);
+	}
+ out:
+	ckpt_debug("nsproxy: ret %d (%p)\n", ret, task_nsproxy(current));
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 static int restore_task_objs(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_task_objs *h;
 	int ret;
 
+	/*
+	 * Namespaces come first, because ->mm depends on ->nsproxy,
+	 * and because shared objects are restored before they are
+	 * referenced. See comment in checkpoint_task_objs.
+	 */
+	ret = restore_task_ns(ctx);
+	if (ret < 0)
+		return ret;
+
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
 	if (IS_ERR(h))
 		return PTR_ERR(h);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2770fc2..e68afab 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -142,6 +142,11 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* namespaces */
+extern int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_ns(struct ckpt_ctx *ctx);
+
 /* file table */
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 7e64b77..5a5916d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -54,10 +54,12 @@ enum {
 
 	CKPT_HDR_TREE = 101,
 	CKPT_HDR_TASK,
+	CKPT_HDR_TASK_NS,
 	CKPT_HDR_TASK_OBJS,
 	CKPT_HDR_RESTART_BLOCK,
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
+	CKPT_HDR_NS,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -96,6 +98,7 @@ enum obj_type {
 	CKPT_OBJ_FILE_TABLE,
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
+	CKPT_OBJ_NS,
 	CKPT_OBJ_MAX
 };
 
@@ -175,6 +178,16 @@ struct ckpt_hdr_task {
 	__u64 robust_futex_list; /* a __user ptr */
 } __attribute__((aligned(8)));
 
+/* namespaces */
+struct ckpt_hdr_task_ns {
+	struct ckpt_hdr h;
+	__s32 ns_objref;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ns {
+	struct ckpt_hdr h;
+} __attribute__((aligned(8)));
+
 /* task's shared resources */
 struct ckpt_hdr_task_objs {
 	struct ckpt_hdr h;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 09b4ff9..54cb987 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -21,6 +21,7 @@
 #include <linux/pid_namespace.h>
 #include <net/net_namespace.h>
 #include <linux/ipc_namespace.h>
+#include <linux/checkpoint.h>
 
 static struct kmem_cache *nsproxy_cachep;
 
@@ -221,6 +222,81 @@ void exit_task_namespaces(struct task_struct *p)
 	switch_task_namespaces(p, NULL);
 }
 
+#ifdef CONFIG_CHECKPOINT
+int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct nsproxy *nsproxy;
+	int exists;
+	int ret;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(t);
+	if (nsproxy)
+		get_nsproxy(nsproxy);
+	rcu_read_unlock();
+
+	if (!nsproxy)
+		return 0;
+
+	/* if already exists, don't proceed inside the struct */
+	exists = ckpt_obj_lookup(ctx, nsproxy, CKPT_OBJ_NS);
+
+	ret = ckpt_obj_collect(ctx, nsproxy, CKPT_OBJ_NS);
+	if (ret < 0 || exists)
+		goto out;
+
+	/* TODO: collect other namespaces here */
+ out:
+	put_nsproxy(nsproxy);
+	return ret;
+}
+
+static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
+{
+	struct ckpt_hdr_ns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NS);
+	if (!h)
+		return -ENOMEM;
+
+	/* TODO: Write other namespaces here */
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+
+int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_ns(ctx, (struct nsproxy *) ptr);
+}
+
+static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_ns *h;
+	struct nsproxy *nsproxy = NULL;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
+	if (IS_ERR(h))
+		return (struct nsproxy *) h;
+
+	nsproxy = current->nsproxy;
+	get_nsproxy(nsproxy);
+
+	/* TODO: add more namespaces here */
+
+	ckpt_hdr_put(ctx, h);
+	return nsproxy;
+}
+
+void *restore_ns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_ns(ctx);
+}
+#endif /* CONFIG_CHECKPOINT */
+
 static int __init nsproxy_cache_init(void)
 {
 	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
-- 
1.6.0.4



* [PATCH v18 49/80] c/r: support for UTS namespace
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (47 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 48/80] c/r: make ckpt_may_checkpoint_task() check each namespace individually Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 50/80] c/r (ipc): allow allocation of a desired ipc identifier Oren Laadan
                     ` (32 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Dan Smith, Pavel Emelyanov

From: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

This patch adds a "phase" of checkpoint that saves out information about any
namespaces the task(s) may have.  It does so by tracking the namespace objects
of the tasks, so that later tasks sharing a namespace with an earlier task are
properly referenced in the checkpoint stream.
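
The tracking described above can be pictured with a small userspace model.
This is only a sketch: the table, its size, and every demo_* name are
hypothetical and are not the objhash API used by the patch. The first time
an object is seen it gets a new objref and its state would be written out;
later tasks that share the object only record the objref.

```c
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's objhash implementation. */
#define DEMO_MAX_OBJS 16

struct demo_objtable {
	const void *ptr[DEMO_MAX_OBJS];	/* objects seen so far */
	int count;			/* number of slots in use */
};

/*
 * Return the 1-based objref for @ptr, or -1 if the table is full.
 * *is_new is set to 1 the first time the object is seen (its state
 * would be written then) and to 0 on every later lookup (only the
 * reference would be written).
 */
static int demo_objref(struct demo_objtable *t, const void *ptr, int *is_new)
{
	int i;

	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*is_new = 0;
			return i + 1;
		}
	}
	if (t->count == DEMO_MAX_OBJS)
		return -1;
	t->ptr[t->count++] = ptr;
	*is_new = 1;
	return t->count;
}
```

A linear scan keeps the model short; a real table would hash the pointer.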

Changes[v17]:
  - Collect nsproxy->uts_ns
  - Save uts string lengths once in ckpt_hdr_const
  - Save and restore all fields of uts-ns
  - Don't overwrite global uts-ns if !CONFIG_UTS_NS
  - Replace sys_unshare() with create_uts_ns()
  - Take uts_sem around access to uts data
Changes:
  - Remove the kernel restore path
  - Punt on nested namespaces
  - Use __NEW_UTS_LEN in nodename and domainname buffers
  - Add a note to Documentation/checkpoint/internals.txt to indicate where
    in the save/restore process the UTS information is kept
  - Store (and track) the objref of the namespace itself instead of the
    nsproxy (based on comments from Dave on IRC)
  - Remove explicit check for non-root nsproxy
  - Store the nodename and domainname lengths and use ckpt_write_string()
    to store the actual name strings
  - Catch failure of ckpt_obj_add_ptr() in ckpt_write_namespaces()
  - Remove "types" bitfield and use the "is this new" flag to determine
    whether or not we should write out a new ns descriptor
  - Replace kernel restore path
  - Move the namespace information to be directly after the task
    information record
  - Update Documentation to reflect new location of namespace info
  - Support checkpoint and restart of nested UTS namespaces
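
As a rough illustration of the "save uts string lengths once" item above,
the following userspace sketch records the sizes of the uts fields at
checkpoint time and compares them at restart. It is an assumption-laden
model: struct demo_const and the demo_* helpers are hypothetical, it uses
the userspace struct utsname instead of the kernel's struct new_utsname,
and it leaves out domainname because that field is not portable in
userspace.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/utsname.h>

/* Size of one field of struct utsname, without needing an instance. */
#define DEMO_UTS_FIELD_LEN(f) sizeof(((struct utsname *)0)->f)

/* Hypothetical stand-in for the uts part of the image's constants. */
struct demo_const {
	uint16_t uts_sysname_len;
	uint16_t uts_nodename_len;
	uint16_t uts_release_len;
	uint16_t uts_version_len;
	uint16_t uts_machine_len;
};

/* Checkpoint side: record the local field sizes once per image. */
static void demo_fill_const(struct demo_const *c)
{
	c->uts_sysname_len = DEMO_UTS_FIELD_LEN(sysname);
	c->uts_nodename_len = DEMO_UTS_FIELD_LEN(nodename);
	c->uts_release_len = DEMO_UTS_FIELD_LEN(release);
	c->uts_version_len = DEMO_UTS_FIELD_LEN(version);
	c->uts_machine_len = DEMO_UTS_FIELD_LEN(machine);
}

/* Restart side: refuse an image whose field sizes differ from ours. */
static int demo_check_const(const struct demo_const *c)
{
	if (c->uts_sysname_len != DEMO_UTS_FIELD_LEN(sysname))
		return -EINVAL;
	if (c->uts_nodename_len != DEMO_UTS_FIELD_LEN(nodename))
		return -EINVAL;
	if (c->uts_release_len != DEMO_UTS_FIELD_LEN(release))
		return -EINVAL;
	if (c->uts_version_len != DEMO_UTS_FIELD_LEN(version))
		return -EINVAL;
	if (c->uts_machine_len != DEMO_UTS_FIELD_LEN(machine))
		return -EINVAL;
	return 0;
}
```

Checking the sizes once per image means the per-namespace records can use
fixed-size fields without repeating a length for every string.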

Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/Makefile              |    1 +
 checkpoint/checkpoint.c          |    5 +-
 checkpoint/namespace.c           |  100 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   26 ++++++++++
 checkpoint/process.c             |    2 +
 checkpoint/restart.c             |    6 ++
 include/linux/checkpoint.h       |    4 ++
 include/linux/checkpoint_hdr.h   |   26 +++++++++-
 include/linux/checkpoint_types.h |    6 ++
 include/linux/utsname.h          |    1 +
 kernel/nsproxy.c                 |   47 +++++++++++++++++-
 kernel/utsname.c                 |    3 +-
 12 files changed, 222 insertions(+), 5 deletions(-)
 create mode 100644 checkpoint/namespace.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index f56a7d6..bb2c0ca 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -8,5 +8,6 @@ obj-$(CONFIG_CHECKPOINT) += \
 	checkpoint.o \
 	restart.o \
 	process.o \
+	namespace.o \
 	files.o \
 	memory.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 52d7a92..2a193b3 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -288,9 +288,12 @@ static void fill_kernel_const(struct ckpt_const *h)
 	/* mm */
 	h->mm_saved_auxv_len = sizeof(mm->saved_auxv);
 	/* uts */
+	h->uts_sysname_len = sizeof(uts->sysname);
+	h->uts_nodename_len = sizeof(uts->nodename);
 	h->uts_release_len = sizeof(uts->release);
 	h->uts_version_len = sizeof(uts->version);
 	h->uts_machine_len = sizeof(uts->machine);
+	h->uts_domainname_len = sizeof(uts->domainname);
 }
 
 /* write the checkpoint header */
@@ -421,8 +424,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	rcu_read_lock();
 	nsproxy = task_nsproxy(t);
-	if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns)
-		ret = -EPERM;
 	if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
 		ret = -EPERM;
 	if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns) {
diff --git a/checkpoint/namespace.c b/checkpoint/namespace.c
new file mode 100644
index 0000000..49b8f0a
--- /dev/null
+++ b/checkpoint/namespace.c
@@ -0,0 +1,100 @@
+/*
+ *  Checkpoint namespaces
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/nsproxy.h>
+#include <linux/user_namespace.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * uts_ns  -  this needs to compile even for !CONFIG_UTS_NS, so
+ *   the code may not reside in kernel/utsname.c (which wouldn't
+ *   compile then).
+ */
+static int do_checkpoint_uts_ns(struct ckpt_ctx *ctx,
+				struct uts_namespace *uts_ns)
+{
+	struct ckpt_hdr_utsns *h;
+	struct new_utsname *name;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_UTS_NS);
+	if (!h)
+		return -ENOMEM;
+
+	down_read(&uts_sem);
+	name = &uts_ns->name;
+	memcpy(h->sysname, name->sysname, sizeof(name->sysname));
+	memcpy(h->nodename, name->nodename, sizeof(name->nodename));
+	memcpy(h->release, name->release, sizeof(name->release));
+	memcpy(h->version, name->version, sizeof(name->version));
+	memcpy(h->machine, name->machine, sizeof(name->machine));
+	memcpy(h->domainname, name->domainname, sizeof(name->domainname));
+	up_read(&uts_sem);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_uts_ns(ctx, (struct uts_namespace *) ptr);
+}
+
+static struct uts_namespace *do_restore_uts_ns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_utsns *h;
+	struct uts_namespace *uts_ns = NULL;
+	struct new_utsname *name;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_UTS_NS);
+	if (IS_ERR(h))
+		return (struct uts_namespace *) h;
+
+#ifdef CONFIG_UTS_NS
+	uts_ns = create_uts_ns();
+	if (!uts_ns) {
+		uts_ns = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+	down_write(&uts_sem);
+	name = &uts_ns->name;
+	memcpy(name->sysname, h->sysname, sizeof(name->sysname));
+	memcpy(name->nodename, h->nodename, sizeof(name->nodename));
+	memcpy(name->release, h->release, sizeof(name->release));
+	memcpy(name->version, h->version, sizeof(name->version));
+	memcpy(name->machine, h->machine, sizeof(name->machine));
+	memcpy(name->domainname, h->domainname, sizeof(name->domainname));
+	up_write(&uts_sem);
+#else
+	/* complain if image contains multiple namespaces */
+	if (ctx->stats.uts_ns) {
+		uts_ns = ERR_PTR(-EEXIST);
+		goto out;
+	}
+	uts_ns = current->nsproxy->uts_ns;
+	get_uts_ns(uts_ns);
+#endif
+
+	ctx->stats.uts_ns++;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return uts_ns;
+}
+
+void *restore_uts_ns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_uts_ns(ctx);
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index a8a99e7..2fd00a6 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -138,6 +138,22 @@ static int obj_ns_users(void *ptr)
 	return atomic_read(&((struct nsproxy *) ptr)->count);
 }
 
+static int obj_uts_ns_grab(void *ptr)
+{
+	get_uts_ns((struct uts_namespace *) ptr);
+	return 0;
+}
+
+static void obj_uts_ns_drop(void *ptr, int lastref)
+{
+	put_uts_ns((struct uts_namespace *) ptr);
+}
+
+static int obj_uts_ns_users(void *ptr)
+{
+	return atomic_read(&((struct uts_namespace *) ptr)->kref.refcount);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -193,6 +209,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_ns,
 		.restore = restore_ns,
 	},
+	/* uts_ns object */
+	{
+		.obj_name = "UTS_NS",
+		.obj_type = CKPT_OBJ_UTS_NS,
+		.ref_drop = obj_uts_ns_drop,
+		.ref_grab = obj_uts_ns_grab,
+		.ref_users = obj_uts_ns_users,
+		.checkpoint = checkpoint_uts_ns,
+		.restore = restore_uts_ns,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 3444aff..b34ee3d 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -17,8 +17,10 @@
 #include <linux/futex.h>
 #include <linux/compat.h>
 #include <linux/poll.h>
+#include <linux/utsname.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <linux/syscalls.h>
 
 
 #ifdef CONFIG_FUTEX
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 6183074..e48ad68 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -383,12 +383,18 @@ static int check_kernel_const(struct ckpt_const *h)
 	if (h->mm_saved_auxv_len != sizeof(mm->saved_auxv))
 		return -EINVAL;
 	/* uts */
+	if (h->uts_sysname_len != sizeof(uts->sysname))
+		return -EINVAL;
+	if (h->uts_nodename_len != sizeof(uts->nodename))
+		return -EINVAL;
 	if (h->uts_release_len != sizeof(uts->release))
 		return -EINVAL;
 	if (h->uts_version_len != sizeof(uts->version))
 		return -EINVAL;
 	if (h->uts_machine_len != sizeof(uts->machine))
 		return -EINVAL;
+	if (h->uts_domainname_len != sizeof(uts->domainname))
+		return -EINVAL;
 
 	return 0;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index e68afab..de3537a 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -147,6 +147,10 @@ extern int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_ns(struct ckpt_ctx *ctx);
 
+/* uts-ns */
+extern int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_uts_ns(struct ckpt_ctx *ctx);
+
 /* file table */
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 5a5916d..0da2f15 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -11,7 +11,6 @@
  */
 
 #include <linux/types.h>
-#include <linux/utsname.h>
 
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
@@ -60,6 +59,7 @@ enum {
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
 	CKPT_HDR_NS,
+	CKPT_HDR_UTS_NS,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -99,6 +99,7 @@ enum obj_type {
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
 	CKPT_OBJ_NS,
+	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_MAX
 };
 
@@ -109,9 +110,12 @@ struct ckpt_const {
 	/* mm */
 	__u16 mm_saved_auxv_len;
 	/* uts */
+	__u16 uts_sysname_len;
+	__u16 uts_nodename_len;
 	__u16 uts_release_len;
 	__u16 uts_version_len;
 	__u16 uts_machine_len;
+	__u16 uts_domainname_len;
 } __attribute__((aligned(8)));
 
 /* checkpoint image header */
@@ -186,6 +190,26 @@ struct ckpt_hdr_task_ns {
 
 struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
+	__s32 uts_objref;
+} __attribute__((aligned(8)));
+
+/* cannot include <linux/utsname.h> from userspace, so define: */
+#define CKPT_NEW_UTS_LEN  64
+#ifdef __KERNEL__
+#include <linux/utsname.h>
+#if CKPT_NEW_UTS_LEN != __NEW_UTS_LEN
+#error CKPT_NEW_UTS_LEN size is wrong per linux/utsname.h
+#endif
+#endif
+
+struct ckpt_hdr_utsns {
+	struct ckpt_hdr h;
+	char sysname[CKPT_NEW_UTS_LEN + 1];
+	char nodename[CKPT_NEW_UTS_LEN + 1];
+	char release[CKPT_NEW_UTS_LEN + 1];
+	char version[CKPT_NEW_UTS_LEN + 1];
+	char machine[CKPT_NEW_UTS_LEN + 1];
+	char domainname[CKPT_NEW_UTS_LEN + 1];
 } __attribute__((aligned(8)));
 
 /* task's shared resources */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index f214109..7d1b8c8 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -22,6 +22,10 @@
 #include <linux/ktime.h>
 #include <linux/wait.h>
 
+struct ckpt_stats {
+	int uts_ns;
+};
+
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
 
@@ -64,6 +68,8 @@ struct ckpt_ctx {
 	int active_pid;			/* (next) position in pids array */
 	struct completion complete;	/* container root and other tasks on */
 	wait_queue_head_t waitq;	/* start, end, and restart ordering */
+
+	struct ckpt_stats stats;	/* statistics */
 };
 
 #endif /* __KERNEL__ */
diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 3656b30..d6f24a9 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -50,6 +50,7 @@ static inline void get_uts_ns(struct uts_namespace *ns)
 	kref_get(&ns->kref);
 }
 
+extern struct uts_namespace *create_uts_ns(void);
 extern struct uts_namespace *copy_utsname(unsigned long flags,
 					struct uts_namespace *ns);
 extern void free_uts_ns(struct kref *kref);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 54cb987..4f48a68 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -245,6 +245,10 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (ret < 0 || exists)
 		goto out;
 
+	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
+	if (ret < 0)
+		goto out;
+
 	/* TODO: collect other namespaces here */
  out:
 	put_nsproxy(nsproxy);
@@ -260,9 +264,14 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (!h)
 		return -ENOMEM;
 
+	ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
+	if (ret <= 0)
+		goto out;
+	h->uts_objref = ret;
 	/* TODO: Write other namespaces here */
 
 	ret = ckpt_write_obj(ctx, &h->h);
+ out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
@@ -277,16 +286,52 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_ns *h;
 	struct nsproxy *nsproxy = NULL;
+	struct uts_namespace *uts_ns;
+	int ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
 	if (IS_ERR(h))
 		return (struct nsproxy *) h;
 
+	ret = -EINVAL;
+	if (h->uts_objref <= 0)
+		goto out;
+
+	uts_ns = ckpt_obj_fetch(ctx, h->uts_objref, CKPT_OBJ_UTS_NS);
+	if (IS_ERR(uts_ns)) {
+		ret = PTR_ERR(uts_ns);
+		goto out;
+	}
+
+#if defined(CONFIG_UTS_NS)
+	ret = -ENOMEM;
+	nsproxy = create_nsproxy();
+	if (!nsproxy)
+		goto out;
+
+	get_uts_ns(uts_ns);
+	nsproxy->uts_ns = uts_ns;
+
+	get_ipc_ns(current->nsproxy->ipc_ns);
+	nsproxy->ipc_ns = current->nsproxy->ipc_ns;
+	get_pid_ns(current->nsproxy->pid_ns);
+	nsproxy->pid_ns = current->nsproxy->pid_ns;
+	get_mnt_ns(current->nsproxy->mnt_ns);
+	nsproxy->mnt_ns = current->nsproxy->mnt_ns;
+	get_net(current->nsproxy->net_ns);
+	nsproxy->net_ns = current->nsproxy->net_ns;
+#else
 	nsproxy = current->nsproxy;
 	get_nsproxy(nsproxy);
 
-	/* TODO: add more namespaces here */
+	BUG_ON(nsproxy->uts_ns != uts_ns);
+#endif
 
+	/* TODO: add more namespaces here */
+	ret = 0;
+ out:
+	if (ret < 0)
+		nsproxy = ERR_PTR(ret);
 	ckpt_hdr_put(ctx, h);
 	return nsproxy;
 }
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..c82ed83 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -14,8 +14,9 @@
 #include <linux/utsname.h>
 #include <linux/err.h>
 #include <linux/slab.h>
+#include <linux/checkpoint.h>
 
-static struct uts_namespace *create_uts_ns(void)
+struct uts_namespace *create_uts_ns(void)
 {
 	struct uts_namespace *uts_ns;
 
-- 
1.6.0.4


* [PATCH v18 49/80] c/r: support for UTS namespace
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Dan Smith,
	Oren Laadan

From: Dan Smith <danms@us.ibm.com>

This patch adds a "phase" of checkpoint that saves out information about any
namespaces the task(s) may have.  It does so by tracking the namespace objects
of the tasks, so that later tasks sharing a namespace with an earlier task are
properly referenced in the checkpoint stream.

Changes[v17]:
  - Collect nsproxy->uts_ns
  - Save uts string lengths once in ckpt_hdr_const
  - Save and restore all fields of uts-ns
  - Don't overwrite global uts-ns if !CONFIG_UTS_NS
  - Replace sys_unshare() with create_uts_ns()
  - Take uts_sem around access to uts data
Changes:
  - Remove the kernel restore path
  - Punt on nested namespaces
  - Use __NEW_UTS_LEN in nodename and domainname buffers
  - Add a note to Documentation/checkpoint/internals.txt to indicate where
    in the save/restore process the UTS information is kept
  - Store (and track) the objref of the namespace itself instead of the
    nsproxy (based on comments from Dave on IRC)
  - Remove explicit check for non-root nsproxy
  - Store the nodename and domainname lengths and use ckpt_write_string()
    to store the actual name strings
  - Catch failure of ckpt_obj_add_ptr() in ckpt_write_namespaces()
  - Remove "types" bitfield and use the "is this new" flag to determine
    whether or not we should write out a new ns descriptor
  - Replace kernel restore path
  - Move the namespace information to be directly after the task
    information record
  - Update Documentation to reflect new location of namespace info
  - Support checkpoint and restart of nested UTS namespaces
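
To make the "save and restore all fields of uts-ns" item above more
concrete, here is a userspace sketch that snapshots the caller's UTS data
into a fixed-size record. It is not the kernel code path: it goes through
uname(2) instead of reading a struct uts_namespace under uts_sem, the
demo_* names and DEMO_UTS_LEN are hypothetical, and domainname is omitted
because it is not a portable member of struct utsname.

```c
#include <string.h>
#include <sys/utsname.h>

#define DEMO_UTS_LEN 64		/* assumed maximum string length per field */

/* Hypothetical fixed-size record, one NUL-terminated slot per field. */
struct demo_uts_record {
	char sysname[DEMO_UTS_LEN + 1];
	char nodename[DEMO_UTS_LEN + 1];
	char release[DEMO_UTS_LEN + 1];
	char version[DEMO_UTS_LEN + 1];
	char machine[DEMO_UTS_LEN + 1];
};

/* Bounded copy that always leaves the destination NUL-terminated. */
static void demo_copy_field(char *dst, const char *src)
{
	strncpy(dst, src, DEMO_UTS_LEN);
	dst[DEMO_UTS_LEN] = '\0';
}

/* Fill @rec from uname(2); returns 0 on success, -1 on failure. */
static int demo_snapshot_uts(struct demo_uts_record *rec)
{
	struct utsname u;

	if (uname(&u) < 0)
		return -1;

	/* Zero the record first so padding bytes are deterministic. */
	memset(rec, 0, sizeof(*rec));
	demo_copy_field(rec->sysname, u.sysname);
	demo_copy_field(rec->nodename, u.nodename);
	demo_copy_field(rec->release, u.release);
	demo_copy_field(rec->version, u.version);
	demo_copy_field(rec->machine, u.machine);
	return 0;
}
```

Fixed-size slots keep the record layout identical for every namespace in
the image, at the cost of writing unused bytes for short strings.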

Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile              |    1 +
 checkpoint/checkpoint.c          |    5 +-
 checkpoint/namespace.c           |  100 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   26 ++++++++++
 checkpoint/process.c             |    2 +
 checkpoint/restart.c             |    6 ++
 include/linux/checkpoint.h       |    4 ++
 include/linux/checkpoint_hdr.h   |   26 +++++++++-
 include/linux/checkpoint_types.h |    6 ++
 include/linux/utsname.h          |    1 +
 kernel/nsproxy.c                 |   47 +++++++++++++++++-
 kernel/utsname.c                 |    3 +-
 12 files changed, 222 insertions(+), 5 deletions(-)
 create mode 100644 checkpoint/namespace.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index f56a7d6..bb2c0ca 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -8,5 +8,6 @@ obj-$(CONFIG_CHECKPOINT) += \
 	checkpoint.o \
 	restart.o \
 	process.o \
+	namespace.o \
 	files.o \
 	memory.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 52d7a92..2a193b3 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -288,9 +288,12 @@ static void fill_kernel_const(struct ckpt_const *h)
 	/* mm */
 	h->mm_saved_auxv_len = sizeof(mm->saved_auxv);
 	/* uts */
+	h->uts_sysname_len = sizeof(uts->sysname);
+	h->uts_nodename_len = sizeof(uts->nodename);
 	h->uts_release_len = sizeof(uts->release);
 	h->uts_version_len = sizeof(uts->version);
 	h->uts_machine_len = sizeof(uts->machine);
+	h->uts_domainname_len = sizeof(uts->domainname);
 }
 
 /* write the checkpoint header */
@@ -421,8 +424,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	rcu_read_lock();
 	nsproxy = task_nsproxy(t);
-	if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns)
-		ret = -EPERM;
 	if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
 		ret = -EPERM;
 	if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns) {
diff --git a/checkpoint/namespace.c b/checkpoint/namespace.c
new file mode 100644
index 0000000..49b8f0a
--- /dev/null
+++ b/checkpoint/namespace.c
@@ -0,0 +1,100 @@
+/*
+ *  Checkpoint namespaces
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/nsproxy.h>
+#include <linux/user_namespace.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * uts_ns  -  this needs to compile even for !CONFIG_UTS_NS, so
+ *   the code may not reside in kernel/utsname.c (which wouldn't
+ *   compile then).
+ */
+static int do_checkpoint_uts_ns(struct ckpt_ctx *ctx,
+				struct uts_namespace *uts_ns)
+{
+	struct ckpt_hdr_utsns *h;
+	struct new_utsname *name;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_UTS_NS);
+	if (!h)
+		return -ENOMEM;
+
+	down_read(&uts_sem);
+	name = &uts_ns->name;
+	memcpy(h->sysname, name->sysname, sizeof(name->sysname));
+	memcpy(h->nodename, name->nodename, sizeof(name->nodename));
+	memcpy(h->release, name->release, sizeof(name->release));
+	memcpy(h->version, name->version, sizeof(name->version));
+	memcpy(h->machine, name->machine, sizeof(name->machine));
+	memcpy(h->domainname, name->domainname, sizeof(name->domainname));
+	up_read(&uts_sem);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_uts_ns(ctx, (struct uts_namespace *) ptr);
+}
+
+static struct uts_namespace *do_restore_uts_ns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_utsns *h;
+	struct uts_namespace *uts_ns = NULL;
+	struct new_utsname *name;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_UTS_NS);
+	if (IS_ERR(h))
+		return (struct uts_namespace *) h;
+
+#ifdef CONFIG_UTS_NS
+	uts_ns = create_uts_ns();
+	if (!uts_ns) {
+		uts_ns = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+	down_write(&uts_sem);
+	name = &uts_ns->name;
+	memcpy(name->sysname, h->sysname, sizeof(name->sysname));
+	memcpy(name->nodename, h->nodename, sizeof(name->nodename));
+	memcpy(name->release, h->release, sizeof(name->release));
+	memcpy(name->version, h->version, sizeof(name->version));
+	memcpy(name->machine, h->machine, sizeof(name->machine));
+	memcpy(name->domainname, h->domainname, sizeof(name->domainname));
+	up_write(&uts_sem);
+#else
+	/* complain if image contains multiple namespaces */
+	if (ctx->stats.uts_ns) {
+		uts_ns = ERR_PTR(-EEXIST);
+		goto out;
+	}
+	uts_ns = current->nsproxy->uts_ns;
+	get_uts_ns(uts_ns);
+#endif
+
+	ctx->stats.uts_ns++;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return uts_ns;
+}
+
+void *restore_uts_ns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_uts_ns(ctx);
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index a8a99e7..2fd00a6 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -138,6 +138,22 @@ static int obj_ns_users(void *ptr)
 	return atomic_read(&((struct nsproxy *) ptr)->count);
 }
 
+static int obj_uts_ns_grab(void *ptr)
+{
+	get_uts_ns((struct uts_namespace *) ptr);
+	return 0;
+}
+
+static void obj_uts_ns_drop(void *ptr, int lastref)
+{
+	put_uts_ns((struct uts_namespace *) ptr);
+}
+
+static int obj_uts_ns_users(void *ptr)
+{
+	return atomic_read(&((struct uts_namespace *) ptr)->kref.refcount);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -193,6 +209,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_ns,
 		.restore = restore_ns,
 	},
+	/* uts_ns object */
+	{
+		.obj_name = "UTS_NS",
+		.obj_type = CKPT_OBJ_UTS_NS,
+		.ref_drop = obj_uts_ns_drop,
+		.ref_grab = obj_uts_ns_grab,
+		.ref_users = obj_uts_ns_users,
+		.checkpoint = checkpoint_uts_ns,
+		.restore = restore_uts_ns,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 3444aff..b34ee3d 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -17,8 +17,10 @@
 #include <linux/futex.h>
 #include <linux/compat.h>
 #include <linux/poll.h>
+#include <linux/utsname.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <linux/syscalls.h>
 
 
 #ifdef CONFIG_FUTEX
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 6183074..e48ad68 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -383,12 +383,18 @@ static int check_kernel_const(struct ckpt_const *h)
 	if (h->mm_saved_auxv_len != sizeof(mm->saved_auxv))
 		return -EINVAL;
 	/* uts */
+	if (h->uts_sysname_len != sizeof(uts->sysname))
+		return -EINVAL;
+	if (h->uts_nodename_len != sizeof(uts->nodename))
+		return -EINVAL;
 	if (h->uts_release_len != sizeof(uts->release))
 		return -EINVAL;
 	if (h->uts_version_len != sizeof(uts->version))
 		return -EINVAL;
 	if (h->uts_machine_len != sizeof(uts->machine))
 		return -EINVAL;
+	if (h->uts_domainname_len != sizeof(uts->domainname))
+		return -EINVAL;
 
 	return 0;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index e68afab..de3537a 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -147,6 +147,10 @@ extern int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_ns(struct ckpt_ctx *ctx);
 
+/* uts-ns */
+extern int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_uts_ns(struct ckpt_ctx *ctx);
+
 /* file table */
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 5a5916d..0da2f15 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -11,7 +11,6 @@
  */
 
 #include <linux/types.h>
-#include <linux/utsname.h>
 
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
@@ -60,6 +59,7 @@ enum {
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
 	CKPT_HDR_NS,
+	CKPT_HDR_UTS_NS,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -99,6 +99,7 @@ enum obj_type {
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
 	CKPT_OBJ_NS,
+	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_MAX
 };
 
@@ -109,9 +110,12 @@ struct ckpt_const {
 	/* mm */
 	__u16 mm_saved_auxv_len;
 	/* uts */
+	__u16 uts_sysname_len;
+	__u16 uts_nodename_len;
 	__u16 uts_release_len;
 	__u16 uts_version_len;
 	__u16 uts_machine_len;
+	__u16 uts_domainname_len;
 } __attribute__((aligned(8)));
 
 /* checkpoint image header */
@@ -186,6 +190,26 @@ struct ckpt_hdr_task_ns {
 
 struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
+	__s32 uts_objref;
+} __attribute__((aligned(8)));
+
+/* cannot include <linux/utsname.h> from userspace, so define: */
+#define CKPT_NEW_UTS_LEN  64
+#ifdef __KERNEL__
+#include <linux/utsname.h>
+#if CKPT_NEW_UTS_LEN != __NEW_UTS_LEN
+#error CKPT_NEW_UTS_LEN size is wrong per linux/utsname.h
+#endif
+#endif
+
+struct ckpt_hdr_utsns {
+	struct ckpt_hdr h;
+	char sysname[CKPT_NEW_UTS_LEN + 1];
+	char nodename[CKPT_NEW_UTS_LEN + 1];
+	char release[CKPT_NEW_UTS_LEN + 1];
+	char version[CKPT_NEW_UTS_LEN + 1];
+	char machine[CKPT_NEW_UTS_LEN + 1];
+	char domainname[CKPT_NEW_UTS_LEN + 1];
 } __attribute__((aligned(8)));
 
 /* task's shared resources */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index f214109..7d1b8c8 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -22,6 +22,10 @@
 #include <linux/ktime.h>
 #include <linux/wait.h>
 
+struct ckpt_stats {
+	int uts_ns;
+};
+
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
 
@@ -64,6 +68,8 @@ struct ckpt_ctx {
 	int active_pid;			/* (next) position in pids array */
 	struct completion complete;	/* container root and other tasks on */
 	wait_queue_head_t waitq;	/* start, end, and restart ordering */
+
+	struct ckpt_stats stats;	/* statistics */
 };
 
 #endif /* __KERNEL__ */
diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 3656b30..d6f24a9 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -50,6 +50,7 @@ static inline void get_uts_ns(struct uts_namespace *ns)
 	kref_get(&ns->kref);
 }
 
+extern struct uts_namespace *create_uts_ns(void);
 extern struct uts_namespace *copy_utsname(unsigned long flags,
 					struct uts_namespace *ns);
 extern void free_uts_ns(struct kref *kref);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 54cb987..4f48a68 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -245,6 +245,10 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (ret < 0 || exists)
 		goto out;
 
+	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
+	if (ret < 0)
+		goto out;
+
 	/* TODO: collect other namespaces here */
  out:
 	put_nsproxy(nsproxy);
@@ -260,9 +264,14 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (!h)
 		return -ENOMEM;
 
+	ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
+	if (ret <= 0)
+		goto out;
+	h->uts_objref = ret;
 	/* TODO: Write other namespaces here */
 
 	ret = ckpt_write_obj(ctx, &h->h);
+ out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
@@ -277,16 +286,52 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_ns *h;
 	struct nsproxy *nsproxy = NULL;
+	struct uts_namespace *uts_ns;
+	int ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
 	if (IS_ERR(h))
 		return (struct nsproxy *) h;
 
+	ret = -EINVAL;
+	if (h->uts_objref <= 0)
+		goto out;
+
+	uts_ns = ckpt_obj_fetch(ctx, h->uts_objref, CKPT_OBJ_UTS_NS);
+	if (IS_ERR(uts_ns)) {
+		ret = PTR_ERR(uts_ns);
+		goto out;
+	}
+
+#if defined(CONFIG_UTS_NS)
+	ret = -ENOMEM;
+	nsproxy = create_nsproxy();
+	if (!nsproxy)
+		goto out;
+
+	get_uts_ns(uts_ns);
+	nsproxy->uts_ns = uts_ns;
+
+	get_ipc_ns(current->nsproxy->ipc_ns);
+	nsproxy->ipc_ns = current->nsproxy->ipc_ns;
+	get_pid_ns(current->nsproxy->pid_ns);
+	nsproxy->pid_ns = current->nsproxy->pid_ns;
+	get_mnt_ns(current->nsproxy->mnt_ns);
+	nsproxy->mnt_ns = current->nsproxy->mnt_ns;
+	get_net(current->nsproxy->net_ns);
+	nsproxy->net_ns = current->nsproxy->net_ns;
+#else
 	nsproxy = current->nsproxy;
 	get_nsproxy(nsproxy);
 
-	/* TODO: add more namespaces here */
+	BUG_ON(nsproxy->uts_ns != uts_ns);
+#endif
 
+	/* TODO: add more namespaces here */
+	ret = 0;
+ out:
+	if (ret < 0)
+		nsproxy = ERR_PTR(ret);
 	ckpt_hdr_put(ctx, h);
 	return nsproxy;
 }
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..c82ed83 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -14,8 +14,9 @@
 #include <linux/utsname.h>
 #include <linux/err.h>
 #include <linux/slab.h>
+#include <linux/checkpoint.h>
 
-static struct uts_namespace *create_uts_ns(void)
+struct uts_namespace *create_uts_ns(void)
 {
 	struct uts_namespace *uts_ns;
 
-- 
1.6.0.4


* [PATCH v18 49/80] c/r: support for UTS namespace
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Dan Smith,
	Oren Laadan

From: Dan Smith <danms@us.ibm.com>

Add a "phase" to checkpoint that saves information about the namespaces
of the checkpointed task(s).  It tracks each task's namespace objects so
that later tasks sharing a namespace refer to the same object in the
checkpoint stream instead of saving it again.

Changes[v17]:
  - Collect nsproxy->uts_ns
  - Save uts string lengths once in ckpt_hdr_const
  - Save and restore all fields of uts-ns
  - Don't overwrite global uts-ns if !CONFIG_UTS_NS
  - Replace sys_unshare() with create_uts_ns()
  - Take uts_sem around access to uts data
Changes:
  - Remove the kernel restore path
  - Punt on nested namespaces
  - Use __NEW_UTS_LEN in nodename and domainname buffers
  - Add a note to Documentation/checkpoint/internals.txt to indicate where
    in the save/restore process the UTS information is kept
  - Store (and track) the objref of the namespace itself instead of the
    nsproxy (based on comments from Dave on IRC)
  - Remove explicit check for non-root nsproxy
  - Store the nodename and domainname lengths and use ckpt_write_string()
    to store the actual name strings
  - Catch failure of ckpt_obj_add_ptr() in ckpt_write_namespaces()
  - Remove "types" bitfield and use the "is this new" flag to determine
    whether or not we should write out a new ns descriptor
  - Replace kernel restore path
  - Move the namespace information to be directly after the task
    information record
  - Update Documentation to reflect new location of namespace info
  - Support checkpoint and restart of nested UTS namespaces

Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/Makefile              |    1 +
 checkpoint/checkpoint.c          |    5 +-
 checkpoint/namespace.c           |  100 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   26 ++++++++++
 checkpoint/process.c             |    2 +
 checkpoint/restart.c             |    6 ++
 include/linux/checkpoint.h       |    4 ++
 include/linux/checkpoint_hdr.h   |   26 +++++++++-
 include/linux/checkpoint_types.h |    6 ++
 include/linux/utsname.h          |    1 +
 kernel/nsproxy.c                 |   47 +++++++++++++++++-
 kernel/utsname.c                 |    3 +-
 12 files changed, 222 insertions(+), 5 deletions(-)
 create mode 100644 checkpoint/namespace.c
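
For readers following the description above, the objref tracking can be
sketched in userspace roughly as follows. This is only an illustration
under stated assumptions: struct obj_table, OBJ_TABLE_MAX and
obj_lookup_add() are hypothetical names, not part of this patch, and the
real object hash in checkpoint/objhash.c also handles reference counts
and per-type operations.

```c
#include <stddef.h>

#define OBJ_TABLE_MAX 16	/* hypothetical fixed capacity for the sketch */

/* Hypothetical stand-in for the checkpoint object hash. */
struct obj_table {
	const void *ptr[OBJ_TABLE_MAX];	/* objects seen so far */
	int used;			/* number of valid entries */
};

/*
 * Return the objref (positive, 1-based) for @ptr.  *is_new is set to 1
 * when the pointer is seen for the first time, i.e. when the caller
 * would have to write the object's state to the image; for a pointer
 * seen before, *is_new is 0 and only the objref needs to be stored.
 * Returns -1 if the table is full.
 */
static int obj_lookup_add(struct obj_table *t, const void *ptr, int *is_new)
{
	int i;

	for (i = 0; i < t->used; i++) {
		if (t->ptr[i] == ptr) {
			*is_new = 0;
			return i + 1;
		}
	}

	if (t->used == OBJ_TABLE_MAX)
		return -1;

	t->ptr[t->used++] = ptr;
	*is_new = 1;
	return t->used;
}
```

With two tasks sharing one uts namespace, the first lookup reports the
object as new, so its record would be written once; the second lookup
returns the same objref with *is_new clear, so only the reference is
stored.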

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index f56a7d6..bb2c0ca 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -8,5 +8,6 @@ obj-$(CONFIG_CHECKPOINT) += \
 	checkpoint.o \
 	restart.o \
 	process.o \
+	namespace.o \
 	files.o \
 	memory.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 52d7a92..2a193b3 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -288,9 +288,12 @@ static void fill_kernel_const(struct ckpt_const *h)
 	/* mm */
 	h->mm_saved_auxv_len = sizeof(mm->saved_auxv);
 	/* uts */
+	h->uts_sysname_len = sizeof(uts->sysname);
+	h->uts_nodename_len = sizeof(uts->nodename);
 	h->uts_release_len = sizeof(uts->release);
 	h->uts_version_len = sizeof(uts->version);
 	h->uts_machine_len = sizeof(uts->machine);
+	h->uts_domainname_len = sizeof(uts->domainname);
 }
 
 /* write the checkpoint header */
@@ -421,8 +424,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	rcu_read_lock();
 	nsproxy = task_nsproxy(t);
-	if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns)
-		ret = -EPERM;
 	if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
 		ret = -EPERM;
 	if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns) {
diff --git a/checkpoint/namespace.c b/checkpoint/namespace.c
new file mode 100644
index 0000000..49b8f0a
--- /dev/null
+++ b/checkpoint/namespace.c
@@ -0,0 +1,100 @@
+/*
+ *  Checkpoint namespaces
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/nsproxy.h>
+#include <linux/user_namespace.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * uts_ns  -  this needs to compile even for !CONFIG_UTS_NS, so
+ *   the code may not reside in kernel/utsname.c (which wouldn't
+ *   compile then).
+ */
+static int do_checkpoint_uts_ns(struct ckpt_ctx *ctx,
+				struct uts_namespace *uts_ns)
+{
+	struct ckpt_hdr_utsns *h;
+	struct new_utsname *name;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_UTS_NS);
+	if (!h)
+		return -ENOMEM;
+
+	down_read(&uts_sem);
+	name = &uts_ns->name;
+	memcpy(h->sysname, name->sysname, sizeof(name->sysname));
+	memcpy(h->nodename, name->nodename, sizeof(name->nodename));
+	memcpy(h->release, name->release, sizeof(name->release));
+	memcpy(h->version, name->version, sizeof(name->version));
+	memcpy(h->machine, name->machine, sizeof(name->machine));
+	memcpy(h->domainname, name->domainname, sizeof(name->domainname));
+	up_read(&uts_sem);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_uts_ns(ctx, (struct uts_namespace *) ptr);
+}
+
+static struct uts_namespace *do_restore_uts_ns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_utsns *h;
+	struct uts_namespace *uts_ns = NULL;
+	struct new_utsname *name;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_UTS_NS);
+	if (IS_ERR(h))
+		return (struct uts_namespace *) h;
+
+#ifdef CONFIG_UTS_NS
+	uts_ns = create_uts_ns();
+	if (!uts_ns) {
+		uts_ns = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+	down_write(&uts_sem);
+	name = &uts_ns->name;
+	memcpy(name->sysname, h->sysname, sizeof(name->sysname));
+	memcpy(name->nodename, h->nodename, sizeof(name->nodename));
+	memcpy(name->release, h->release, sizeof(name->release));
+	memcpy(name->version, h->version, sizeof(name->version));
+	memcpy(name->machine, h->machine, sizeof(name->machine));
+	memcpy(name->domainname, h->domainname, sizeof(name->domainname));
+	up_write(&uts_sem);
+#else
+	/* complain if image contains multiple namespaces */
+	if (ctx->stats.uts_ns) {
+		uts_ns = ERR_PTR(-EEXIST);
+		goto out;
+	}
+	uts_ns = current->nsproxy->uts_ns;
+	get_uts_ns(uts_ns);
+#endif
+
+	ctx->stats.uts_ns++;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return uts_ns;
+}
+
+void *restore_uts_ns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_uts_ns(ctx);
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index a8a99e7..2fd00a6 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -138,6 +138,22 @@ static int obj_ns_users(void *ptr)
 	return atomic_read(&((struct nsproxy *) ptr)->count);
 }
 
+static int obj_uts_ns_grab(void *ptr)
+{
+	get_uts_ns((struct uts_namespace *) ptr);
+	return 0;
+}
+
+static void obj_uts_ns_drop(void *ptr, int lastref)
+{
+	put_uts_ns((struct uts_namespace *) ptr);
+}
+
+static int obj_uts_ns_users(void *ptr)
+{
+	return atomic_read(&((struct uts_namespace *) ptr)->kref.refcount);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -193,6 +209,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_ns,
 		.restore = restore_ns,
 	},
+	/* uts_ns object */
+	{
+		.obj_name = "UTS_NS",
+		.obj_type = CKPT_OBJ_UTS_NS,
+		.ref_drop = obj_uts_ns_drop,
+		.ref_grab = obj_uts_ns_grab,
+		.ref_users = obj_uts_ns_users,
+		.checkpoint = checkpoint_uts_ns,
+		.restore = restore_uts_ns,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 3444aff..b34ee3d 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -17,8 +17,10 @@
 #include <linux/futex.h>
 #include <linux/compat.h>
 #include <linux/poll.h>
+#include <linux/utsname.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <linux/syscalls.h>
 
 
 #ifdef CONFIG_FUTEX
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 6183074..e48ad68 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -383,12 +383,18 @@ static int check_kernel_const(struct ckpt_const *h)
 	if (h->mm_saved_auxv_len != sizeof(mm->saved_auxv))
 		return -EINVAL;
 	/* uts */
+	if (h->uts_sysname_len != sizeof(uts->sysname))
+		return -EINVAL;
+	if (h->uts_nodename_len != sizeof(uts->nodename))
+		return -EINVAL;
 	if (h->uts_release_len != sizeof(uts->release))
 		return -EINVAL;
 	if (h->uts_version_len != sizeof(uts->version))
 		return -EINVAL;
 	if (h->uts_machine_len != sizeof(uts->machine))
 		return -EINVAL;
+	if (h->uts_domainname_len != sizeof(uts->domainname))
+		return -EINVAL;
 
 	return 0;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index e68afab..de3537a 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -147,6 +147,10 @@ extern int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_ns(struct ckpt_ctx *ctx);
 
+/* uts-ns */
+extern int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_uts_ns(struct ckpt_ctx *ctx);
+
 /* file table */
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 5a5916d..0da2f15 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -11,7 +11,6 @@
  */
 
 #include <linux/types.h>
-#include <linux/utsname.h>
 
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
@@ -60,6 +59,7 @@ enum {
 	CKPT_HDR_THREAD,
 	CKPT_HDR_CPU,
 	CKPT_HDR_NS,
+	CKPT_HDR_UTS_NS,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -99,6 +99,7 @@ enum obj_type {
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
 	CKPT_OBJ_NS,
+	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_MAX
 };
 
@@ -109,9 +110,12 @@ struct ckpt_const {
 	/* mm */
 	__u16 mm_saved_auxv_len;
 	/* uts */
+	__u16 uts_sysname_len;
+	__u16 uts_nodename_len;
 	__u16 uts_release_len;
 	__u16 uts_version_len;
 	__u16 uts_machine_len;
+	__u16 uts_domainname_len;
 } __attribute__((aligned(8)));
 
 /* checkpoint image header */
@@ -186,6 +190,26 @@ struct ckpt_hdr_task_ns {
 
 struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
+	__s32 uts_objref;
+} __attribute__((aligned(8)));
+
+/* cannot include <linux/utsname.h> from userspace, so define: */
+#define CKPT_NEW_UTS_LEN  64
+#ifdef __KERNEL__
+#include <linux/utsname.h>
+#if CKPT_NEW_UTS_LEN != __NEW_UTS_LEN
+#error CKPT_NEW_UTS_LEN size is wrong per linux/utsname.h
+#endif
+#endif
+
+struct ckpt_hdr_utsns {
+	struct ckpt_hdr h;
+	char sysname[CKPT_NEW_UTS_LEN + 1];
+	char nodename[CKPT_NEW_UTS_LEN + 1];
+	char release[CKPT_NEW_UTS_LEN + 1];
+	char version[CKPT_NEW_UTS_LEN + 1];
+	char machine[CKPT_NEW_UTS_LEN + 1];
+	char domainname[CKPT_NEW_UTS_LEN + 1];
 } __attribute__((aligned(8)));
 
 /* task's shared resources */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index f214109..7d1b8c8 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -22,6 +22,10 @@
 #include <linux/ktime.h>
 #include <linux/wait.h>
 
+struct ckpt_stats {
+	int uts_ns;
+};
+
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
 
@@ -64,6 +68,8 @@ struct ckpt_ctx {
 	int active_pid;			/* (next) position in pids array */
 	struct completion complete;	/* container root and other tasks on */
 	wait_queue_head_t waitq;	/* start, end, and restart ordering */
+
+	struct ckpt_stats stats;	/* statistics */
 };
 
 #endif /* __KERNEL__ */
diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 3656b30..d6f24a9 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -50,6 +50,7 @@ static inline void get_uts_ns(struct uts_namespace *ns)
 	kref_get(&ns->kref);
 }
 
+extern struct uts_namespace *create_uts_ns(void);
 extern struct uts_namespace *copy_utsname(unsigned long flags,
 					struct uts_namespace *ns);
 extern void free_uts_ns(struct kref *kref);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 54cb987..4f48a68 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -245,6 +245,10 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (ret < 0 || exists)
 		goto out;
 
+	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
+	if (ret < 0)
+		goto out;
+
 	/* TODO: collect other namespaces here */
  out:
 	put_nsproxy(nsproxy);
@@ -260,9 +264,14 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (!h)
 		return -ENOMEM;
 
+	ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
+	if (ret <= 0)
+		goto out;
+	h->uts_objref = ret;
 	/* TODO: Write other namespaces here */
 
 	ret = ckpt_write_obj(ctx, &h->h);
+ out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
@@ -277,16 +286,52 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_ns *h;
 	struct nsproxy *nsproxy = NULL;
+	struct uts_namespace *uts_ns;
+	int ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
 	if (IS_ERR(h))
 		return (struct nsproxy *) h;
 
+	ret = -EINVAL;
+	if (h->uts_objref <= 0)
+		goto out;
+
+	uts_ns = ckpt_obj_fetch(ctx, h->uts_objref, CKPT_OBJ_UTS_NS);
+	if (IS_ERR(uts_ns)) {
+		ret = PTR_ERR(uts_ns);
+		goto out;
+	}
+
+#if defined(CONFIG_UTS_NS)
+	ret = -ENOMEM;
+	nsproxy = create_nsproxy();
+	if (!nsproxy)
+		goto out;
+
+	get_uts_ns(uts_ns);
+	nsproxy->uts_ns = uts_ns;
+
+	get_ipc_ns(current->nsproxy->ipc_ns);
+	nsproxy->ipc_ns = current->nsproxy->ipc_ns;
+	get_pid_ns(current->nsproxy->pid_ns);
+	nsproxy->pid_ns = current->nsproxy->pid_ns;
+	get_mnt_ns(current->nsproxy->mnt_ns);
+	nsproxy->mnt_ns = current->nsproxy->mnt_ns;
+	get_net(current->nsproxy->net_ns);
+	nsproxy->net_ns = current->nsproxy->net_ns;
+#else
 	nsproxy = current->nsproxy;
 	get_nsproxy(nsproxy);
 
-	/* TODO: add more namespaces here */
+	BUG_ON(nsproxy->uts_ns != uts_ns);
+#endif
 
+	/* TODO: add more namespaces here */
+	ret = 0;
+ out:
+	if (ret < 0)
+		nsproxy = ERR_PTR(ret);
 	ckpt_hdr_put(ctx, h);
 	return nsproxy;
 }
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..c82ed83 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -14,8 +14,9 @@
 #include <linux/utsname.h>
 #include <linux/err.h>
 #include <linux/slab.h>
+#include <linux/checkpoint.h>
 
-static struct uts_namespace *create_uts_ns(void)
+struct uts_namespace *create_uts_ns(void)
 {
 	struct uts_namespace *uts_ns;
 
-- 
1.6.0.4

* [PATCH v18 50/80] c/r (ipc): allow allocation of a desired ipc identifier
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (48 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 49/80] c/r: support for UTS namespace Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 51/80] c/r: save and restore sysvipc namespace basics Oren Laadan
                     ` (31 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

During restart, we need to allocate ipc objects with the same
identifiers as recorded during checkpoint. Modify the allocation
code to allow an in-kernel caller to request a specific ipc identifier.
The system call interface remains unchanged.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 ipc/msg.c  |   17 ++++++++++++-----
 ipc/sem.c  |   17 ++++++++++++-----
 ipc/shm.c  |   19 +++++++++++++------
 ipc/util.c |   42 +++++++++++++++++++++++++++++-------------
 ipc/util.h |    9 +++++----
 5 files changed, 71 insertions(+), 33 deletions(-)
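
As a reading aid, the way a requested identifier is split into an idr
slot and a sequence number, and then rebuilt, can be sketched in
userspace as below. This is a sketch under assumptions, not the kernel
code: SEQ_MULTIPLIER is assumed to be 32768 (IPCMNI), and id_to_idx(),
id_to_seq(), build_id() and resolve_id() are hypothetical helpers that
only mirror the arithmetic in ipc_addid().

```c
#include <errno.h>

#define SEQ_MULTIPLIER 32768	/* assumption: equals IPCMNI */

/* Slot index inside the idr; mirrors ipcid_to_idx(). */
static int id_to_idx(int id)
{
	return id % SEQ_MULTIPLIER;
}

/* Sequence number; mirrors "req_id / SEQ_MULTIPLIER" in ipc_addid(). */
static int id_to_seq(int id)
{
	return id / SEQ_MULTIPLIER;
}

/* Combine slot index and sequence number into a user-visible id. */
static int build_id(int idx, int seq)
{
	return SEQ_MULTIPLIER * seq + idx;
}

/*
 * Hypothetical helper: decide the final id once the allocator has
 * handed out slot @got_idx.  With req_id >= 0 the slot must match the
 * requested one exactly, otherwise -EBUSY is returned; with req_id < 0
 * the caller-supplied @next_seq is used, as in the "don't care" path.
 */
static int resolve_id(int req_id, int got_idx, int next_seq)
{
	if (req_id >= 0) {
		if (got_idx != id_to_idx(req_id))
			return -EBUSY;
		return build_id(got_idx, id_to_seq(req_id));
	}
	return build_id(got_idx, next_seq);
}
```

Under these assumptions, a restart that requests id 65539 asks for slot
3 with sequence number 2. If the allocator returns slot 3, the object
gets id 65539 again; if the slot is taken and a higher one is returned,
the request fails with -EBUSY rather than yielding a different id.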

diff --git a/ipc/msg.c b/ipc/msg.c
index 2ceab7f..1db7c45 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -73,7 +73,7 @@ struct msg_sender {
 #define msg_unlock(msq)		ipc_unlock(&(msq)->q_perm)
 
 static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
-static int newque(struct ipc_namespace *, struct ipc_params *);
+static int newque(struct ipc_namespace *, struct ipc_params *, int);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
 #endif
@@ -174,10 +174,12 @@ static inline void msg_rmid(struct ipc_namespace *ns, struct msg_queue *s)
  * newque - Create a new msg queue
  * @ns: namespace
  * @params: ptr to the structure that contains the key and msgflg
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with msg_ids.rw_mutex held (writer)
  */
-static int newque(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newque(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
 	struct msg_queue *msq;
 	int id, retval;
@@ -201,7 +203,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	/*
 	 * ipc_addid() locks msq
 	 */
-	id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
+	id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni, req_id);
 	if (id < 0) {
 		security_msg_queue_free(msq);
 		ipc_rcu_putref(msq);
@@ -309,7 +311,7 @@ static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
 	return security_msg_queue_associate(msq, msgflg);
 }
 
-SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+int do_msgget(key_t key, int msgflg, int req_id)
 {
 	struct ipc_namespace *ns;
 	struct ipc_ops msg_ops;
@@ -324,7 +326,12 @@ SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
 	msg_params.key = key;
 	msg_params.flg = msgflg;
 
-	return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
+	return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params, req_id);
+}
+
+SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+{
+	return do_msgget(key, msgflg, -1);
 }
 
 static inline unsigned long
diff --git a/ipc/sem.c b/ipc/sem.c
index 87c2b64..a2b2135 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -92,7 +92,7 @@
 #define sem_unlock(sma)		ipc_unlock(&(sma)->sem_perm)
 #define sem_checkid(sma, semid)	ipc_checkid(&sma->sem_perm, semid)
 
-static int newary(struct ipc_namespace *, struct ipc_params *);
+static int newary(struct ipc_namespace *, struct ipc_params *, int);
 static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
@@ -227,11 +227,13 @@ static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
  * newary - Create a new semaphore set
  * @ns: namespace
  * @params: ptr to the structure that contains key, semflg and nsems
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with sem_ids.rw_mutex held (as a writer)
  */
 
-static int newary(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newary(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
 	int id;
 	int retval;
@@ -263,7 +265,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 		return retval;
 	}
 
-	id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
+	id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni, req_id);
 	if (id < 0) {
 		security_sem_free(sma);
 		ipc_rcu_putref(sma);
@@ -308,7 +310,7 @@ static inline int sem_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+int do_semget(key_t key, int nsems, int semflg, int req_id)
 {
 	struct ipc_namespace *ns;
 	struct ipc_ops sem_ops;
@@ -327,7 +329,12 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
 	sem_params.flg = semflg;
 	sem_params.u.nsems = nsems;
 
-	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
+	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params, req_id);
+}
+
+SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+{
+	return do_semget(key, nsems, semflg, -1);
 }
 
 /*
diff --git a/ipc/shm.c b/ipc/shm.c
index 1bc4701..749dd57 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -62,7 +62,7 @@ static struct vm_operations_struct shm_vm_ops;
 #define shm_unlock(shp)			\
 	ipc_unlock(&(shp)->shm_perm)
 
-static int newseg(struct ipc_namespace *, struct ipc_params *);
+static int newseg(struct ipc_namespace *, struct ipc_params *, int);
 static void shm_open(struct vm_area_struct *vma);
 static void shm_close(struct vm_area_struct *vma);
 static void shm_destroy (struct ipc_namespace *ns, struct shmid_kernel *shp);
@@ -83,7 +83,7 @@ void shm_init_ns(struct ipc_namespace *ns)
  * Called with shm_ids.rw_mutex (writer) and the shp structure locked.
  * Only shm_ids.rw_mutex remains locked on exit.
  */
-static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct shmid_kernel *shp;
 	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
@@ -326,11 +326,13 @@ static struct vm_operations_struct shm_vm_ops = {
  * newseg - Create a new shared memory segment
  * @ns: namespace
  * @params: ptr to the structure that contains key, size and shmflg
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with shm_ids.rw_mutex held as a writer.
  */
 
-static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newseg(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
 	key_t key = params->key;
 	int shmflg = params->flg;
@@ -385,7 +387,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	if (IS_ERR(file))
 		goto no_file;
 
-	id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
+	id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni, req_id);
 	if (id < 0) {
 		error = id;
 		goto no_id;
@@ -445,7 +447,7 @@ static inline int shm_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
+int do_shmget(key_t key, size_t size, int shmflg, int req_id)
 {
 	struct ipc_namespace *ns;
 	struct ipc_ops shm_ops;
@@ -461,7 +463,12 @@ SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
 	shm_params.flg = shmflg;
 	shm_params.u.size = size;
 
-	return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
+	return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params, req_id);
+}
+
+SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
+{
+	return do_shmget(key, size, shmflg, -1);
 }
 
 static inline unsigned long copy_shmid_to_user(void __user *buf, struct shmid64_ds *in, int version)
diff --git a/ipc/util.c b/ipc/util.c
index b8e4ba9..ca248ec 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -247,10 +247,12 @@ int ipc_get_maxid(struct ipc_ids *ids)
  *	Called with ipc_ids.rw_mutex held as a writer.
  */
  
-int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
+int
+ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int size, int req_id)
 {
 	uid_t euid;
 	gid_t egid;
+	int lid = 0;
 	int id, err;
 
 	if (size > IPCMNI)
@@ -259,28 +261,41 @@ int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
 	if (ids->in_use >= size)
 		return -ENOSPC;
 
+	if (req_id >= 0)
+		lid = ipcid_to_idx(req_id);
+
 	spin_lock_init(&new->lock);
 	new->deleted = 0;
 	rcu_read_lock();
 	spin_lock(&new->lock);
 
-	err = idr_get_new(&ids->ipcs_idr, new, &id);
+	err = idr_get_new_above(&ids->ipcs_idr, new, lid, &id);
 	if (err) {
 		spin_unlock(&new->lock);
 		rcu_read_unlock();
 		return err;
 	}
 
+	if (req_id >= 0) {
+		if (id != lid) {
+			idr_remove(&ids->ipcs_idr, id);
+			spin_unlock(&new->lock);
+			rcu_read_unlock();
+			return -EBUSY;
+		}
+		new->seq = req_id / SEQ_MULTIPLIER;
+	} else {
+		new->seq = ids->seq++;
+		if (ids->seq > ids->seq_max)
+			ids->seq = 0;
+	}
+
 	ids->in_use++;
 
 	current_euid_egid(&euid, &egid);
 	new->cuid = new->uid = euid;
 	new->gid = new->cgid = egid;
 
-	new->seq = ids->seq++;
-	if(ids->seq > ids->seq_max)
-		ids->seq = 0;
-
 	new->id = ipc_buildid(id, new->seq);
 	return id;
 }
@@ -296,7 +311,7 @@ int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
  *	when the key is IPC_PRIVATE.
  */
 static int ipcget_new(struct ipc_namespace *ns, struct ipc_ids *ids,
-		struct ipc_ops *ops, struct ipc_params *params)
+		struct ipc_ops *ops, struct ipc_params *params, int req_id)
 {
 	int err;
 retry:
@@ -306,7 +321,7 @@ retry:
 		return -ENOMEM;
 
 	down_write(&ids->rw_mutex);
-	err = ops->getnew(ns, params);
+	err = ops->getnew(ns, params, req_id);
 	up_write(&ids->rw_mutex);
 
 	if (err == -EAGAIN)
@@ -351,6 +366,7 @@ static int ipc_check_perms(struct kern_ipc_perm *ipcp, struct ipc_ops *ops,
  *	@ids: IPC identifer set
  *	@ops: the actual creation routine to call
  *	@params: its parameters
+ *	@req_id: request desired id if available (-1 if don't care)
  *
  *	This routine is called by sys_msgget, sys_semget() and sys_shmget()
  *	when the key is not IPC_PRIVATE.
@@ -360,7 +376,7 @@ static int ipc_check_perms(struct kern_ipc_perm *ipcp, struct ipc_ops *ops,
  *	On success, the ipc id is returned.
  */
 static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
-		struct ipc_ops *ops, struct ipc_params *params)
+		struct ipc_ops *ops, struct ipc_params *params, int req_id)
 {
 	struct kern_ipc_perm *ipcp;
 	int flg = params->flg;
@@ -381,7 +397,7 @@ retry:
 		else if (!err)
 			err = -ENOMEM;
 		else
-			err = ops->getnew(ns, params);
+			err = ops->getnew(ns, params, req_id);
 	} else {
 		/* ipc object has been locked by ipc_findkey() */
 
@@ -742,12 +758,12 @@ struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id)
  * Common routine called by sys_msgget(), sys_semget() and sys_shmget().
  */
 int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
-			struct ipc_ops *ops, struct ipc_params *params)
+		struct ipc_ops *ops, struct ipc_params *params, int req_id)
 {
 	if (params->key == IPC_PRIVATE)
-		return ipcget_new(ns, ids, ops, params);
+		return ipcget_new(ns, ids, ops, params, req_id);
 	else
-		return ipcget_public(ns, ids, ops, params);
+		return ipcget_public(ns, ids, ops, params, req_id);
 }
 
 /**
diff --git a/ipc/util.h b/ipc/util.h
index 764b51a..159a73c 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -71,7 +71,7 @@ struct ipc_params {
  *      . routine to call for an extra check if needed
  */
 struct ipc_ops {
-	int (*getnew) (struct ipc_namespace *, struct ipc_params *);
+	int (*getnew) (struct ipc_namespace *, struct ipc_params *, int);
 	int (*associate) (struct kern_ipc_perm *, int);
 	int (*more_checks) (struct kern_ipc_perm *, struct ipc_params *);
 };
@@ -94,7 +94,7 @@ void __init ipc_init_proc_interface(const char *path, const char *header,
 #define ipcid_to_idx(id) ((id) % SEQ_MULTIPLIER)
 
 /* must be called with ids->rw_mutex acquired for writing */
-int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int);
+int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int, int);
 
 /* must be called with ids->rw_mutex acquired for reading */
 int ipc_get_maxid(struct ipc_ids *);
@@ -171,7 +171,8 @@ static inline void ipc_unlock(struct kern_ipc_perm *perm)
 
 struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id);
 int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
-			struct ipc_ops *ops, struct ipc_params *params);
+	   struct ipc_ops *ops, struct ipc_params *params, int req_id);
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
-		void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
+	       void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
+
 #endif
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 50/80] c/r (ipc): allow allocation of a desired ipc identifier
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

During restart, we need to allocate ipc objects with the same
identifiers as recorded during checkpoint. Modify the allocation
code to allow an in-kernel caller to request a specific ipc identifier.
The system call interface remains unchanged.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 ipc/msg.c  |   17 ++++++++++++-----
 ipc/sem.c  |   17 ++++++++++++-----
 ipc/shm.c  |   19 +++++++++++++------
 ipc/util.c |   42 +++++++++++++++++++++++++++++-------------
 ipc/util.h |    9 +++++----
 5 files changed, 71 insertions(+), 33 deletions(-)
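
The allocation rule in the ipc_addid() hunk below can be tried in userspace
with a small model. This is only a sketch under stated assumptions, not the
kernel code: the model_* names and MODEL_SLOTS are hypothetical, the value of
SEQ_MULTIPLIER is assumed (the diff only shows that ipcid_to_idx() uses it),
and the identifier is assumed to be composed as SEQ_MULTIPLIER * seq + idx.
Unlike ipc_addid(), which returns the slot index, the model returns the full
identifier for convenience.

```c
#include <assert.h>
#include <errno.h>

/* Assumption: value of SEQ_MULTIPLIER; the diff does not show it. */
#define SEQ_MULTIPLIER 32768
/* Hypothetical: a tiny slot table standing in for the idr. */
#define MODEL_SLOTS 8

struct model_ids {
	int used[MODEL_SLOTS];	/* 1 if the slot index is taken */
	int seq;		/* next sequence number for "don't care" */
};

/* Compose and decompose an identifier (assumed layout, see above). */
static int model_buildid(int idx, int seq) { return SEQ_MULTIPLIER * seq + idx; }
static int model_idx(int id) { return id % SEQ_MULTIPLIER; }
static int model_seq(int id) { return id / SEQ_MULTIPLIER; }

/* Stand-in for idr_get_new_above(): first free slot at or above 'start'. */
static int model_alloc_above(struct model_ids *ids, int start)
{
	assert(start >= 0);
	for (int i = start; i < MODEL_SLOTS; i++) {
		if (!ids->used[i]) {
			ids->used[i] = 1;
			return i;
		}
	}
	return -ENOSPC;
}

/*
 * req_id < 0 means "don't care". Returns the full identifier on
 * success, or a negative errno.
 */
static int model_addid(struct model_ids *ids, int req_id)
{
	int lid = req_id >= 0 ? model_idx(req_id) : 0;
	int idx = model_alloc_above(ids, lid);
	int seq;

	if (idx < 0)
		return idx;

	if (req_id >= 0) {
		if (idx != lid) {
			/* Requested slot was taken: undo, as idr_remove() does. */
			ids->used[idx] = 0;
			return -EBUSY;
		}
		seq = model_seq(req_id);	/* reuse the recorded sequence */
	} else {
		seq = ids->seq++;		/* normal path: next sequence */
	}
	return model_buildid(idx, seq);
}
```

Requesting an identifier whose slot index is already in use fails with -EBUSY
and leaves the table unchanged, which matches the "if (id != lid)" branch in
the patch. Passing -1 keeps the pre-patch behaviour.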

diff --git a/ipc/msg.c b/ipc/msg.c
index 2ceab7f..1db7c45 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -73,7 +73,7 @@ struct msg_sender {
 #define msg_unlock(msq)		ipc_unlock(&(msq)->q_perm)
 
 static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
-static int newque(struct ipc_namespace *, struct ipc_params *);
+static int newque(struct ipc_namespace *, struct ipc_params *, int);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
 #endif
@@ -174,10 +174,12 @@ static inline void msg_rmid(struct ipc_namespace *ns, struct msg_queue *s)
  * newque - Create a new msg queue
  * @ns: namespace
  * @params: ptr to the structure that contains the key and msgflg
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with msg_ids.rw_mutex held (writer)
  */
-static int newque(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newque(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
 	struct msg_queue *msq;
 	int id, retval;
@@ -201,7 +203,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	/*
 	 * ipc_addid() locks msq
 	 */
-	id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
+	id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni, req_id);
 	if (id < 0) {
 		security_msg_queue_free(msq);
 		ipc_rcu_putref(msq);
@@ -309,7 +311,7 @@ static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
 	return security_msg_queue_associate(msq, msgflg);
 }
 
-SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+int do_msgget(key_t key, int msgflg, int req_id)
 {
 	struct ipc_namespace *ns;
 	struct ipc_ops msg_ops;
@@ -324,7 +326,12 @@ SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
 	msg_params.key = key;
 	msg_params.flg = msgflg;
 
-	return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
+	return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params, req_id);
+}
+
+SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+{
+	return do_msgget(key, msgflg, -1);
 }
 
 static inline unsigned long
diff --git a/ipc/sem.c b/ipc/sem.c
index 87c2b64..a2b2135 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -92,7 +92,7 @@
 #define sem_unlock(sma)		ipc_unlock(&(sma)->sem_perm)
 #define sem_checkid(sma, semid)	ipc_checkid(&sma->sem_perm, semid)
 
-static int newary(struct ipc_namespace *, struct ipc_params *);
+static int newary(struct ipc_namespace *, struct ipc_params *, int);
 static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
@@ -227,11 +227,13 @@ static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
  * newary - Create a new semaphore set
  * @ns: namespace
  * @params: ptr to the structure that contains key, semflg and nsems
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with sem_ids.rw_mutex held (as a writer)
  */
 
-static int newary(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newary(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
 	int id;
 	int retval;
@@ -263,7 +265,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 		return retval;
 	}
 
-	id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
+	id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni, req_id);
 	if (id < 0) {
 		security_sem_free(sma);
 		ipc_rcu_putref(sma);
@@ -308,7 +310,7 @@ static inline int sem_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+int do_semget(key_t key, int nsems, int semflg, int req_id)
 {
 	struct ipc_namespace *ns;
 	struct ipc_ops sem_ops;
@@ -327,7 +329,12 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
 	sem_params.flg = semflg;
 	sem_params.u.nsems = nsems;
 
-	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
+	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params, req_id);
+}
+
+SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+{
+	return do_semget(key, nsems, semflg, -1);
 }
 
 /*
diff --git a/ipc/shm.c b/ipc/shm.c
index 1bc4701..749dd57 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -62,7 +62,7 @@ static struct vm_operations_struct shm_vm_ops;
 #define shm_unlock(shp)			\
 	ipc_unlock(&(shp)->shm_perm)
 
-static int newseg(struct ipc_namespace *, struct ipc_params *);
+static int newseg(struct ipc_namespace *, struct ipc_params *, int);
 static void shm_open(struct vm_area_struct *vma);
 static void shm_close(struct vm_area_struct *vma);
 static void shm_destroy (struct ipc_namespace *ns, struct shmid_kernel *shp);
@@ -83,7 +83,7 @@ void shm_init_ns(struct ipc_namespace *ns)
  * Called with shm_ids.rw_mutex (writer) and the shp structure locked.
  * Only shm_ids.rw_mutex remains locked on exit.
  */
-static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct shmid_kernel *shp;
 	shp = container_of(ipcp, struct shmid_kernel, shm_perm);
@@ -326,11 +326,13 @@ static struct vm_operations_struct shm_vm_ops = {
  * newseg - Create a new shared memory segment
  * @ns: namespace
  * @params: ptr to the structure that contains key, size and shmflg
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with shm_ids.rw_mutex held as a writer.
  */
 
-static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newseg(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
 	key_t key = params->key;
 	int shmflg = params->flg;
@@ -385,7 +387,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	if (IS_ERR(file))
 		goto no_file;
 
-	id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
+	id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni, req_id);
 	if (id < 0) {
 		error = id;
 		goto no_id;
@@ -445,7 +447,7 @@ static inline int shm_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
+int do_shmget(key_t key, size_t size, int shmflg, int req_id)
 {
 	struct ipc_namespace *ns;
 	struct ipc_ops shm_ops;
@@ -461,7 +463,12 @@ SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
 	shm_params.flg = shmflg;
 	shm_params.u.size = size;
 
-	return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
+	return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params, req_id);
+}
+
+SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
+{
+	return do_shmget(key, size, shmflg, -1);
 }
 
 static inline unsigned long copy_shmid_to_user(void __user *buf, struct shmid64_ds *in, int version)
diff --git a/ipc/util.c b/ipc/util.c
index b8e4ba9..ca248ec 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -247,10 +247,12 @@ int ipc_get_maxid(struct ipc_ids *ids)
  *	Called with ipc_ids.rw_mutex held as a writer.
  */
  
-int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
+int
+ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int size, int req_id)
 {
 	uid_t euid;
 	gid_t egid;
+	int lid = 0;
 	int id, err;
 
 	if (size > IPCMNI)
@@ -259,28 +261,41 @@ int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
 	if (ids->in_use >= size)
 		return -ENOSPC;
 
+	if (req_id >= 0)
+		lid = ipcid_to_idx(req_id);
+
 	spin_lock_init(&new->lock);
 	new->deleted = 0;
 	rcu_read_lock();
 	spin_lock(&new->lock);
 
-	err = idr_get_new(&ids->ipcs_idr, new, &id);
+	err = idr_get_new_above(&ids->ipcs_idr, new, lid, &id);
 	if (err) {
 		spin_unlock(&new->lock);
 		rcu_read_unlock();
 		return err;
 	}
 
+	if (req_id >= 0) {
+		if (id != lid) {
+			idr_remove(&ids->ipcs_idr, id);
+			spin_unlock(&new->lock);
+			rcu_read_unlock();
+			return -EBUSY;
+		}
+		new->seq = req_id / SEQ_MULTIPLIER;
+	} else {
+		new->seq = ids->seq++;
+		if (ids->seq > ids->seq_max)
+			ids->seq = 0;
+	}
+
 	ids->in_use++;
 
 	current_euid_egid(&euid, &egid);
 	new->cuid = new->uid = euid;
 	new->gid = new->cgid = egid;
 
-	new->seq = ids->seq++;
-	if(ids->seq > ids->seq_max)
-		ids->seq = 0;
-
 	new->id = ipc_buildid(id, new->seq);
 	return id;
 }
@@ -296,7 +311,7 @@ int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
  *	when the key is IPC_PRIVATE.
  */
 static int ipcget_new(struct ipc_namespace *ns, struct ipc_ids *ids,
-		struct ipc_ops *ops, struct ipc_params *params)
+		struct ipc_ops *ops, struct ipc_params *params, int req_id)
 {
 	int err;
 retry:
@@ -306,7 +321,7 @@ retry:
 		return -ENOMEM;
 
 	down_write(&ids->rw_mutex);
-	err = ops->getnew(ns, params);
+	err = ops->getnew(ns, params, req_id);
 	up_write(&ids->rw_mutex);
 
 	if (err == -EAGAIN)
@@ -351,6 +366,7 @@ static int ipc_check_perms(struct kern_ipc_perm *ipcp, struct ipc_ops *ops,
  *	@ids: IPC identifer set
  *	@ops: the actual creation routine to call
  *	@params: its parameters
+ *	@req_id: request desired id if available (-1 if don't care)
  *
  *	This routine is called by sys_msgget, sys_semget() and sys_shmget()
  *	when the key is not IPC_PRIVATE.
@@ -360,7 +376,7 @@ static int ipc_check_perms(struct kern_ipc_perm *ipcp, struct ipc_ops *ops,
  *	On success, the ipc id is returned.
  */
 static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
-		struct ipc_ops *ops, struct ipc_params *params)
+		struct ipc_ops *ops, struct ipc_params *params, int req_id)
 {
 	struct kern_ipc_perm *ipcp;
 	int flg = params->flg;
@@ -381,7 +397,7 @@ retry:
 		else if (!err)
 			err = -ENOMEM;
 		else
-			err = ops->getnew(ns, params);
+			err = ops->getnew(ns, params, req_id);
 	} else {
 		/* ipc object has been locked by ipc_findkey() */
 
@@ -742,12 +758,12 @@ struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id)
  * Common routine called by sys_msgget(), sys_semget() and sys_shmget().
  */
 int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
-			struct ipc_ops *ops, struct ipc_params *params)
+		struct ipc_ops *ops, struct ipc_params *params, int req_id)
 {
 	if (params->key == IPC_PRIVATE)
-		return ipcget_new(ns, ids, ops, params);
+		return ipcget_new(ns, ids, ops, params, req_id);
 	else
-		return ipcget_public(ns, ids, ops, params);
+		return ipcget_public(ns, ids, ops, params, req_id);
 }
 
 /**
diff --git a/ipc/util.h b/ipc/util.h
index 764b51a..159a73c 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -71,7 +71,7 @@ struct ipc_params {
  *      . routine to call for an extra check if needed
  */
 struct ipc_ops {
-	int (*getnew) (struct ipc_namespace *, struct ipc_params *);
+	int (*getnew) (struct ipc_namespace *, struct ipc_params *, int);
 	int (*associate) (struct kern_ipc_perm *, int);
 	int (*more_checks) (struct kern_ipc_perm *, struct ipc_params *);
 };
@@ -94,7 +94,7 @@ void __init ipc_init_proc_interface(const char *path, const char *header,
 #define ipcid_to_idx(id) ((id) % SEQ_MULTIPLIER)
 
 /* must be called with ids->rw_mutex acquired for writing */
-int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int);
+int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int, int);
 
 /* must be called with ids->rw_mutex acquired for reading */
 int ipc_get_maxid(struct ipc_ids *);
@@ -171,7 +171,8 @@ static inline void ipc_unlock(struct kern_ipc_perm *perm)
 
 struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id);
 int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
-			struct ipc_ops *ops, struct ipc_params *params);
+	   struct ipc_ops *ops, struct ipc_params *params, int req_id);
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
-		void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
+	       void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
+
 #endif
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 51/80] c/r: save and restore sysvipc namespace basics
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (49 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 50/80] c/r (ipc): allow allocation of a desired ipc identifier Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 52/80] c/r: support share-memory sysv-ipc Oren Laadan
                     ` (30 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Add helpers to checkpoint and restore the contents of 'struct
kern_ipc_perm'. Add header structures for ipc state. Add placeholders
to save and restore ipc state.

Save and restore the common state (parameters) of the ipc namespace.

Add generic code to iterate through the objects of sysvipc shared memory,
message queues and semaphores. The logic to save and restore the state
of these objects will be added in the next few patches.

Right now, we return -EPERM if the user calling sys_restart() isn't
allowed to create an object with the checkpointed uid.  We could
instead simply use the caller's uid in that case, but that might lead
to subtle userspace bugs.  Unsure, so going for the stricter behavior.

TODO: restore kern_ipc_perm->security.

Changelog[v17]:
  - Fix include: use checkpoint.h not checkpoint_hdr.h
  - Collect nsproxy->ipc_ns
  - Restore objects in the right namespace
  - If !CONFIG_IPC_NS only restore objects, not global settings
  - Don't overwrite global ipc-ns if !CONFIG_IPC_NS
  - Reset the checkpointed uid and gid info on ipc objects
  - Fix compilation with CONFIG_SYSVIPC=n
Changelog [Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
  - Fix compilation with CONFIG_SYSVIPC=n
  - Update to match UTS changes

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 checkpoint/checkpoint.c          |    2 -
 checkpoint/objhash.c             |   28 ++++
 include/linux/checkpoint.h       |   13 ++
 include/linux/checkpoint_hdr.h   |   54 +++++++
 include/linux/checkpoint_types.h |    1 +
 init/Kconfig                     |    6 +
 ipc/Makefile                     |    2 +-
 ipc/checkpoint.c                 |  317 ++++++++++++++++++++++++++++++++++++++
 ipc/namespace.c                  |    2 +-
 ipc/util.h                       |   10 ++
 kernel/nsproxy.c                 |   22 ++-
 11 files changed, 449 insertions(+), 8 deletions(-)
 create mode 100644 ipc/checkpoint.c
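
The uid policy discussed in the description above can be contrasted with
the alternative in a small userspace model. This is only a sketch of the two
choices, not the check used in ipc/checkpoint.c: struct model_cred, its
'privileged' flag (standing in for whatever capability the real code tests)
and both model_* functions are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the credentials of the sys_restart() caller. */
struct model_cred {
	unsigned int euid;
	bool privileged;	/* assumed: may create objects for other uids */
};

/*
 * Strict policy (the one chosen here): refuse the restart instead of
 * silently changing the owner. *out_uid is written only on success.
 */
static int model_restore_uid_strict(const struct model_cred *c,
				    unsigned int saved_uid,
				    unsigned int *out_uid)
{
	assert(c && out_uid);
	if (saved_uid != c->euid && !c->privileged)
		return -EPERM;
	*out_uid = saved_uid;
	return 0;
}

/*
 * Lenient alternative: fall back to the caller's uid. It never fails,
 * but the restored object may end up with a different owner than the
 * one that was checkpointed.
 */
static int model_restore_uid_lenient(const struct model_cred *c,
				     unsigned int saved_uid,
				     unsigned int *out_uid)
{
	assert(c && out_uid);
	if (saved_uid == c->euid || c->privileged)
		*out_uid = saved_uid;
	else
		*out_uid = c->euid;
	return 0;
}
```

With the lenient variant an application that later compares the owner of an
ipc object against a uid it remembered would see a mismatch after restart,
which is the kind of subtle userspace bug the stricter behaviour avoids.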

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 2a193b3..e35c5a8 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -424,8 +424,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	rcu_read_lock();
 	nsproxy = task_nsproxy(t);
-	if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
-		ret = -EPERM;
 	if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns) {
 		__ckpt_write_err(ctx, "TE", "bad mnt_ns", -EPERM);
 		ret = -EPERM;
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 2fd00a6..15a5caf 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -15,6 +15,8 @@
 #include <linux/hash.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -154,6 +156,22 @@ static int obj_uts_ns_users(void *ptr)
 	return atomic_read(&((struct uts_namespace *) ptr)->kref.refcount);
 }
 
+static int obj_ipc_ns_grab(void *ptr)
+{
+	get_ipc_ns((struct ipc_namespace *) ptr);
+	return 0;
+}
+
+static void obj_ipc_ns_drop(void *ptr, int lastref)
+{
+	put_ipc_ns((struct ipc_namespace *) ptr);
+}
+
+static int obj_ipc_ns_users(void *ptr)
+{
+	return atomic_read(&((struct ipc_namespace *) ptr)->count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -219,6 +237,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_uts_ns,
 		.restore = restore_uts_ns,
 	},
+	/* ipc_ns object */
+	{
+		.obj_name = "IPC_NS",
+		.obj_type = CKPT_OBJ_IPC_NS,
+		.ref_drop = obj_ipc_ns_drop,
+		.ref_grab = obj_ipc_ns_grab,
+		.ref_users = obj_ipc_ns_users,
+		.checkpoint = checkpoint_ipc_ns,
+		.restore = restore_ipc_ns,
+	},
 };
 
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index de3537a..c8501b8 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -22,6 +22,9 @@
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
 
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
@@ -151,6 +154,15 @@ extern void *restore_ns(struct ckpt_ctx *ctx);
 extern int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_uts_ns(struct ckpt_ctx *ctx);
 
+/* ipc-ns */
+#ifdef CONFIG_SYSVIPC
+extern int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_ipc_ns(struct ckpt_ctx *ctx);
+#else
+#define checkpoint_ipc_ns  checkpoint_bad
+#define restore_ipc_ns  restore_bad
+#endif /* CONFIG_SYSVIPC */
+
 /* file table */
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
@@ -224,6 +236,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DFILE	0x10		/* files and filesystem */
 #define CKPT_DMEM	0x20		/* memory state */
 #define CKPT_DPAGE	0x40		/* memory pages */
+#define CKPT_DIPC	0x80		/* sysvipc */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0da2f15..1ecec60 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -60,6 +60,7 @@ enum {
 	CKPT_HDR_CPU,
 	CKPT_HDR_NS,
 	CKPT_HDR_UTS_NS,
+	CKPT_HDR_IPC_NS,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -74,6 +75,11 @@ enum {
 	CKPT_HDR_PGARR,
 	CKPT_HDR_MM_CONTEXT,
 
+	CKPT_HDR_IPC = 501,
+	CKPT_HDR_IPC_SHM,
+	CKPT_HDR_IPC_MSG,
+	CKPT_HDR_IPC_SEM,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -100,6 +106,7 @@ enum obj_type {
 	CKPT_OBJ_MM,
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
+	CKPT_OBJ_IPC_NS,
 	CKPT_OBJ_MAX
 };
 
@@ -191,6 +198,7 @@ struct ckpt_hdr_task_ns {
 struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
 	__s32 uts_objref;
+	__s32 ipc_objref;
 } __attribute__((aligned(8)));
 
 /* cannot include <linux/tty.h> from userspace, so define: */
@@ -332,4 +340,50 @@ struct ckpt_hdr_pgarr {
 } __attribute__((aligned(8)));
 
 
+/* ipc commons */
+struct ckpt_hdr_ipcns {
+	struct ckpt_hdr h;
+	__u64 shm_ctlmax;
+	__u64 shm_ctlall;
+	__s32 shm_ctlmni;
+
+	__s32 msg_ctlmax;
+	__s32 msg_ctlmnb;
+	__s32 msg_ctlmni;
+
+	__s32 sem_ctl_msl;
+	__s32 sem_ctl_mns;
+	__s32 sem_ctl_opm;
+	__s32 sem_ctl_mni;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc {
+	struct ckpt_hdr h;
+	__u32 ipc_type;
+	__u32 ipc_count;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc_perms {
+	__s32 id;
+	__u32 key;
+	__u32 uid;
+	__u32 gid;
+	__u32 cuid;
+	__u32 cgid;
+	__u32 mode;
+	__u32 _padding;
+	__u64 seq;
+} __attribute__((aligned(8)));
+
+
+#define CKPT_TST_OVERFLOW_16(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
+
+#define CKPT_TST_OVERFLOW_32(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > INT_MAX))
+
+#define CKPT_TST_OVERFLOW_64(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > LONG_MAX))
+
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 7d1b8c8..7e6051f 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -24,6 +24,7 @@
 
 struct ckpt_stats {
 	int uts_ns;
+	int ipc_ns;
 };
 
 struct ckpt_ctx {
diff --git a/init/Kconfig b/init/Kconfig
index 46329b9..e601a61 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -191,6 +191,12 @@ config SYSVIPC
 	  section 6.4 of the Linux Programmer's Guide, available from
 	  <http://www.tldp.org/guides.html>.
 
+config SYSVIPC_CHECKPOINT
+	bool
+	depends on SYSVIPC
+	depends on CHECKPOINT
+	default y
+
 config SYSVIPC_SYSCTL
 	bool
 	depends on SYSVIPC
diff --git a/ipc/Makefile b/ipc/Makefile
index 4e1955e..b747127 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,4 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
new file mode 100644
index 0000000..4eb1a97
--- /dev/null
+++ b/ipc/checkpoint.c
@@ -0,0 +1,317 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/ipc.h>
+#include <linux/msg.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "util.h"
+
+/* for ckpt_debug */
+static char *ipc_ind_to_str[] = { "sem", "msg", "shm" };
+
+#define shm_ids(ns)	((ns)->ids[IPC_SHM_IDS])
+#define msg_ids(ns)	((ns)->ids[IPC_MSG_IDS])
+#define sem_ids(ns)	((ns)->ids[IPC_SEM_IDS])
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+			      struct kern_ipc_perm *perm)
+{
+	if (ipcperms(perm, S_IROTH))
+		return -EACCES;
+
+	h->id = perm->id;
+	h->key = perm->key;
+	h->uid = perm->uid;
+	h->gid = perm->gid;
+	h->cuid = perm->cuid;
+	h->cgid = perm->cgid;
+	h->mode = perm->mode & S_IRWXUGO;
+	h->seq = perm->seq;
+
+	return 0;
+}
+
+static int checkpoint_ipc_any(struct ckpt_ctx *ctx,
+			      struct ipc_namespace *ipc_ns,
+			      int ipc_ind, int ipc_type,
+			      int (*func)(int id, void *p, void *data))
+{
+	struct ckpt_hdr_ipc *h;
+	struct ipc_ids *ipc_ids = &ipc_ns->ids[ipc_ind];
+	int ret = -ENOMEM;
+
+	down_read(&ipc_ids->rw_mutex);
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+	if (!h)
+		goto out;
+
+	h->ipc_type = ipc_type;
+	h->ipc_count = ipc_ids->in_use;
+	ckpt_debug("ipc-%s count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ret = idr_for_each(&ipc_ids->ipcs_idr, func, ctx);
+	ckpt_debug("ipc-%s ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+ out:
+	up_read(&ipc_ids->rw_mutex);
+	return ret;
+}
+
+static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
+				struct ipc_namespace *ipc_ns)
+{
+	struct ckpt_hdr_ipcns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+	if (!h)
+		return -ENOMEM;
+
+	down_read(&shm_ids(ipc_ns).rw_mutex);
+	h->shm_ctlmax = ipc_ns->shm_ctlmax;
+	h->shm_ctlall = ipc_ns->shm_ctlall;
+	h->shm_ctlmni = ipc_ns->shm_ctlmni;
+	up_read(&shm_ids(ipc_ns).rw_mutex);
+
+	down_read(&msg_ids(ipc_ns).rw_mutex);
+	h->msg_ctlmax = ipc_ns->msg_ctlmax;
+	h->msg_ctlmnb = ipc_ns->msg_ctlmnb;
+	h->msg_ctlmni = ipc_ns->msg_ctlmni;
+	up_read(&msg_ids(ipc_ns).rw_mutex);
+
+	down_read(&sem_ids(ipc_ns).rw_mutex);
+	h->sem_ctl_msl = ipc_ns->sem_ctls[0];
+	h->sem_ctl_mns = ipc_ns->sem_ctls[1];
+	h->sem_ctl_opm = ipc_ns->sem_ctls[2];
+	h->sem_ctl_mni = ipc_ns->sem_ctls[3];
+	up_read(&sem_ids(ipc_ns).rw_mutex);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+#if 0 /* NEXT FEW PATCHES */
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
+				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
+	if (ret < 0)
+		return ret;
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
+				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
+	if (ret < 0)
+		return ret;
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
+				 CKPT_HDR_IPC_SEM, checkpoint_ipc_sem);
+#endif
+	return ret;
+}
+
+int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_ipc_ns(ctx, (struct ipc_namespace *) ptr);
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/*
+ * check whether current task may create ipc object with
+ * checkpointed uids and gids.
+ * Return 1 if ok, 0 if not.
+ */
+static int validate_created_perms(struct ckpt_hdr_ipc_perms *h)
+{
+	const struct cred *cred = current_cred();
+	uid_t uid = cred->uid, euid = cred->euid;
+
+	/* actually I don't know - is CAP_IPC_OWNER the right one? */
+	if (((h->uid != uid && h->uid != euid) ||
+			(h->cuid != uid && h->cuid != euid) ||
+			!in_group_p(h->cgid) ||
+			!in_group_p(h->gid)) &&
+			!capable(CAP_IPC_OWNER))
+		return 0;
+	return 1;
+}
+
+int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+			   struct kern_ipc_perm *perm)
+{
+	if (h->id < 0)
+		return -EINVAL;
+	if (CKPT_TST_OVERFLOW_16(h->uid, perm->uid) ||
+	    CKPT_TST_OVERFLOW_16(h->gid, perm->gid) ||
+	    CKPT_TST_OVERFLOW_16(h->cuid, perm->cuid) ||
+	    CKPT_TST_OVERFLOW_16(h->cgid, perm->cgid) ||
+	    CKPT_TST_OVERFLOW_16(h->mode, perm->mode))
+		return -EINVAL;
+	if (h->seq >= USHORT_MAX)
+		return -EINVAL;
+	if (h->mode & ~S_IRWXUGO)
+		return -EINVAL;
+
+	/* FIX: verify the ->mode field makes sense */
+
+	perm->id = h->id;
+	perm->key = h->key;
+
+	if (!validate_created_perms(h))
+		return -EPERM;
+	perm->uid = h->uid;
+	perm->gid = h->gid;
+	perm->cuid = h->cuid;
+	perm->cgid = h->cgid;
+	perm->mode = h->mode;
+	perm->seq = h->seq;
+	/*
+	 * Todo: restore perm->security.
+	 * At the moment it gets set by security_x_alloc() called through
+	 * ipcget()->ipcget_public()->ops->getnew (->newque for instance)
+	 * We will want to ask the LSM to consider resetting the
+	 * checkpointed ->security, based on current_security(),
+	 * the checkpointed ->security, and the checkpoint file context.
+	 */
+
+	return 0;
+}
+
+static int restore_ipc_any(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns,
+			   int ipc_ind, int ipc_type,
+			   int (*func)(struct ckpt_ctx *ctx,
+				       struct ipc_namespace *ns))
+{
+	struct ckpt_hdr_ipc *h;
+	int n, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("ipc-%s: count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+	ret = -EINVAL;
+	if (h->ipc_type != ipc_type)
+		goto out;
+
+	ret = 0;
+	for (n = 0; n < h->ipc_count; n++) {
+		ret = (*func)(ctx, ipc_ns);
+		if (ret < 0)
+			goto out;
+	}
+ out:
+	ckpt_debug("ipc-%s: ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
+{
+	struct ipc_namespace *ipc_ns = NULL;
+	struct ckpt_hdr_ipcns *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	ret = -EINVAL;
+	if (h->shm_ctlmax < 0 || h->shm_ctlall < 0 || h->shm_ctlmni < 0)
+		goto out;
+	if (h->msg_ctlmax < 0 || h->msg_ctlmnb < 0 || h->msg_ctlmni < 0)
+		goto out;
+	if (h->sem_ctl_msl < 0 || h->sem_ctl_mns < 0 ||
+	    h->sem_ctl_opm < 0 || h->sem_ctl_mni < 0)
+		goto out;
+
+	/*
+	 * If !CONFIG_IPC_NS, do not restore the global IPC state, as
+	 * it is used by other processes. It is ok to try to restore
+	 * the {shm,msg,sem} objects: in the worst case the requested
+	 * identifiers will be in use.
+	 */
+#ifdef CONFIG_IPC_NS
+	ret = -ENOMEM;
+	ipc_ns = create_ipc_ns();
+	if (!ipc_ns)
+		goto out;
+
+	down_read(&shm_ids(ipc_ns).rw_mutex);
+	ipc_ns->shm_ctlmax = h->shm_ctlmax;
+	ipc_ns->shm_ctlall = h->shm_ctlall;
+	ipc_ns->shm_ctlmni = h->shm_ctlmni;
+	up_read(&shm_ids(ipc_ns).rw_mutex);
+
+	down_read(&msg_ids(ipc_ns).rw_mutex);
+	ipc_ns->msg_ctlmax = h->msg_ctlmax;
+	ipc_ns->msg_ctlmnb = h->msg_ctlmnb;
+	ipc_ns->msg_ctlmni = h->msg_ctlmni;
+	up_read(&msg_ids(ipc_ns).rw_mutex);
+
+	down_read(&sem_ids(ipc_ns).rw_mutex);
+	ipc_ns->sem_ctls[0] = h->sem_ctl_msl;
+	ipc_ns->sem_ctls[1] = h->sem_ctl_mns;
+	ipc_ns->sem_ctls[2] = h->sem_ctl_opm;
+	ipc_ns->sem_ctls[3] = h->sem_ctl_mni;
+	up_read(&sem_ids(ipc_ns).rw_mutex);
+#else
+	ret = -EEXIST;
+	/* complain if image contains multiple namespaces */
+	if (ctx->stats.ipc_ns)
+		goto out;
+	ipc_ns = current->nsproxy->ipc_ns;
+	get_ipc_ns(ipc_ns);
+#endif
+
+#if 0 /* NEXT FEW PATCHES */
+	ret = restore_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
+			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
+	if (ret < 0)
+		goto out;
+	ret = restore_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
+			      CKPT_HDR_IPC_MSG, restore_ipc_msg);
+	if (ret < 0)
+		goto out;
+	ret = restore_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
+			      CKPT_HDR_IPC_SEM, restore_ipc_sem);
+#endif
+	if (ret < 0)
+		goto out;
+
+	ctx->stats.ipc_ns++;
+ out:
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0) {
+		put_ipc_ns(ipc_ns);
+		ipc_ns = ERR_PTR(ret);
+	}
+	return ipc_ns;
+}
+
+void *restore_ipc_ns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_ipc_ns(ctx);
+}
diff --git a/ipc/namespace.c b/ipc/namespace.c
index a1094ff..8e5ea32 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -14,7 +14,7 @@
 
 #include "util.h"
 
-static struct ipc_namespace *create_ipc_ns(void)
+struct ipc_namespace *create_ipc_ns(void)
 {
 	struct ipc_namespace *ns;
 	int err;
diff --git a/ipc/util.h b/ipc/util.h
index 159a73c..8ae1f8e 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -12,6 +12,7 @@
 
 #include <linux/unistd.h>
 #include <linux/err.h>
+#include <linux/checkpoint.h>
 
 #define SEQ_MULTIPLIER	(IPCMNI)
 
@@ -175,4 +176,13 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 	       void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
 
+struct ipc_namespace *create_ipc_ns(void);
+
+#ifdef CONFIG_CHECKPOINT
+extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+				     struct kern_ipc_perm *perm);
+extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+				  struct kern_ipc_perm *perm);
+#endif
+
 #endif
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 4f48a68..fddc724 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -248,6 +248,7 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
 	if (ret < 0)
 		goto out;
+	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
 
 	/* TODO: collect other namespaces here */
  out:
@@ -268,6 +269,11 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (ret <= 0)
 		goto out;
 	h->uts_objref = ret;
+	ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
+	if (ret < 0)
+		goto out;
+	h->ipc_objref = ret;
+
 	/* TODO: Write other namespaces here */
 
 	ret = ckpt_write_obj(ctx, &h->h);
@@ -287,6 +293,7 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	struct ckpt_hdr_ns *h;
 	struct nsproxy *nsproxy = NULL;
 	struct uts_namespace *uts_ns;
+	struct ipc_namespace *ipc_ns;
 	int ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
@@ -294,7 +301,8 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 		return (struct nsproxy *) h;
 
 	ret = -EINVAL;
-	if (h->uts_objref <= 0)
+	if (h->uts_objref <= 0 ||
+	    h->ipc_objref <= 0)
 		goto out;
 
 	uts_ns = ckpt_obj_fetch(ctx, h->uts_objref, CKPT_OBJ_UTS_NS);
@@ -302,8 +310,13 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 		ret = PTR_ERR(uts_ns);
 		goto out;
 	}
+	ipc_ns = ckpt_obj_fetch(ctx, h->ipc_objref, CKPT_OBJ_IPC_NS);
+	if (IS_ERR(ipc_ns)) {
+		ret = PTR_ERR(ipc_ns);
+		goto out;
+	}
 
-#if defined(COFNIG_UTS_NS)
+#if defined(CONFIG_UTS_NS) || defined(CONFIG_IPC_NS)
 	ret = -ENOMEM;
 	nsproxy = create_nsproxy();
 	if (!nsproxy)
@@ -311,9 +324,9 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 
 	get_uts_ns(uts_ns);
 	nsproxy->uts_ns = uts_ns;
-
-	get_ipc_ns(current->nsproxy->ipc_ns);
+	get_ipc_ns(ipc_ns);
 	nsproxy->ipc_ns = ipc_ns;
+
 	get_pid_ns(current->nsproxy->pid_ns);
 	nsproxy->pid_ns = current->nsproxy->pid_ns;
 	get_mnt_ns(current->nsproxy->mnt_ns);
@@ -325,6 +338,7 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	get_nsproxy(nsproxy);
 
 	BUG_ON(nsproxy->uts_ns != uts_ns);
+	BUG_ON(nsproxy->ipc_ns != ipc_ns);
 #endif
 
 	/* TODO: add more namespaces here */
-- 
1.6.0.4


* [PATCH v18 51/80] c/r: save and restore sysvipc namespace basics
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Add the helpers to checkpoint and restore the contents of 'struct
kern_ipc_perm'. Add header structures for ipc state. Put place-holders
to save and restore ipc state.
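
As an illustration of the kind of check the kern_ipc_perm helpers need when
a wide image field is loaded into a narrower in-memory field, here is a
minimal user-space sketch. It is not code from this patch:
TST_OVERFLOW_U16 and load_u16_field() are hypothetical names, and the
sketch assumes an unsigned 16-bit destination and compares against
USHRT_MAX from <limits.h>.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical user-space analogue of a size-aware overflow test:
 * report overflow only when the source type is wider than the
 * destination AND the value exceeds the destination's limit.
 */
#define TST_OVERFLOW_U16(src, dst) \
	((sizeof(src) > sizeof(dst)) && ((src) > USHRT_MAX))

/*
 * Copy a 32-bit image field into a 16-bit in-memory field.
 * Returns 0 on success, -1 if the value would be truncated;
 * on failure the destination is left untouched.
 */
static int load_u16_field(uint32_t image_val, uint16_t *field)
{
	if (TST_OVERFLOW_U16(image_val, *field))
		return -1;
	*field = (uint16_t)image_val;
	return 0;
}
```

Calling load_u16_field(0644, &mode) stores the value, while
load_u16_field(70000, &mode) is rejected without modifying mode.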

Save and restore the common state (parameters) of the ipc namespace.
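
The save/validate/load pattern for namespace parameters can be sketched
in user-space C as follows. This is a simplified illustration under
stated assumptions, not the patch's code: struct limits, struct
limits_rec, save_limits() and load_limits() are hypothetical names, and
only three message-queue style tunables are modelled.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical in-memory limits, loosely modelled on the msg tunables. */
struct limits {
	int ctlmax, ctlmnb, ctlmni;
};

/* Hypothetical fixed-width image record holding the same limits. */
struct limits_rec {
	int32_t ctlmax, ctlmnb, ctlmni;
};

/* Checkpoint side: copy the live limits into the image record. */
static void save_limits(const struct limits *l, struct limits_rec *r)
{
	r->ctlmax = l->ctlmax;
	r->ctlmnb = l->ctlmnb;
	r->ctlmni = l->ctlmni;
}

/*
 * Restart side: validate every field before touching *l, so that a
 * corrupt image leaves the live limits unchanged.
 */
static int load_limits(const struct limits_rec *r, struct limits *l)
{
	if (r->ctlmax < 0 || r->ctlmnb < 0 || r->ctlmni < 0)
		return -EINVAL;
	l->ctlmax = r->ctlmax;
	l->ctlmnb = r->ctlmnb;
	l->ctlmni = r->ctlmni;
	return 0;
}
```

Validating the whole record first keeps the restore all-or-nothing for
this set of parameters.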

Add generic code to iterate through the objects of sysvipc shared memory,
message queues and semaphores. The logic to save and restore the state
of these objects will be added in the next few patches.
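
The iteration scheme (a header carrying type and count, followed by one
record per object, with restart invoking a per-object callback exactly
'count' times) can be sketched in user-space C. Everything below is a
hypothetical stand-in: struct stream, struct sec_hdr, save_section(),
restore_section() and sum_record() are illustrative names, and a flat
byte buffer replaces the checkpoint image.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the image stream: a flat byte buffer. */
struct stream {
	unsigned char buf[256];
	size_t len;	/* bytes written so far */
	size_t pos;	/* read cursor */
};

/* Hypothetical section header: object type plus object count. */
struct sec_hdr {
	uint32_t type;
	uint32_t count;
};

static int stream_write(struct stream *s, const void *p, size_t n)
{
	if (n > sizeof(s->buf) - s->len)
		return -1;
	memcpy(s->buf + s->len, p, n);
	s->len += n;
	return 0;
}

static int stream_read(struct stream *s, void *p, size_t n)
{
	if (n > s->len - s->pos)
		return -1;
	memcpy(p, s->buf + s->pos, n);
	s->pos += n;
	return 0;
}

/* Save: emit the header, then one record per object. */
static int save_section(struct stream *s, uint32_t type,
			const uint32_t *objs, uint32_t count)
{
	struct sec_hdr h = { .type = type, .count = count };
	uint32_t i;

	if (stream_write(s, &h, sizeof(h)) < 0)
		return -1;
	for (i = 0; i < count; i++)
		if (stream_write(s, &objs[i], sizeof(objs[i])) < 0)
			return -1;
	return 0;
}

/*
 * Restore: read the header, reject a type mismatch, then call the
 * per-object callback exactly 'count' times.
 */
static int restore_section(struct stream *s, uint32_t type,
			   int (*func)(struct stream *s, void *data),
			   void *data)
{
	struct sec_hdr h;
	uint32_t i;

	if (stream_read(s, &h, sizeof(h)) < 0)
		return -1;
	if (h.type != type)
		return -1;
	for (i = 0; i < h.count; i++)
		if (func(s, data) < 0)
			return -1;
	return 0;
}

/* Example per-object callback: read one record and add it to a sum. */
static int sum_record(struct stream *s, void *data)
{
	uint32_t v;

	if (stream_read(s, &v, sizeof(v)) < 0)
		return -1;
	*(uint32_t *)data += v;
	return 0;
}
```

Writing the count up front lets the restart side loop without needing
an end-of-section marker; the type check catches an image whose
sections are out of order.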

Right now, we return -EPERM if the user calling sys_restart() isn't
allowed to create an object with the checkpointed uid.  We may prefer
to simply use the caller's uid in that case - but that could lead to
subtle userspace bugs?  Unsure, so going for the stricter behavior.
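
The strict policy described above can be sketched in user-space C. This
is an illustration of the intent, not the patch's implementation:
struct caller, struct saved_perms, may_create() and
check_restore_perms() are hypothetical names, group membership is
reduced to a single gid, and a boolean stands in for the capability
check.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the restarting task's credentials. */
struct caller {
	uint32_t uid, euid;
	uint32_t gid;		/* single group, for brevity */
	bool ipc_owner_cap;	/* stands in for a capability check */
};

/* Hypothetical subset of the checkpointed permission record. */
struct saved_perms {
	uint32_t uid, gid, cuid, cgid;
};

/*
 * Return true if the caller may recreate an object carrying the
 * checkpointed owner/creator ids; the capability overrides mismatches.
 */
static bool may_create(const struct saved_perms *p, const struct caller *c)
{
	bool owner_ok = (p->uid == c->uid || p->uid == c->euid);
	bool creator_ok = (p->cuid == c->uid || p->cuid == c->euid);
	bool group_ok = (p->gid == c->gid && p->cgid == c->gid);

	return (owner_ok && creator_ok && group_ok) || c->ipc_owner_cap;
}

/*
 * Strict policy: refuse with -EPERM rather than silently falling
 * back to the caller's own ids.
 */
static int check_restore_perms(const struct saved_perms *p,
			       const struct caller *c)
{
	return may_create(p, c) ? 0 : -EPERM;
}
```

The alternative mentioned above (substituting the caller's uid) would
replace the -EPERM return with a rewrite of the saved ids; the sketch
keeps the stricter behavior.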

TODO: restore kern_ipc_perm->security.

Changelog[v17]:
  - Fix include: use checkpoint.h not checkpoint_hdr.h
  - Collect nsproxy->ipc_ns
  - Restore objects in the right namespace
  - If !CONFIG_IPC_NS only restore objects, not global settings
  - Don't overwrite global ipc-ns if !CONFIG_IPC_NS
  - Reset the checkpointed uid and gid info on ipc objects
  - Fix compilation with CONFIG_SYSVIPC=n
Changelog [Dan Smith <danms@us.ibm.com>]
  - Fix compilation with CONFIG_SYSVIPC=n
  - Update to match UTS changes

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c          |    2 -
 checkpoint/objhash.c             |   28 ++++
 include/linux/checkpoint.h       |   13 ++
 include/linux/checkpoint_hdr.h   |   54 +++++++
 include/linux/checkpoint_types.h |    1 +
 init/Kconfig                     |    6 +
 ipc/Makefile                     |    2 +-
 ipc/checkpoint.c                 |  317 ++++++++++++++++++++++++++++++++++++++
 ipc/namespace.c                  |    2 +-
 ipc/util.h                       |   10 ++
 kernel/nsproxy.c                 |   22 ++-
 11 files changed, 449 insertions(+), 8 deletions(-)
 create mode 100644 ipc/checkpoint.c

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 2a193b3..e35c5a8 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -424,8 +424,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	rcu_read_lock();
 	nsproxy = task_nsproxy(t);
-	if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
-		ret = -EPERM;
 	if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns) {
 		__ckpt_write_err(ctx, "TE", "bad mnt_ns", -EPERM);
 		ret = -EPERM;
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 2fd00a6..15a5caf 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -15,6 +15,8 @@
 #include <linux/hash.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -154,6 +156,22 @@ static int obj_uts_ns_users(void *ptr)
 	return atomic_read(&((struct uts_namespace *) ptr)->kref.refcount);
 }
 
+static int obj_ipc_ns_grab(void *ptr)
+{
+	get_ipc_ns((struct ipc_namespace *) ptr);
+	return 0;
+}
+
+static void obj_ipc_ns_drop(void *ptr, int lastref)
+{
+	put_ipc_ns((struct ipc_namespace *) ptr);
+}
+
+static int obj_ipc_ns_users(void *ptr)
+{
+	return atomic_read(&((struct ipc_namespace *) ptr)->count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -219,6 +237,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_uts_ns,
 		.restore = restore_uts_ns,
 	},
+	/* ipc_ns object */
+	{
+		.obj_name = "IPC_NS",
+		.obj_type = CKPT_OBJ_IPC_NS,
+		.ref_drop = obj_ipc_ns_drop,
+		.ref_grab = obj_ipc_ns_grab,
+		.ref_users = obj_ipc_ns_users,
+		.checkpoint = checkpoint_ipc_ns,
+		.restore = restore_ipc_ns,
+	},
 };
 
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index de3537a..c8501b8 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -22,6 +22,9 @@
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
 
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
@@ -151,6 +154,15 @@ extern void *restore_ns(struct ckpt_ctx *ctx);
 extern int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_uts_ns(struct ckpt_ctx *ctx);
 
+/* ipc-ns */
+#ifdef CONFIG_SYSVIPC
+extern int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_ipc_ns(struct ckpt_ctx *ctx);
+#else
+#define checkpoint_ipc_ns  checkpoint_bad
+#define restore_ipc_ns  restore_bad
+#endif /* CONFIG_SYSVIPC */
+
 /* file table */
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
@@ -224,6 +236,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DFILE	0x10		/* files and filesystem */
 #define CKPT_DMEM	0x20		/* memory state */
 #define CKPT_DPAGE	0x40		/* memory pages */
+#define CKPT_DIPC	0x80		/* sysvipc */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0da2f15..1ecec60 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -60,6 +60,7 @@ enum {
 	CKPT_HDR_CPU,
 	CKPT_HDR_NS,
 	CKPT_HDR_UTS_NS,
+	CKPT_HDR_IPC_NS,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -74,6 +75,11 @@ enum {
 	CKPT_HDR_PGARR,
 	CKPT_HDR_MM_CONTEXT,
 
+	CKPT_HDR_IPC = 501,
+	CKPT_HDR_IPC_SHM,
+	CKPT_HDR_IPC_MSG,
+	CKPT_HDR_IPC_SEM,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -100,6 +106,7 @@ enum obj_type {
 	CKPT_OBJ_MM,
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
+	CKPT_OBJ_IPC_NS,
 	CKPT_OBJ_MAX
 };
 
@@ -191,6 +198,7 @@ struct ckpt_hdr_task_ns {
 struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
 	__s32 uts_objref;
+	__s32 ipc_objref;
 } __attribute__((aligned(8)));
 
 /* cannot include <linux/tty.h> from userspace, so define: */
@@ -332,4 +340,50 @@ struct ckpt_hdr_pgarr {
 } __attribute__((aligned(8)));
 
 
+/* ipc commons */
+struct ckpt_hdr_ipcns {
+	struct ckpt_hdr h;
+	__u64 shm_ctlmax;
+	__u64 shm_ctlall;
+	__s32 shm_ctlmni;
+
+	__s32 msg_ctlmax;
+	__s32 msg_ctlmnb;
+	__s32 msg_ctlmni;
+
+	__s32 sem_ctl_msl;
+	__s32 sem_ctl_mns;
+	__s32 sem_ctl_opm;
+	__s32 sem_ctl_mni;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc {
+	struct ckpt_hdr h;
+	__u32 ipc_type;
+	__u32 ipc_count;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc_perms {
+	__s32 id;
+	__u32 key;
+	__u32 uid;
+	__u32 gid;
+	__u32 cuid;
+	__u32 cgid;
+	__u32 mode;
+	__u32 _padding;
+	__u64 seq;
+} __attribute__((aligned(8)));
+
+
+#define CKPT_TST_OVERFLOW_16(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
+
+#define CKPT_TST_OVERFLOW_32(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > INT_MAX))
+
+#define CKPT_TST_OVERFLOW_64(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > LONG_MAX))
+
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 7d1b8c8..7e6051f 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -24,6 +24,7 @@
 
 struct ckpt_stats {
 	int uts_ns;
+	int ipc_ns;
 };
 
 struct ckpt_ctx {
diff --git a/init/Kconfig b/init/Kconfig
index 46329b9..e601a61 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -191,6 +191,12 @@ config SYSVIPC
 	  section 6.4 of the Linux Programmer's Guide, available from
 	  <http://www.tldp.org/guides.html>.
 
+config SYSVIPC_CHECKPOINT
+	bool
+	depends on SYSVIPC
+	depends on CHECKPOINT
+	default y
+
 config SYSVIPC_SYSCTL
 	bool
 	depends on SYSVIPC
diff --git a/ipc/Makefile b/ipc/Makefile
index 4e1955e..b747127 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,4 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
new file mode 100644
index 0000000..4eb1a97
--- /dev/null
+++ b/ipc/checkpoint.c
@@ -0,0 +1,317 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/ipc.h>
+#include <linux/msg.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "util.h"
+
+/* for ckpt_debug */
+static char *ipc_ind_to_str[] = { "sem", "msg", "shm" };
+
+#define shm_ids(ns)	((ns)->ids[IPC_SHM_IDS])
+#define msg_ids(ns)	((ns)->ids[IPC_MSG_IDS])
+#define sem_ids(ns)	((ns)->ids[IPC_SEM_IDS])
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+			      struct kern_ipc_perm *perm)
+{
+	if (ipcperms(perm, S_IROTH))
+		return -EACCES;
+
+	h->id = perm->id;
+	h->key = perm->key;
+	h->uid = perm->uid;
+	h->gid = perm->gid;
+	h->cuid = perm->cuid;
+	h->cgid = perm->cgid;
+	h->mode = perm->mode & S_IRWXUGO;
+	h->seq = perm->seq;
+
+	return 0;
+}
+
+static int checkpoint_ipc_any(struct ckpt_ctx *ctx,
+			      struct ipc_namespace *ipc_ns,
+			      int ipc_ind, int ipc_type,
+			      int (*func)(int id, void *p, void *data))
+{
+	struct ckpt_hdr_ipc *h;
+	struct ipc_ids *ipc_ids = &ipc_ns->ids[ipc_ind];
+	int ret = -ENOMEM;
+
+	down_read(&ipc_ids->rw_mutex);
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+	if (!h)
+		goto out;
+
+	h->ipc_type = ipc_type;
+	h->ipc_count = ipc_ids->in_use;
+	ckpt_debug("ipc-%s count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ret = idr_for_each(&ipc_ids->ipcs_idr, func, ctx);
+	ckpt_debug("ipc-%s ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+ out:
+	up_read(&ipc_ids->rw_mutex);
+	return ret;
+}
+
+static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
+				struct ipc_namespace *ipc_ns)
+{
+	struct ckpt_hdr_ipcns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+	if (!h)
+		return -ENOMEM;
+
+	down_read(&shm_ids(ipc_ns).rw_mutex);
+	h->shm_ctlmax = ipc_ns->shm_ctlmax;
+	h->shm_ctlall = ipc_ns->shm_ctlall;
+	h->shm_ctlmni = ipc_ns->shm_ctlmni;
+	up_read(&shm_ids(ipc_ns).rw_mutex);
+
+	down_read(&msg_ids(ipc_ns).rw_mutex);
+	h->msg_ctlmax = ipc_ns->msg_ctlmax;
+	h->msg_ctlmnb = ipc_ns->msg_ctlmnb;
+	h->msg_ctlmni = ipc_ns->msg_ctlmni;
+	up_read(&msg_ids(ipc_ns).rw_mutex);
+
+	down_read(&sem_ids(ipc_ns).rw_mutex);
+	h->sem_ctl_msl = ipc_ns->sem_ctls[0];
+	h->sem_ctl_mns = ipc_ns->sem_ctls[1];
+	h->sem_ctl_opm = ipc_ns->sem_ctls[2];
+	h->sem_ctl_mni = ipc_ns->sem_ctls[3];
+	up_read(&sem_ids(ipc_ns).rw_mutex);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+#if 0 /* NEXT FEW PATCHES */
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
+				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
+	if (ret < 0)
+		return ret;
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
+				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
+	if (ret < 0)
+		return ret;
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
+				 CKPT_HDR_IPC_SEM, checkpoint_ipc_sem);
+#endif
+	return ret;
+}
+
+int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_ipc_ns(ctx, (struct ipc_namespace *) ptr);
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/*
+ * check whether current task may create ipc object with
+ * checkpointed uids and gids.
+ * Return 1 if ok, 0 if not.
+ */
+static int validate_created_perms(struct ckpt_hdr_ipc_perms *h)
+{
+	const struct cred *cred = current_cred();
+	uid_t uid = cred->uid, euid = cred->euid;
+
+	/* actually I don't know - is CAP_IPC_OWNER the right one? */
+	if (((h->uid != uid && h->uid != euid) ||
+			(h->cuid != uid && h->cuid != euid) ||
+			!in_group_p(h->cgid) ||
+			!in_group_p(h->gid)) &&
+			!capable(CAP_IPC_OWNER))
+		return 0;
+	return 1;
+}
+
+int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+			   struct kern_ipc_perm *perm)
+{
+	if (h->id < 0)
+		return -EINVAL;
+	if (CKPT_TST_OVERFLOW_16(h->uid, perm->uid) ||
+	    CKPT_TST_OVERFLOW_16(h->gid, perm->gid) ||
+	    CKPT_TST_OVERFLOW_16(h->cuid, perm->cuid) ||
+	    CKPT_TST_OVERFLOW_16(h->cgid, perm->cgid) ||
+	    CKPT_TST_OVERFLOW_16(h->mode, perm->mode))
+		return -EINVAL;
+	if (h->seq >= USHORT_MAX)
+		return -EINVAL;
+	if (h->mode & ~S_IRWXUGO)
+		return -EINVAL;
+
+	/* FIX: verify the ->mode field makes sense */
+
+	perm->id = h->id;
+	perm->key = h->key;
+
+	if (!validate_created_perms(h))
+		return -EPERM;
+	perm->uid = h->uid;
+	perm->gid = h->gid;
+	perm->cuid = h->cuid;
+	perm->cgid = h->cgid;
+	perm->mode = h->mode;
+	perm->seq = h->seq;
+	/*
+	 * Todo: restore perm->security.
+	 * At the moment it gets set by security_x_alloc() called through
+	 * ipcget()->ipcget_public()->ops->getnew (->newque for instance)
+	 * We will want to ask the LSM to consider resetting the
+	 * checkpointed ->security, based on current_security(),
+	 * the checkpointed ->security, and the checkpoint file context.
+	 */
+
+	return 0;
+}
+
+static int restore_ipc_any(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns,
+			   int ipc_ind, int ipc_type,
+			   int (*func)(struct ckpt_ctx *ctx,
+				       struct ipc_namespace *ns))
+{
+	struct ckpt_hdr_ipc *h;
+	int n, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("ipc-%s: count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+	ret = -EINVAL;
+	if (h->ipc_type != ipc_type)
+		goto out;
+
+	ret = 0;
+	for (n = 0; n < h->ipc_count; n++) {
+		ret = (*func)(ctx, ipc_ns);
+		if (ret < 0)
+			goto out;
+	}
+ out:
+	ckpt_debug("ipc-%s: ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
+{
+	struct ipc_namespace *ipc_ns = NULL;
+	struct ckpt_hdr_ipcns *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	ret = -EINVAL;
+	if (h->shm_ctlmax < 0 || h->shm_ctlall < 0 || h->shm_ctlmni < 0)
+		goto out;
+	if (h->msg_ctlmax < 0 || h->msg_ctlmnb < 0 || h->msg_ctlmni < 0)
+		goto out;
+	if (h->sem_ctl_msl < 0 || h->sem_ctl_mns < 0 ||
+	    h->sem_ctl_opm < 0 || h->sem_ctl_mni < 0)
+		goto out;
+
+	/*
+	 * If !CONFIG_IPC_NS, do not restore the global IPC state, as
+	 * it is used by other processes. It is ok to try to restore
+	 * the {shm,msg,sem} objects: in the worst case the requested
+	 * identifiers will be in use.
+	 */
+#ifdef CONFIG_IPC_NS
+	ret = -ENOMEM;
+	ipc_ns = create_ipc_ns();
+	if (!ipc_ns)
+		goto out;
+
+	down_read(&shm_ids(ipc_ns).rw_mutex);
+	ipc_ns->shm_ctlmax = h->shm_ctlmax;
+	ipc_ns->shm_ctlall = h->shm_ctlall;
+	ipc_ns->shm_ctlmni = h->shm_ctlmni;
+	up_read(&shm_ids(ipc_ns).rw_mutex);
+
+	down_read(&msg_ids(ipc_ns).rw_mutex);
+	ipc_ns->msg_ctlmax = h->msg_ctlmax;
+	ipc_ns->msg_ctlmnb = h->msg_ctlmnb;
+	ipc_ns->msg_ctlmni = h->msg_ctlmni;
+	up_read(&msg_ids(ipc_ns).rw_mutex);
+
+	down_read(&sem_ids(ipc_ns).rw_mutex);
+	ipc_ns->sem_ctls[0] = h->sem_ctl_msl;
+	ipc_ns->sem_ctls[1] = h->sem_ctl_mns;
+	ipc_ns->sem_ctls[2] = h->sem_ctl_opm;
+	ipc_ns->sem_ctls[3] = h->sem_ctl_mni;
+	up_read(&sem_ids(ipc_ns).rw_mutex);
+#else
+	ret = -EEXIST;
+	/* complain if image contains multiple namespaces */
+	if (ctx->stats.ipc_ns)
+		goto out;
+	ipc_ns = current->nsproxy->ipc_ns;
+	get_ipc_ns(ipc_ns);
+#endif
+
+#if 0 /* NEXT FEW PATCHES */
+	ret = restore_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
+			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
+	if (ret < 0)
+		goto out;
+	ret = restore_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
+			      CKPT_HDR_IPC_MSG, restore_ipc_msg);
+	if (ret < 0)
+		goto out;
+	ret = restore_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
+			      CKPT_HDR_IPC_SEM, restore_ipc_sem);
+#endif
+	if (ret < 0)
+		goto out;
+
+	ctx->stats.ipc_ns++;
+ out:
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0) {
+		put_ipc_ns(ipc_ns);
+		ipc_ns = ERR_PTR(ret);
+	}
+	return ipc_ns;
+}
+
+void *restore_ipc_ns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_ipc_ns(ctx);
+}
diff --git a/ipc/namespace.c b/ipc/namespace.c
index a1094ff..8e5ea32 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -14,7 +14,7 @@
 
 #include "util.h"
 
-static struct ipc_namespace *create_ipc_ns(void)
+struct ipc_namespace *create_ipc_ns(void)
 {
 	struct ipc_namespace *ns;
 	int err;
diff --git a/ipc/util.h b/ipc/util.h
index 159a73c..8ae1f8e 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -12,6 +12,7 @@
 
 #include <linux/unistd.h>
 #include <linux/err.h>
+#include <linux/checkpoint.h>
 
 #define SEQ_MULTIPLIER	(IPCMNI)
 
@@ -175,4 +176,13 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 	       void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
 
+struct ipc_namespace *create_ipc_ns(void);
+
+#ifdef CONFIG_CHECKPOINT
+extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+				     struct kern_ipc_perm *perm);
+extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+				  struct kern_ipc_perm *perm);
+#endif
+
 #endif
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 4f48a68..fddc724 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -248,6 +248,7 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
 	if (ret < 0)
 		goto out;
+	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
 
 	/* TODO: collect other namespaces here */
  out:
@@ -268,6 +269,11 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (ret <= 0)
 		goto out;
 	h->uts_objref = ret;
+	ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
+	if (ret < 0)
+		goto out;
+	h->ipc_objref = ret;
+
 	/* TODO: Write other namespaces here */
 
 	ret = ckpt_write_obj(ctx, &h->h);
@@ -287,6 +293,7 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	struct ckpt_hdr_ns *h;
 	struct nsproxy *nsproxy = NULL;
 	struct uts_namespace *uts_ns;
+	struct ipc_namespace *ipc_ns;
 	int ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
@@ -294,7 +301,8 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 		return (struct nsproxy *) h;
 
 	ret = -EINVAL;
-	if (h->uts_objref <= 0)
+	if (h->uts_objref <= 0 ||
+	    h->ipc_objref <= 0)
 		goto out;
 
 	uts_ns = ckpt_obj_fetch(ctx, h->uts_objref, CKPT_OBJ_UTS_NS);
@@ -302,8 +310,13 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 		ret = PTR_ERR(uts_ns);
 		goto out;
 	}
+	ipc_ns = ckpt_obj_fetch(ctx, h->ipc_objref, CKPT_OBJ_IPC_NS);
+	if (IS_ERR(ipc_ns)) {
+		ret = PTR_ERR(ipc_ns);
+		goto out;
+	}
 
-#if defined(COFNIG_UTS_NS)
+#if defined(CONFIG_UTS_NS) || defined(CONFIG_IPC_NS)
 	ret = -ENOMEM;
 	nsproxy = create_nsproxy();
 	if (!nsproxy)
@@ -311,9 +324,9 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 
 	get_uts_ns(uts_ns);
 	nsproxy->uts_ns = uts_ns;
-
-	get_ipc_ns(current->nsproxy->ipc_ns);
+	get_ipc_ns(ipc_ns);
 	nsproxy->ipc_ns = ipc_ns;
+
 	get_pid_ns(current->nsproxy->pid_ns);
 	nsproxy->pid_ns = current->nsproxy->pid_ns;
 	get_mnt_ns(current->nsproxy->mnt_ns);
@@ -325,6 +338,7 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	get_nsproxy(nsproxy);
 
 	BUG_ON(nsproxy->uts_ns != uts_ns);
+	BUG_ON(nsproxy->ipc_ns != ipc_ns);
 #endif
 
 	/* TODO: add more namespaces here */
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 51/80] c/r: save and restore sysvipc namespace basics
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Add helpers to checkpoint and restore the contents of 'struct
kern_ipc_perm'. Add header structures for ipc state. Add placeholders
to save and restore ipc state.

Save and restore the common state (parameters) of the ipc namespace.

Add generic code to iterate through the objects of sysvipc shared
memory, message queues and semaphores. The logic to save and restore
the state of these objects will be added in the next few patches.
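
As a rough illustration of the generic iteration on the restore side,
the sketch below reads a type/count header, rejects a type mismatch,
and calls a per-object callback once per recorded object, stopping at
the first error. It is a userspace model only: demo_ipc_hdr,
demo_restore_all and demo_count_cb are hypothetical names, not symbols
from this patch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical header: which kind of ipc object follows, and how many. */
struct demo_ipc_hdr {
	uint32_t ipc_type;
	uint32_t ipc_count;
};

/* Hypothetical per-object restore callback; returns 0 or a negative errno. */
typedef int (*demo_restore_fn)(void *state);

/*
 * Verify that the header describes the expected object type, then
 * invoke the callback once per object, stopping at the first failure.
 */
static int demo_restore_all(const struct demo_ipc_hdr *h,
			    uint32_t expected_type,
			    demo_restore_fn fn, void *state)
{
	if (h->ipc_type != expected_type)
		return -EINVAL;

	for (uint32_t n = 0; n < h->ipc_count; n++) {
		int ret = fn(state);

		if (ret < 0)
			return ret;
	}
	return 0;
}

/* Example callback: counts its invocations and fails on the third one. */
static int demo_count_cb(void *state)
{
	int *calls = state;

	return (++*calls == 3) ? -EIO : 0;
}
```

A caller would pass one callback per object class (shm, msg, sem), so
the loop itself stays independent of the object format.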

Right now, we return -EPERM if the user calling sys_restart() isn't
allowed to create an object with the checkpointed uid.  We may prefer
to simply use the caller's uid in that case - but that could lead to
subtle userspace bugs?  Unsure, so going for the stricter behavior.
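
The stricter behavior can be pictured with the userspace sketch below.
It assumes a simplified credential model (one uid, one euid, a single
group, and a flag standing in for capable(CAP_IPC_OWNER)); demo_cred,
demo_perms and demo_check_perms are hypothetical names used only for
illustration and are not the kernel's types.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the caller's credentials. */
struct demo_cred {
	uint32_t uid, euid;
	uint32_t gid;		/* single group, for simplicity */
	bool cap_ipc_owner;	/* models capable(CAP_IPC_OWNER) */
};

/* Hypothetical stand-in for the checkpointed ownership fields. */
struct demo_perms {
	uint32_t uid, gid, cuid, cgid;
};

/*
 * Return 0 if the caller may recreate an object with the checkpointed
 * owner and creator ids, -EPERM otherwise. A privileged caller bypasses
 * the identity checks.
 */
static int demo_check_perms(const struct demo_cred *c,
			    const struct demo_perms *p)
{
	bool owner_ok = (p->uid == c->uid || p->uid == c->euid);
	bool creator_ok = (p->cuid == c->uid || p->cuid == c->euid);
	bool group_ok = (p->gid == c->gid && p->cgid == c->gid);

	if (owner_ok && creator_ok && group_ok)
		return 0;
	return c->cap_ipc_owner ? 0 : -EPERM;
}
```

Falling back to the caller's uid instead would amount to replacing
the -EPERM result with a rewrite of the ownership fields; the sketch
deliberately fails instead, matching the choice described above.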

TODO: restore kern_ipc_perms->security.

Changelog[v17]:
  - Fix include: use checkpoint.h not checkpoint_hdr.h
  - Collect nsproxy->ipc_ns
  - Restore objects in the right namespace
  - If !CONFIG_IPC_NS only restore objects, not global settings
  - Don't overwrite global ipc-ns if !CONFIG_IPC_NS
  - Reset the checkpointed uid and gid info on ipc objects
  - Fix compilation with CONFIG_SYSVIPC=n
Changelog [Dan Smith <danms@us.ibm.com>]
  - Fix compilation with CONFIG_SYSVIPC=n
  - Update to match UTS changes

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c          |    2 -
 checkpoint/objhash.c             |   28 ++++
 include/linux/checkpoint.h       |   13 ++
 include/linux/checkpoint_hdr.h   |   54 +++++++
 include/linux/checkpoint_types.h |    1 +
 init/Kconfig                     |    6 +
 ipc/Makefile                     |    2 +-
 ipc/checkpoint.c                 |  317 ++++++++++++++++++++++++++++++++++++++
 ipc/namespace.c                  |    2 +-
 ipc/util.h                       |   10 ++
 kernel/nsproxy.c                 |   22 ++-
 11 files changed, 449 insertions(+), 8 deletions(-)
 create mode 100644 ipc/checkpoint.c

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 2a193b3..e35c5a8 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -424,8 +424,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	rcu_read_lock();
 	nsproxy = task_nsproxy(t);
-	if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
-		ret = -EPERM;
 	if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns) {
 		__ckpt_write_err(ctx, "TE", "bad mnt_ns", -EPERM);
 		ret = -EPERM;
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 2fd00a6..15a5caf 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -15,6 +15,8 @@
 #include <linux/hash.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -154,6 +156,22 @@ static int obj_uts_ns_users(void *ptr)
 	return atomic_read(&((struct uts_namespace *) ptr)->kref.refcount);
 }
 
+static int obj_ipc_ns_grab(void *ptr)
+{
+	get_ipc_ns((struct ipc_namespace *) ptr);
+	return 0;
+}
+
+static void obj_ipc_ns_drop(void *ptr, int lastref)
+{
+	put_ipc_ns((struct ipc_namespace *) ptr);
+}
+
+static int obj_ipc_ns_users(void *ptr)
+{
+	return atomic_read(&((struct ipc_namespace *) ptr)->count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -219,6 +237,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_uts_ns,
 		.restore = restore_uts_ns,
 	},
+	/* ipc_ns object */
+	{
+		.obj_name = "IPC_NS",
+		.obj_type = CKPT_OBJ_IPC_NS,
+		.ref_drop = obj_ipc_ns_drop,
+		.ref_grab = obj_ipc_ns_grab,
+		.ref_users = obj_ipc_ns_users,
+		.checkpoint = checkpoint_ipc_ns,
+		.restore = restore_ipc_ns,
+	},
 };
 
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index de3537a..c8501b8 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -22,6 +22,9 @@
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
 
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
@@ -151,6 +154,15 @@ extern void *restore_ns(struct ckpt_ctx *ctx);
 extern int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_uts_ns(struct ckpt_ctx *ctx);
 
+/* ipc-ns */
+#ifdef CONFIG_SYSVIPC
+extern int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_ipc_ns(struct ckpt_ctx *ctx);
+#else
+#define checkpoint_ipc_ns  checkpoint_bad
+#define restore_ipc_ns  restore_bad
+#endif /* CONFIG_SYSVIPC */
+
 /* file table */
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
@@ -224,6 +236,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DFILE	0x10		/* files and filesystem */
 #define CKPT_DMEM	0x20		/* memory state */
 #define CKPT_DPAGE	0x40		/* memory pages */
+#define CKPT_DIPC	0x80		/* sysvipc */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0da2f15..1ecec60 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -60,6 +60,7 @@ enum {
 	CKPT_HDR_CPU,
 	CKPT_HDR_NS,
 	CKPT_HDR_UTS_NS,
+	CKPT_HDR_IPC_NS,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -74,6 +75,11 @@ enum {
 	CKPT_HDR_PGARR,
 	CKPT_HDR_MM_CONTEXT,
 
+	CKPT_HDR_IPC = 501,
+	CKPT_HDR_IPC_SHM,
+	CKPT_HDR_IPC_MSG,
+	CKPT_HDR_IPC_SEM,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -100,6 +106,7 @@ enum obj_type {
 	CKPT_OBJ_MM,
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
+	CKPT_OBJ_IPC_NS,
 	CKPT_OBJ_MAX
 };
 
@@ -191,6 +198,7 @@ struct ckpt_hdr_task_ns {
 struct ckpt_hdr_ns {
 	struct ckpt_hdr h;
 	__s32 uts_objref;
+	__u32 ipc_objref;
 } __attribute__((aligned(8)));
 
 /* cannot include <linux/tty.h> from userspace, so define: */
@@ -332,4 +340,50 @@ struct ckpt_hdr_pgarr {
 } __attribute__((aligned(8)));
 
 
+/* ipc commons */
+struct ckpt_hdr_ipcns {
+	struct ckpt_hdr h;
+	__u64 shm_ctlmax;
+	__u64 shm_ctlall;
+	__s32 shm_ctlmni;
+
+	__s32 msg_ctlmax;
+	__s32 msg_ctlmnb;
+	__s32 msg_ctlmni;
+
+	__s32 sem_ctl_msl;
+	__s32 sem_ctl_mns;
+	__s32 sem_ctl_opm;
+	__s32 sem_ctl_mni;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc {
+	struct ckpt_hdr h;
+	__u32 ipc_type;
+	__u32 ipc_count;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc_perms {
+	__s32 id;
+	__u32 key;
+	__u32 uid;
+	__u32 gid;
+	__u32 cuid;
+	__u32 cgid;
+	__u32 mode;
+	__u32 _padding;
+	__u64 seq;
+} __attribute__((aligned(8)));
+
+
+#define CKPT_TST_OVERFLOW_16(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
+
+#define CKPT_TST_OVERFLOW_32(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > INT_MAX))
+
+#define CKPT_TST_OVERFLOW_64(a, b) \
+	((sizeof(a) > sizeof(b)) && ((a) > LONG_MAX))
+
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 7d1b8c8..7e6051f 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -24,6 +24,7 @@
 
 struct ckpt_stats {
 	int uts_ns;
+	int ipc_ns;
 };
 
 struct ckpt_ctx {
diff --git a/init/Kconfig b/init/Kconfig
index 46329b9..e601a61 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -191,6 +191,12 @@ config SYSVIPC
 	  section 6.4 of the Linux Programmer's Guide, available from
 	  <http://www.tldp.org/guides.html>.
 
+config SYSVIPC_CHECKPOINT
+	bool
+	depends on SYSVIPC
+	depends on CHECKPOINT
+	default y
+
 config SYSVIPC_SYSCTL
 	bool
 	depends on SYSVIPC
diff --git a/ipc/Makefile b/ipc/Makefile
index 4e1955e..b747127 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,4 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
new file mode 100644
index 0000000..4eb1a97
--- /dev/null
+++ b/ipc/checkpoint.c
@@ -0,0 +1,317 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/ipc.h>
+#include <linux/msg.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "util.h"
+
+/* for ckpt_debug */
+static char *ipc_ind_to_str[] = { "sem", "msg", "shm" };
+
+#define shm_ids(ns)	((ns)->ids[IPC_SHM_IDS])
+#define msg_ids(ns)	((ns)->ids[IPC_MSG_IDS])
+#define sem_ids(ns)	((ns)->ids[IPC_SEM_IDS])
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+			      struct kern_ipc_perm *perm)
+{
+	if (ipcperms(perm, S_IROTH))
+		return -EACCES;
+
+	h->id = perm->id;
+	h->key = perm->key;
+	h->uid = perm->uid;
+	h->gid = perm->gid;
+	h->cuid = perm->cuid;
+	h->cgid = perm->cgid;
+	h->mode = perm->mode & S_IRWXUGO;
+	h->seq = perm->seq;
+
+	return 0;
+}
+
+static int checkpoint_ipc_any(struct ckpt_ctx *ctx,
+			      struct ipc_namespace *ipc_ns,
+			      int ipc_ind, int ipc_type,
+			      int (*func)(int id, void *p, void *data))
+{
+	struct ckpt_hdr_ipc *h;
+	struct ipc_ids *ipc_ids = &ipc_ns->ids[ipc_ind];
+	int ret = -ENOMEM;
+
+	down_read(&ipc_ids->rw_mutex);
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+	if (!h)
+		goto out;
+
+	h->ipc_type = ipc_type;
+	h->ipc_count = ipc_ids->in_use;
+	ckpt_debug("ipc-%s count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ret = idr_for_each(&ipc_ids->ipcs_idr, func, ctx);
+	ckpt_debug("ipc-%s ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+ out:
+	up_read(&ipc_ids->rw_mutex);
+	return ret;
+}
+
+static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
+				struct ipc_namespace *ipc_ns)
+{
+	struct ckpt_hdr_ipcns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+	if (!h)
+		return -ENOMEM;
+
+	down_read(&shm_ids(ipc_ns).rw_mutex);
+	h->shm_ctlmax = ipc_ns->shm_ctlmax;
+	h->shm_ctlall = ipc_ns->shm_ctlall;
+	h->shm_ctlmni = ipc_ns->shm_ctlmni;
+	up_read(&shm_ids(ipc_ns).rw_mutex);
+
+	down_read(&msg_ids(ipc_ns).rw_mutex);
+	h->msg_ctlmax = ipc_ns->msg_ctlmax;
+	h->msg_ctlmnb = ipc_ns->msg_ctlmnb;
+	h->msg_ctlmni = ipc_ns->msg_ctlmni;
+	up_read(&msg_ids(ipc_ns).rw_mutex);
+
+	down_read(&sem_ids(ipc_ns).rw_mutex);
+	h->sem_ctl_msl = ipc_ns->sem_ctls[0];
+	h->sem_ctl_mns = ipc_ns->sem_ctls[1];
+	h->sem_ctl_opm = ipc_ns->sem_ctls[2];
+	h->sem_ctl_mni = ipc_ns->sem_ctls[3];
+	up_read(&sem_ids(ipc_ns).rw_mutex);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+#if 0 /* NEXT FEW PATCHES */
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
+				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
+	if (ret < 0)
+		return ret;
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
+				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
+	if (ret < 0)
+		return ret;
+	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
+				 CKPT_HDR_IPC_SEM, checkpoint_ipc_sem);
+#endif
+	return ret;
+}
+
+int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_ipc_ns(ctx, (struct ipc_namespace *) ptr);
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/*
+ * check whether current task may create ipc object with
+ * checkpointed uids and gids.
+ * Return 1 if ok, 0 if not.
+ */
+static int validate_created_perms(struct ckpt_hdr_ipc_perms *h)
+{
+	const struct cred *cred = current_cred();
+	uid_t uid = cred->uid, euid = cred->euid;
+
+	/* actually I don't know - is CAP_IPC_OWNER the right one? */
+	if (((h->uid != uid && h->uid != euid) ||
+			(h->cuid != uid && h->cuid != euid) ||
+			!in_group_p(h->cgid) ||
+			!in_group_p(h->gid)) &&
+			!capable(CAP_IPC_OWNER))
+		return 0;
+	return 1;
+}
+
+int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+			   struct kern_ipc_perm *perm)
+{
+	if (h->id < 0)
+		return -EINVAL;
+	if (CKPT_TST_OVERFLOW_16(h->uid, perm->uid) ||
+	    CKPT_TST_OVERFLOW_16(h->gid, perm->gid) ||
+	    CKPT_TST_OVERFLOW_16(h->cuid, perm->cuid) ||
+	    CKPT_TST_OVERFLOW_16(h->cgid, perm->cgid) ||
+	    CKPT_TST_OVERFLOW_16(h->mode, perm->mode))
+		return -EINVAL;
+	if (h->seq >= USHORT_MAX)
+		return -EINVAL;
+	if (h->mode & ~S_IRWXUGO)
+		return -EINVAL;
+
+	/* FIX: verify the ->mode field makes sense */
+
+	perm->id = h->id;
+	perm->key = h->key;
+
+	if (!validate_created_perms(h))
+		return -EPERM;
+	perm->uid = h->uid;
+	perm->gid = h->gid;
+	perm->cuid = h->cuid;
+	perm->cgid = h->cgid;
+	perm->mode = h->mode;
+	perm->seq = h->seq;
+	/*
+	 * Todo: restore perm->security.
+	 * At the moment it gets set by security_x_alloc() called through
+	 * ipcget()->ipcget_public()->ops->getnew (->newque for instance)
+	 * We will want to ask the LSM to consider resetting the
+	 * checkpointed ->security, based on current_security(),
+	 * the checkpointed ->security, and the checkpoint file context.
+	 */
+
+	return 0;
+}
+
+static int restore_ipc_any(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns,
+			   int ipc_ind, int ipc_type,
+			   int (*func)(struct ckpt_ctx *ctx,
+				       struct ipc_namespace *ns))
+{
+	struct ckpt_hdr_ipc *h;
+	int n, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_debug("ipc-%s: count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+	ret = -EINVAL;
+	if (h->ipc_type != ipc_type)
+		goto out;
+
+	ret = 0;
+	for (n = 0; n < h->ipc_count; n++) {
+		ret = (*func)(ctx, ipc_ns);
+		if (ret < 0)
+			goto out;
+	}
+ out:
+	ckpt_debug("ipc-%s: ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
+{
+	struct ipc_namespace *ipc_ns = NULL;
+	struct ckpt_hdr_ipcns *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	ret = -EINVAL;
+	if (h->shm_ctlmax < 0 || h->shm_ctlall < 0 || h->shm_ctlmni < 0)
+		goto out;
+	if (h->msg_ctlmax < 0 || h->msg_ctlmnb < 0 || h->msg_ctlmni < 0)
+		goto out;
+	if (h->sem_ctl_msl < 0 || h->sem_ctl_mns < 0 ||
+	    h->sem_ctl_opm < 0 || h->sem_ctl_mni < 0)
+		goto out;
+
+	/*
+	 * If !CONFIG_IPC_NS, do not restore the global IPC state, as
+	 * it is used by other processes. It is ok to try to restore
+	 * the {shm,msg,sem} objects: in the worst case the requested
+	 * identifiers will be in use.
+	 */
+#ifdef CONFIG_IPC_NS
+	ret = -ENOMEM;
+	ipc_ns = create_ipc_ns();
+	if (!ipc_ns)
+		goto out;
+
+	down_read(&shm_ids(ipc_ns).rw_mutex);
+	ipc_ns->shm_ctlmax = h->shm_ctlmax;
+	ipc_ns->shm_ctlall = h->shm_ctlall;
+	ipc_ns->shm_ctlmni = h->shm_ctlmni;
+	up_read(&shm_ids(ipc_ns).rw_mutex);
+
+	down_read(&msg_ids(ipc_ns).rw_mutex);
+	ipc_ns->msg_ctlmax = h->msg_ctlmax;
+	ipc_ns->msg_ctlmnb = h->msg_ctlmnb;
+	ipc_ns->msg_ctlmni = h->msg_ctlmni;
+	up_read(&msg_ids(ipc_ns).rw_mutex);
+
+	down_read(&sem_ids(ipc_ns).rw_mutex);
+	ipc_ns->sem_ctls[0] = h->sem_ctl_msl;
+	ipc_ns->sem_ctls[1] = h->sem_ctl_mns;
+	ipc_ns->sem_ctls[2] = h->sem_ctl_opm;
+	ipc_ns->sem_ctls[3] = h->sem_ctl_mni;
+	up_read(&sem_ids(ipc_ns).rw_mutex);
+#else
+	ret = -EEXIST;
+	/* complain if image contains multiple namespaces */
+	if (ctx->stats.ipc_ns)
+		goto out;
+	ipc_ns = current->nsproxy->ipc_ns;
+	get_ipc_ns(ipc_ns);
+#endif
+
+#if 0 /* NEXT FEW PATCHES */
+	ret = restore_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
+			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
+	if (ret < 0)
+		goto out;
+	ret = restore_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
+			      CKPT_HDR_IPC_MSG, restore_ipc_msg);
+	if (ret < 0)
+		goto out;
+	ret = restore_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
+			      CKPT_HDR_IPC_SEM, restore_ipc_sem);
+#endif
+	if (ret < 0)
+		goto out;
+
+	ctx->stats.ipc_ns++;
+ out:
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0) {
+		put_ipc_ns(ipc_ns);
+		ipc_ns = ERR_PTR(ret);
+	}
+	return ipc_ns;
+}
+
+void *restore_ipc_ns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_ipc_ns(ctx);
+}
diff --git a/ipc/namespace.c b/ipc/namespace.c
index a1094ff..8e5ea32 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -14,7 +14,7 @@
 
 #include "util.h"
 
-static struct ipc_namespace *create_ipc_ns(void)
+struct ipc_namespace *create_ipc_ns(void)
 {
 	struct ipc_namespace *ns;
 	int err;
diff --git a/ipc/util.h b/ipc/util.h
index 159a73c..8ae1f8e 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -12,6 +12,7 @@
 
 #include <linux/unistd.h>
 #include <linux/err.h>
+#include <linux/checkpoint.h>
 
 #define SEQ_MULTIPLIER	(IPCMNI)
 
@@ -175,4 +176,13 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 	       void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
 
+struct ipc_namespace *create_ipc_ns(void);
+
+#ifdef CONFIG_CHECKPOINT
+extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+				     struct kern_ipc_perm *perm);
+extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+				  struct kern_ipc_perm *perm);
+#endif
+
 #endif
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 4f48a68..fddc724 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -248,6 +248,7 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
 	if (ret < 0)
 		goto out;
+	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
 
 	/* TODO: collect other namespaces here */
  out:
@@ -268,6 +269,11 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (ret <= 0)
 		goto out;
 	h->uts_objref = ret;
+	ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
+	if (ret < 0)
+		goto out;
+	h->ipc_objref = ret;
+
 	/* TODO: Write other namespaces here */
 
 	ret = ckpt_write_obj(ctx, &h->h);
@@ -287,6 +293,7 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	struct ckpt_hdr_ns *h;
 	struct nsproxy *nsproxy = NULL;
 	struct uts_namespace *uts_ns;
+	struct ipc_namespace *ipc_ns;
 	int ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
@@ -294,7 +301,8 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 		return (struct nsproxy *) h;
 
 	ret = -EINVAL;
-	if (h->uts_objref <= 0)
+	if (h->uts_objref <= 0 ||
+	    h->ipc_objref <= 0)
 		goto out;
 
 	uts_ns = ckpt_obj_fetch(ctx, h->uts_objref, CKPT_OBJ_UTS_NS);
@@ -302,8 +310,13 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 		ret = PTR_ERR(uts_ns);
 		goto out;
 	}
+	ipc_ns = ckpt_obj_fetch(ctx, h->ipc_objref, CKPT_OBJ_IPC_NS);
+	if (IS_ERR(ipc_ns)) {
+		ret = PTR_ERR(ipc_ns);
+		goto out;
+	}
 
-#if defined(COFNIG_UTS_NS)
+#if defined(CONFIG_UTS_NS) || defined(CONFIG_IPC_NS)
 	ret = -ENOMEM;
 	nsproxy = create_nsproxy();
 	if (!nsproxy)
@@ -311,9 +324,9 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 
 	get_uts_ns(uts_ns);
 	nsproxy->uts_ns = uts_ns;
-
-	get_ipc_ns(current->nsproxy->ipc_ns);
+	get_ipc_ns(ipc_ns);
 	nsproxy->ipc_ns = ipc_ns;
+
 	get_pid_ns(current->nsproxy->pid_ns);
 	nsproxy->pid_ns = current->nsproxy->pid_ns;
 	get_mnt_ns(current->nsproxy->mnt_ns);
@@ -325,6 +338,7 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	get_nsproxy(nsproxy);
 
 	BUG_ON(nsproxy->uts_ns != uts_ns);
+	BUG_ON(nsproxy->ipc_ns != ipc_ns);
 #endif
 
 	/* TODO: add more namespaces here */
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 52/80] c/r: support share-memory sysv-ipc
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (50 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 51/80] c/r: save and restore sysvipc namespace basics Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 53/80] c/r: support message-queues sysv-ipc Oren Laadan
                     ` (29 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Checkpoint of sysvipc shared memory is performed in two steps: first,
the entire ipc namespace is dumped as a whole by iterating through all
shm objects and dumping the contents of each one. The shmem inode is
registered in the objhash. Second, for each vma that refers to ipc
shared memory we find the inode in the objhash, and save the objref.

(If we find a new inode, that indicates that the ipc namespace is not
entirely frozen and someone must have manipulated it since step 1).
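
The two steps can be modelled with a very small lookup table, as in
the userspace sketch below. It assumes a fixed-size array in place of
the real objhash and uses 0 to mean "not seen in step 1";
demo_objhash, demo_obj_register and demo_obj_lookup are hypothetical
names, not the interfaces used by this patch.

```c
#include <stddef.h>

#define DEMO_MAX_OBJS 16

/* Hypothetical miniature objhash: maps an object pointer to a 1-based objref. */
struct demo_objhash {
	const void *objs[DEMO_MAX_OBJS];
	int count;
};

/*
 * Step 1: while dumping the ipc namespace, register each shm object.
 * Returns the new objref, or -1 if the table is full.
 */
static int demo_obj_register(struct demo_objhash *h, const void *obj)
{
	if (h->count >= DEMO_MAX_OBJS)
		return -1;
	h->objs[h->count++] = obj;
	return h->count;
}

/*
 * Step 2: when a vma refers to ipc shared memory, look the object up.
 * Returns its objref, or 0 if it was never registered, which would
 * indicate that the namespace changed after step 1.
 */
static int demo_obj_lookup(const struct demo_objhash *h, const void *obj)
{
	for (int i = 0; i < h->count; i++)
		if (h->objs[i] == obj)
			return i + 1;
	return 0;
}
```

In this model the vma record only needs to store the returned objref,
since the object contents were already written during step 1.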

Handling of shm objects that have been deleted (via IPC_RMID) is left
to a later patch in this series.

Changelog[v18]:
  - Collect files used by shm objects
  - Use file instead of inode as shared object during checkpoint
Changelog[v17]:
  - Restore objects in the right namespace
  - Properly initialize ctx->deferqueue
  - Fix compilation with CONFIG_CHECKPOINT=n

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c          |    5 +
 checkpoint/memory.c              |   28 ++++-
 checkpoint/restart.c             |    5 +
 checkpoint/sys.c                 |    7 +
 include/linux/checkpoint.h       |   10 ++
 include/linux/checkpoint_hdr.h   |   19 +++-
 include/linux/checkpoint_types.h |    1 +
 include/linux/shm.h              |   15 ++
 ipc/Makefile                     |    2 +-
 ipc/checkpoint.c                 |   25 +++-
 ipc/checkpoint_shm.c             |  277 ++++++++++++++++++++++++++++++++++++++
 ipc/shm.c                        |   84 +++++++++++-
 ipc/util.h                       |    9 ++
 kernel/nsproxy.c                 |    8 +
 mm/shmem.c                       |    2 +-
 15 files changed, 482 insertions(+), 15 deletions(-)
 create mode 100644 ipc/checkpoint_shm.c

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e35c5a8..27eec5c 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -24,6 +24,7 @@
 #include <linux/utsname.h>
 #include <linux/magic.h>
 #include <linux/hrtimer.h>
+#include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -793,6 +794,10 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		goto out;
 
+	ret = deferqueue_run(ctx->deferqueue);  /* run deferred work */
+	if (ret < 0)
+		goto out;
+
 	/* verify that all objects were indeed visited */
 	if (!ckpt_obj_visited(ctx)) {
 		ckpt_write_err(ctx, "E", "leak: unvisited", -EBUSY);
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index f765993..0da948f 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -20,6 +20,7 @@
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
+#include <linux/shm.h>
 #include <linux/proc_fs.h>
 #include <linux/swap.h>
 #include <linux/checkpoint.h>
@@ -459,9 +460,9 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
  * virtual addresses into ctx->pgarr_list page-array chain. Then dump
  * the addresses, followed by the page contents.
  */
-static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
-				      struct vm_area_struct *vma,
-				      struct inode *inode)
+int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+			       struct vm_area_struct *vma,
+			       struct inode *inode)
 {
 	struct ckpt_hdr_pgarr *h;
 	unsigned long addr, end;
@@ -1084,6 +1085,13 @@ static int anon_private_restore(struct ckpt_ctx *ctx,
 	return private_vma_restore(ctx, mm, NULL, h);
 }
 
+static int bad_vma_restore(struct ckpt_ctx *ctx,
+			   struct mm_struct *mm,
+			   struct ckpt_hdr_vma *h)
+{
+	return -EINVAL;
+}
+
 /* callbacks to restore vma per its type: */
 struct restore_vma_ops {
 	char *vma_name;
@@ -1136,6 +1144,20 @@ static struct restore_vma_ops restore_vma_ops[] = {
 		.vma_type = CKPT_VMA_SHM_FILE,
 		.restore = filemap_restore,
 	},
+	/* sysvipc shared */
+	{
+		.vma_name = "IPC SHARED",
+		.vma_type = CKPT_VMA_SHM_IPC,
+		/* ipc inode itself is restored by restore_ipc_ns()... */
+		.restore = bad_vma_restore,
+
+	},
+	/* sysvipc shared (skip) */
+	{
+		.vma_name = "IPC SHARED (skip)",
+		.vma_type = CKPT_VMA_SHM_IPC_SKIP,
+		.restore = ipcshm_restore,
+	},
 };
 
 /**
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index e48ad68..53a759a 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -21,6 +21,7 @@
 #include <linux/utsname.h>
 #include <asm/syscall.h>
 #include <linux/elf.h>
+#include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -881,6 +882,10 @@ static int do_restore_coord(struct ckpt_ctx *ctx, pid_t pid)
 			goto out;
 	}
 
+	ret = deferqueue_run(ctx->deferqueue);  /* run deferred work */
+	if (ret < 0)
+		goto out;
+
 	ret = restore_read_tail(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 1c98eee..d6a1650 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -21,6 +21,7 @@
 #include <linux/uaccess.h>
 #include <linux/capability.h>
 #include <linux/checkpoint.h>
+#include <linux/deferqueue.h>
 
 /*
  * ckpt_unpriv_allowed - sysctl controlled, do not allow checkpoints or
@@ -191,6 +192,9 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	BUG_ON(atomic_read(&ctx->refcount));
 
+	if (ctx->deferqueue)
+		deferqueue_destroy(ctx->deferqueue);
+
 	if (ctx->files_deferq)
 		deferqueue_destroy(ctx->files_deferq);
 
@@ -243,6 +247,9 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	err = -ENOMEM;
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
+	ctx->deferqueue = deferqueue_create();
+	if (!ctx->deferqueue)
+		goto err;
 
 	ctx->files_deferq = deferqueue_create();
 	if (!ctx->files_deferq)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index c8501b8..4c1c13e 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -158,9 +158,16 @@ extern void *restore_uts_ns(struct ckpt_ctx *ctx);
 #ifdef CONFIG_SYSVIPC
 extern int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_ipc_ns(struct ckpt_ctx *ctx);
+extern int ckpt_collect_ipc_ns(struct ckpt_ctx *ctx,
+			       struct ipc_namespace *ipc_ns);
 #else
 #define checkpoint_ipc_ns  checkpoint_bad
 #define restore_ipc_ns  restore_bad
+static inline int ckpt_collect_ipc_ns(struct ckpt_ctx *ctx,
+				      struct ipc_namespace *ipc_ns)
+{
+	return 0;
+}
 #endif /* CONFIG_SYSVIPC */
 
 /* file table */
@@ -215,6 +222,9 @@ extern unsigned long generic_vma_restore(struct mm_struct *mm,
 extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
 
+extern int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma,
+				      struct inode *inode);
 extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
 
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 1ecec60..a4cf7a9 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -314,7 +314,9 @@ enum vma_type {
 	CKPT_VMA_SHM_ANON,	/* shared anonymous */
 	CKPT_VMA_SHM_ANON_SKIP,	/* shared anonymous (skip contents) */
 	CKPT_VMA_SHM_FILE,	/* shared mapped file, only msync */
-	CKPT_VMA_MAX
+	CKPT_VMA_SHM_IPC,	/* shared sysvipc */
+	CKPT_VMA_SHM_IPC_SKIP,	/* shared sysvipc (skip contents) */
+	CKPT_VMA_MAX,
 };
 
 /* vma descriptor */
@@ -364,6 +366,7 @@ struct ckpt_hdr_ipc {
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_ipc_perms {
+	struct ckpt_hdr h;
 	__s32 id;
 	__u32 key;
 	__u32 uid;
@@ -375,6 +378,20 @@ struct ckpt_hdr_ipc_perms {
 	__u64 seq;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_shm {
+	struct ckpt_hdr h;
+	struct ckpt_hdr_ipc_perms perms;
+	__u64 shm_segsz;
+	__u64 shm_atim;
+	__u64 shm_dtim;
+	__u64 shm_ctim;
+	__s32 shm_cprid;
+	__s32 shm_lprid;
+	__u32 mlock_uid;
+	__u32 flags;
+	__u32 objref;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 7e6051f..9632aa4 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -48,6 +48,7 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *deferqueue;	/* deferred c/r work */
 	struct deferqueue_head *files_deferq;	/* deferred file-table work */
 
 	struct path fs_mnt;     /* container root (FIXME) */
diff --git a/include/linux/shm.h b/include/linux/shm.h
index eca6235..94ac1a7 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -118,6 +118,21 @@ static inline int is_file_shm_hugepages(struct file *file)
 }
 #endif
 
+struct ipc_namespace;
+extern int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+		       struct shmid_ds __user *buf, int version);
+
+#ifdef CONFIG_CHECKPOINT
+#ifdef CONFIG_SYSVIPC
+struct ckpt_ctx;
+struct ckpt_hdr_vma;
+extern int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			  struct ckpt_hdr_vma *h);
+#else
+#define ipcshm_restore NULL
+#endif
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_SHM_H_ */
diff --git a/ipc/Makefile b/ipc/Makefile
index b747127..db4b076 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,4 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o checkpoint_shm.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 4eb1a97..b360861 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -113,9 +113,9 @@ static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
 	if (ret < 0)
 		return ret;
 
-#if 0 /* NEXT FEW PATCHES */
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
@@ -134,6 +134,27 @@ int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr)
 }
 
 /**************************************************************************
+ * Collect
+ */
+
+int ckpt_collect_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
+{
+	struct ipc_ids *ipc_ids;
+	int ret;
+
+	/*
+	 * Each shm object holds a reference to a file pointer, so
+	 * collect them. Nothing to do for msg and sem.
+	 */
+	ipc_ids = &ipc_ns->ids[IPC_SHM_IDS];
+	down_read(&ipc_ids->rw_mutex);
+	ret = idr_for_each(&ipc_ids->ipcs_idr, ckpt_collect_ipc_shm, ctx);
+	up_read(&ipc_ids->rw_mutex);
+
+	return ret;
+}
+
+/**************************************************************************
  * Restart
  */
 
@@ -286,9 +307,9 @@ static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
 	get_ipc_ns(ipc_ns);
 #endif
 
-#if 0 /* NEXT FEW PATCHES */
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
diff --git a/ipc/checkpoint_shm.c b/ipc/checkpoint_shm.c
new file mode 100644
index 0000000..826e430
--- /dev/null
+++ b/ipc/checkpoint_shm.c
@@ -0,0 +1,277 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc shm
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/shm.h>
+#include <linux/shmem_fs.h>
+#include <linux/hugetlb.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+#include <linux/deferqueue.h>
+
+#include <linux/msg.h>	/* needed for util.h that uses 'struct msg_msg' */
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_shm_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_shm *h,
+			    struct shmid_kernel *shp)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&shp->shm_perm);
+
+	ret = checkpoint_fill_ipc_perms(&h->perms, &shp->shm_perm);
+	if (ret < 0)
+		goto unlock;
+
+	h->shm_segsz = shp->shm_segsz;
+	h->shm_atim = shp->shm_atim;
+	h->shm_dtim = shp->shm_dtim;
+	h->shm_ctim = shp->shm_ctim;
+	h->shm_cprid = shp->shm_cprid;
+	h->shm_lprid = shp->shm_lprid;
+
+	if (shp->mlock_user)
+		h->mlock_uid = shp->mlock_user->uid;
+	else
+		h->mlock_uid = (unsigned int) -1;
+
+	h->flags = 0;
+	/* check if shm was set up with SHM_NORESERVE */
+	if (SHMEM_I(shp->shm_file->f_dentry->d_inode)->flags & VM_NORESERVE)
+		h->flags |= SHM_NORESERVE;
+	/* check if shm was set up with SHM_HUGETLB (not yet supported) */
+	if (is_file_hugepages(shp->shm_file)) {
+		pr_warning("c/r: unsupported SHM_HUGETLB\n");
+		ret = -ENOSYS;
+	}
+
+ unlock:
+	ipc_unlock(&shp->shm_perm);
+	ckpt_debug("shm: cprid %d lprid %d segsz %lld mlock %d\n",
+		 h->shm_cprid, h->shm_lprid, h->shm_segsz, h->mlock_uid);
+
+	return ret;
+}
+
+int checkpoint_ipc_shm(int id, void *p, void *data)
+{
+	struct ckpt_hdr_ipc_shm *h;
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct shmid_kernel *shp;
+	struct inode *inode;
+	int first, objref;
+	int ret;
+
+	shp = container_of(perm, struct shmid_kernel, shm_perm);
+	inode = shp->shm_file->f_dentry->d_inode;
+
+	/* we collected the file but we don't checkpoint it per-se */
+	ret = ckpt_obj_visit(ctx, shp->shm_file, CKPT_OBJ_FILE);
+	if (ret < 0)
+		return ret;
+
+	objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+	if (objref < 0)
+		return objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_SHM);
+	if (!h)
+		return -ENOMEM;
+
+	ret = fill_ipc_shm_hdr(ctx, h, shp);
+	if (ret < 0)
+		goto out;
+
+	h->objref = objref;
+	ckpt_debug("shm: objref %d\n", h->objref);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	ret = checkpoint_memory_contents(ctx, NULL, inode);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/************************************************************************
+ * ipc collect
+ */
+int ckpt_collect_ipc_shm(int id, void *p, void *data)
+{
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct shmid_kernel *shp;
+
+	shp = container_of(perm, struct shmid_kernel, shm_perm);
+	return ckpt_collect_file(ctx, shp->shm_file);
+}
+
+/************************************************************************
+ * ipc restart
+ */
+
+struct dq_ipcshm_del {
+	/*
+	 * XXX: always keep ->ipcns first so that put_ipc_ns() can
+	 * be safely provided as the dtor for this deferqueue object
+	 */
+	struct ipc_namespace *ipcns;
+	int id;
+};
+
+static int ipc_shm_delete(void *data)
+{
+	struct dq_ipcshm_del *dq = (struct dq_ipcshm_del *) data;
+	mm_segment_t old_fs;
+	int ret;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	ret = shmctl_down(dq->ipcns, dq->id, IPC_RMID, NULL, 0);
+	set_fs(old_fs);
+
+	put_ipc_ns(dq->ipcns);
+	return ret;
+}
+
+static int load_ipc_shm_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_shm *h,
+			    struct shmid_kernel *shp)
+{
+	int ret;
+
+	ret = restore_load_ipc_perms(&h->perms, &shp->shm_perm);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("shm: cprid %d lprid %d segsz %lld mlock %d\n",
+		 h->shm_cprid, h->shm_lprid, h->shm_segsz, h->mlock_uid);
+
+	if (h->shm_cprid < 0 || h->shm_lprid < 0)
+		return -EINVAL;
+
+	shp->shm_segsz = h->shm_segsz;
+	shp->shm_atim = h->shm_atim;
+	shp->shm_dtim = h->shm_dtim;
+	shp->shm_ctim = h->shm_ctim;
+	shp->shm_cprid = h->shm_cprid;
+	shp->shm_lprid = h->shm_lprid;
+
+	return 0;
+}
+
+int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
+{
+	struct ckpt_hdr_ipc_shm *h;
+	struct kern_ipc_perm *perms;
+	struct shmid_kernel *shp;
+	struct ipc_ids *shm_ids = &ns->ids[IPC_SHM_IDS];
+	struct file *file;
+	int shmflag;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_SHM);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->perms.id < 0)
+		goto out;
+
+#define CKPT_SHMFL_MASK  (SHM_NORESERVE | SHM_HUGETLB)
+	if (h->flags & ~CKPT_SHMFL_MASK)
+		goto out;
+
+	ret = -ENOSYS;
+	if (h->mlock_uid != (unsigned int) -1)	/* FIXME: support SHM_LOCK */
+		goto out;
+	if (h->flags & SHM_HUGETLB)	/* FIXME: support SHM_HUGETLB */
+		goto out;
+
+	/*
+	 * SHM_DEST means that the shm is to be deleted after creation.
+	 * However, deleting before it's actually attached is quite silly.
+	 * Instead, we defer this task until restart has succeeded.
+	 */
+	if (h->perms.mode & SHM_DEST) {
+		struct dq_ipcshm_del dq;
+
+		/* to not confuse the rest of the code */
+		h->perms.mode &= ~SHM_DEST;
+
+		dq.id = h->perms.id;
+		dq.ipcns = ns;
+		get_ipc_ns(dq.ipcns);
+
+		/* XXX can safely use put_ipc_ns() as dtor, see above */
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     (deferqueue_func_t) ipc_shm_delete,
+				     (deferqueue_func_t) put_ipc_ns);
+		if (ret < 0)
+			goto out;
+	}
+
+	shmflag = h->flags | h->perms.mode | IPC_CREAT | IPC_EXCL;
+	ckpt_debug("shm: do_shmget size %lld flag %#x id %d\n",
+		 h->shm_segsz, shmflag, h->perms.id);
+	ret = do_shmget(ns, h->perms.key, h->shm_segsz, shmflag, h->perms.id);
+	ckpt_debug("shm: do_shmget ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	down_write(&shm_ids->rw_mutex);
+
+	/* we are the sole owners/users of this ipc_ns, it can't go away */
+	perms = ipc_lock(shm_ids, h->perms.id);
+	BUG_ON(IS_ERR(perms));  /* ipc_ns is private to us */
+
+	shp = container_of(perms, struct shmid_kernel, shm_perm);
+	file = shp->shm_file;
+	get_file(file);
+
+	ret = load_ipc_shm_hdr(ctx, h, shp);
+	if (ret < 0)
+		goto mutex;
+
+	/* deposit in objhash and read contents in */
+	ret = ckpt_obj_insert(ctx, file, h->objref, CKPT_OBJ_FILE);
+	if (ret < 0)
+		goto mutex;
+	ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
+ mutex:
+	fput(file);
+	if (ret < 0) {
+		ckpt_debug("shm: need to remove (%d)\n", ret);
+		do_shm_rmid(ns, perms);
+	} else
+		ipc_unlock(perms);
+	up_write(&shm_ids->rw_mutex);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/shm.c b/ipc/shm.c
index 749dd57..26f9253 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -40,6 +40,7 @@
 #include <linux/mount.h>
 #include <linux/ipc_namespace.h>
 #include <linux/ima.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 
@@ -305,6 +306,74 @@ int is_file_shm_hugepages(struct file *file)
 	return ret;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int ipcshm_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	int ino_objref;
+	int first;
+
+	ino_objref = ckpt_obj_lookup_add(ctx, vma->vm_file->f_dentry->d_inode,
+				       CKPT_OBJ_INODE, &first);
+	if (ino_objref < 0)
+		return ino_objref;
+
+	/*
+	 * This shouldn't happen, because all IPC regions should already
+	 * have been dumped by now via the ipc namespaces. It means
+	 * the ipc_ns has been modified during checkpoint.
+	 */
+	if (first)
+		return -EBUSY;
+
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_SHM_IPC_SKIP,
+				      0, ino_objref);
+}
+
+int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+		   struct ckpt_hdr_vma *h)
+{
+	struct file *file;
+	int shmid, shmflg = 0;
+	mm_segment_t old_fs;
+	unsigned long start;
+	unsigned long addr;
+	int ret;
+
+	if (!h->ino_objref)
+		return -EINVAL;
+	/* FIX: verify the vm_flags too */
+
+	file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	shmid = file->f_dentry->d_inode->i_ino;
+
+	if (!(h->vm_flags & VM_WRITE))
+		shmflg |= SHM_RDONLY;
+
+	/*
+	 * FIX: do_shmat() has a limited interface: all-or-nothing
+	 * mapping. If the vma, however, reflects a partial mapping
+	 * then we need to modify that function to accomplish the
+	 * desired outcome.  A partial mapping can exist when the user
+	 * calls shmat() and then unmaps part of the region.
+	 * Currently, we at least detect this and call it foul play.
+	 */
+	if (((h->vm_end - h->vm_start) != h->ino_size) || h->vm_pgoff)
+		return -ENOSYS;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	start = h->vm_start;
+	ret = do_shmat(shmid, (char __user *) start, shmflg, &addr);
+	set_fs(old_fs);
+
+	BUG_ON(ret >= 0 && addr != h->vm_start);
+	return ret;
+}
+#endif
+
 static const struct file_operations shm_file_operations = {
 	.mmap		= shm_mmap,
 	.fsync		= shm_fsync,
@@ -320,6 +389,9 @@ static struct vm_operations_struct shm_vm_ops = {
 	.set_policy = shm_set_policy,
 	.get_policy = shm_get_policy,
 #endif
+#if defined(CONFIG_CHECKPOINT)
+	.checkpoint = ipcshm_checkpoint,
+#endif
 };
 
 /**
@@ -447,14 +519,12 @@ static inline int shm_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-int do_shmget(key_t key, size_t size, int shmflg, int req_id)
+int do_shmget(struct ipc_namespace *ns, key_t key, size_t size,
+	      int shmflg, int req_id)
 {
-	struct ipc_namespace *ns;
 	struct ipc_ops shm_ops;
 	struct ipc_params shm_params;
 
-	ns = current->nsproxy->ipc_ns;
-
 	shm_ops.getnew = newseg;
 	shm_ops.associate = shm_security;
 	shm_ops.more_checks = shm_more_checks;
@@ -468,7 +538,7 @@ int do_shmget(key_t key, size_t size, int shmflg, int req_id)
 
 SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
 {
-	return do_shmget(key, size, shmflg, -1);
+	return do_shmget(current->nsproxy->ipc_ns, key, size, shmflg, -1);
 }
 
 static inline unsigned long copy_shmid_to_user(void __user *buf, struct shmid64_ds *in, int version)
@@ -599,8 +669,8 @@ static void shm_get_stat(struct ipc_namespace *ns, unsigned long *rss,
  * to be held in write mode.
  * NOTE: no locks must be held, the rw_mutex is taken inside this function.
  */
-static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
-		       struct shmid_ds __user *buf, int version)
+int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+		struct shmid_ds __user *buf, int version)
 {
 	struct kern_ipc_perm *ipcp;
 	struct shmid64_ds shmid64;
diff --git a/ipc/util.h b/ipc/util.h
index 8ae1f8e..e0007dc 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -178,11 +178,20 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 
 struct ipc_namespace *create_ipc_ns(void);
 
+int do_shmget(struct ipc_namespace *ns, key_t key, size_t size, int shmflg,
+	      int req_id);
+void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
+
+
 #ifdef CONFIG_CHECKPOINT
 extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 				     struct kern_ipc_perm *perm);
 extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 				  struct kern_ipc_perm *perm);
+
+extern int ckpt_collect_ipc_shm(int id, void *p, void *data);
+extern int checkpoint_ipc_shm(int id, void *p, void *data);
+extern int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
 #endif
 
 #endif
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index fddc724..e7aaa00 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -249,6 +249,14 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (ret < 0)
 		goto out;
 	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
+	if (ret < 0)
+		goto out;
+	/*
+	 * ipc_ns (shm) may keep references to files: if this is the
+	 * first time we see this ipc_ns (ret > 0), proceed inside.
+	 */
+	if (ret)
+		ret = ckpt_collect_ipc_ns(ctx, nsproxy->ipc_ns);
 
 	/* TODO: collect other namespaces here */
  out:
diff --git a/mm/shmem.c b/mm/shmem.c
index d1e348f..2cfff8d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2387,7 +2387,7 @@ static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 {
 	enum vma_type vma_type;
 	int ino_objref;
-	int first;
+	int first, ret;
 
 	/* should be private anonymous ... verify that this is the case */
 	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
-- 
1.6.0.4

* [PATCH v18 52/80] c/r: support share-memory sysv-ipc
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Checkpoint of sysvipc shared memory is performed in two steps: first,
the entire ipc namespace is dumped as a whole by iterating through all
shm objects and dumping the contents of each one; each object's shmem
inode is registered in the objhash. Second, for each vma that refers to
ipc shared memory, we look up the inode in the objhash and save its objref.

(If we find a new inode, the ipc namespace was not entirely frozen and
someone must have manipulated it since step 1; the checkpoint of that
vma then fails with -EBUSY).
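
The two steps above can be sketched in plain userspace C. This is only
an illustration of the idea, not the kernel code: the names objhash,
obj_lookup_add(), dump_ipc_ns(), dump_vma(), toy_inode and toy_vma are
hypothetical stand-ins, and the hash is a fixed-size array. The only
detail taken from the patch is that a vma whose inode was not seen in
step 1 is rejected with -EBUSY.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_OBJS 16

/* Toy stand-in for the object hash: maps a pointer to an objref. */
struct objhash {
	const void *ptr[MAX_OBJS];
	int count;
};

/* Hypothetical stand-ins for a shmem inode and a vma that maps it. */
struct toy_inode {
	int ino;
};

struct toy_vma {
	const struct toy_inode *inode;
	int objref;
};

/*
 * Look up ptr and add it if it is absent. Returns a positive objref
 * (objrefs start at 1), or -ENOSPC if the table is full. *first is
 * set to 1 only when this call added the pointer.
 */
int obj_lookup_add(struct objhash *h, const void *ptr, int *first)
{
	int i;

	for (i = 0; i < h->count; i++) {
		if (h->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (h->count == MAX_OBJS)
		return -ENOSPC;
	h->ptr[h->count++] = ptr;
	*first = 1;
	return h->count;
}

/* Step 1: walk every shm object of the namespace, register its inode. */
int dump_ipc_ns(struct objhash *h, const struct toy_inode *shm, int n)
{
	int i, first, objref;

	for (i = 0; i < n; i++) {
		objref = obj_lookup_add(h, &shm[i], &first);
		if (objref < 0)
			return objref;
		/* real code would write the header and contents here */
	}
	return 0;
}

/* Step 2: a vma only records the objref of an already known inode. */
int dump_vma(struct objhash *h, struct toy_vma *vma)
{
	int first, objref;

	objref = obj_lookup_add(h, vma->inode, &first);
	if (objref < 0)
		return objref;
	if (first)		/* not seen in step 1: namespace changed */
		return -EBUSY;
	vma->objref = objref;
	return 0;
}
```

A caller would run dump_ipc_ns() once over all shm objects and then
dump_vma() for each mapping; a mapping of a segment created after
step 1 is the case that fails.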

Handling of shm objects that have been deleted (via IPC_RMID) is left
to a later patch in this series.

Changelog[v18]:
  - Collect files used by shm objects
  - Use file instead of inode as shared object during checkpoint
Changelog[v17]:
  - Restore objects in the right namespace
  - Properly initialize ctx->deferqueue
  - Fix compilation with CONFIG_CHECKPOINT=n

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c          |    5 +
 checkpoint/memory.c              |   28 ++++-
 checkpoint/restart.c             |    5 +
 checkpoint/sys.c                 |    7 +
 include/linux/checkpoint.h       |   10 ++
 include/linux/checkpoint_hdr.h   |   19 +++-
 include/linux/checkpoint_types.h |    1 +
 include/linux/shm.h              |   15 ++
 ipc/Makefile                     |    2 +-
 ipc/checkpoint.c                 |   25 +++-
 ipc/checkpoint_shm.c             |  277 ++++++++++++++++++++++++++++++++++++++
 ipc/shm.c                        |   84 +++++++++++-
 ipc/util.h                       |    9 ++
 kernel/nsproxy.c                 |    8 +
 mm/shmem.c                       |    2 +-
 15 files changed, 482 insertions(+), 15 deletions(-)
 create mode 100644 ipc/checkpoint_shm.c

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e35c5a8..27eec5c 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -24,6 +24,7 @@
 #include <linux/utsname.h>
 #include <linux/magic.h>
 #include <linux/hrtimer.h>
+#include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -793,6 +794,10 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		goto out;
 
+	ret = deferqueue_run(ctx->deferqueue);  /* run deferred work */
+	if (ret < 0)
+		goto out;
+
 	/* verify that all objects were indeed visited */
 	if (!ckpt_obj_visited(ctx)) {
 		ckpt_write_err(ctx, "E", "leak: unvisited", -EBUSY);
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index f765993..0da948f 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -20,6 +20,7 @@
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
+#include <linux/shm.h>
 #include <linux/proc_fs.h>
 #include <linux/swap.h>
 #include <linux/checkpoint.h>
@@ -459,9 +460,9 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
  * virtual addresses into ctx->pgarr_list page-array chain. Then dump
  * the addresses, followed by the page contents.
  */
-static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
-				      struct vm_area_struct *vma,
-				      struct inode *inode)
+int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+			       struct vm_area_struct *vma,
+			       struct inode *inode)
 {
 	struct ckpt_hdr_pgarr *h;
 	unsigned long addr, end;
@@ -1084,6 +1085,13 @@ static int anon_private_restore(struct ckpt_ctx *ctx,
 	return private_vma_restore(ctx, mm, NULL, h);
 }
 
+static int bad_vma_restore(struct ckpt_ctx *ctx,
+			   struct mm_struct *mm,
+			   struct ckpt_hdr_vma *h)
+{
+	return -EINVAL;
+}
+
 /* callbacks to restore vma per its type: */
 struct restore_vma_ops {
 	char *vma_name;
@@ -1136,6 +1144,20 @@ static struct restore_vma_ops restore_vma_ops[] = {
 		.vma_type = CKPT_VMA_SHM_FILE,
 		.restore = filemap_restore,
 	},
+	/* sysvipc shared */
+	{
+		.vma_name = "IPC SHARED",
+		.vma_type = CKPT_VMA_SHM_IPC,
+		/* ipc inode itself is restored by restore_ipc_ns()... */
+		.restore = bad_vma_restore,
+
+	},
+	/* sysvipc shared (skip) */
+	{
+		.vma_name = "IPC SHARED (skip)",
+		.vma_type = CKPT_VMA_SHM_IPC_SKIP,
+		.restore = ipcshm_restore,
+	},
 };
 
 /**
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index e48ad68..53a759a 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -21,6 +21,7 @@
 #include <linux/utsname.h>
 #include <asm/syscall.h>
 #include <linux/elf.h>
+#include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -881,6 +882,10 @@ static int do_restore_coord(struct ckpt_ctx *ctx, pid_t pid)
 			goto out;
 	}
 
+	ret = deferqueue_run(ctx->deferqueue);  /* run deferred work */
+	if (ret < 0)
+		goto out;
+
 	ret = restore_read_tail(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 1c98eee..d6a1650 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -21,6 +21,7 @@
 #include <linux/uaccess.h>
 #include <linux/capability.h>
 #include <linux/checkpoint.h>
+#include <linux/deferqueue.h>
 
 /*
  * ckpt_unpriv_allowed - sysctl controlled, do not allow checkpoints or
@@ -191,6 +192,9 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	BUG_ON(atomic_read(&ctx->refcount));
 
+	if (ctx->deferqueue)
+		deferqueue_destroy(ctx->deferqueue);
+
 	if (ctx->files_deferq)
 		deferqueue_destroy(ctx->files_deferq);
 
@@ -243,6 +247,9 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	err = -ENOMEM;
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
+	ctx->deferqueue = deferqueue_create();
+	if (!ctx->deferqueue)
+		goto err;
 
 	ctx->files_deferq = deferqueue_create();
 	if (!ctx->files_deferq)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index c8501b8..4c1c13e 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -158,9 +158,16 @@ extern void *restore_uts_ns(struct ckpt_ctx *ctx);
 #ifdef CONFIG_SYSVIPC
 extern int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_ipc_ns(struct ckpt_ctx *ctx);
+extern int ckpt_collect_ipc_ns(struct ckpt_ctx *ctx,
+			       struct ipc_namespace *ipc_ns);
 #else
 #define checkpoint_ipc_ns  checkpoint_bad
 #define restore_ipc_ns  restore_bad
+static inline int ckpt_collect_ipc_ns(struct ckpt_ctx *ctx,
+				      struct ipc_namespace *ipc_ns)
+{
+	return 0;
+}
 #endif /* CONFIG_SYSVIPC */
 
 /* file table */
@@ -215,6 +222,9 @@ extern unsigned long generic_vma_restore(struct mm_struct *mm,
 extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
 
+extern int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma,
+				      struct inode *inode);
 extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
 
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 1ecec60..a4cf7a9 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -314,7 +314,9 @@ enum vma_type {
 	CKPT_VMA_SHM_ANON,	/* shared anonymous */
 	CKPT_VMA_SHM_ANON_SKIP,	/* shared anonymous (skip contents) */
 	CKPT_VMA_SHM_FILE,	/* shared mapped file, only msync */
-	CKPT_VMA_MAX
+	CKPT_VMA_SHM_IPC,	/* shared sysvipc */
+	CKPT_VMA_SHM_IPC_SKIP,	/* shared sysvipc (skip contents) */
+	CKPT_VMA_MAX,
 };
 
 /* vma descriptor */
@@ -364,6 +366,7 @@ struct ckpt_hdr_ipc {
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_ipc_perms {
+	struct ckpt_hdr h;
 	__s32 id;
 	__u32 key;
 	__u32 uid;
@@ -375,6 +378,20 @@ struct ckpt_hdr_ipc_perms {
 	__u64 seq;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_shm {
+	struct ckpt_hdr h;
+	struct ckpt_hdr_ipc_perms perms;
+	__u64 shm_segsz;
+	__u64 shm_atim;
+	__u64 shm_dtim;
+	__u64 shm_ctim;
+	__s32 shm_cprid;
+	__s32 shm_lprid;
+	__u32 mlock_uid;
+	__u32 flags;
+	__u32 objref;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 7e6051f..9632aa4 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -48,6 +48,7 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *deferqueue;	/* deferred c/r work */
 	struct deferqueue_head *files_deferq;	/* deferred file-table work */
 
 	struct path fs_mnt;     /* container root (FIXME) */
diff --git a/include/linux/shm.h b/include/linux/shm.h
index eca6235..94ac1a7 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -118,6 +118,21 @@ static inline int is_file_shm_hugepages(struct file *file)
 }
 #endif
 
+struct ipc_namespace;
+extern int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+		       struct shmid_ds __user *buf, int version);
+
+#ifdef CONFIG_CHECKPOINT
+#ifdef CONFIG_SYSVIPC
+struct ckpt_ctx;
+struct ckpt_hdr_vma;
+extern int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			  struct ckpt_hdr_vma *h);
+#else
+#define ipcshm_restore NULL
+#endif
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_SHM_H_ */
diff --git a/ipc/Makefile b/ipc/Makefile
index b747127..db4b076 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,4 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o checkpoint_shm.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 4eb1a97..b360861 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -113,9 +113,9 @@ static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
 	if (ret < 0)
 		return ret;
 
-#if 0 /* NEXT FEW PATCHES */
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
@@ -134,6 +134,27 @@ int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr)
 }
 
 /**************************************************************************
+ * Collect
+ */
+
+int ckpt_collect_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
+{
+	struct ipc_ids *ipc_ids;
+	int ret;
+
+	/*
+	 * Each shm object holds a reference to a file pointer, so
+	 * collect them. Nothing to do for msg and sem.
+	 */
+	ipc_ids = &ipc_ns->ids[IPC_SHM_IDS];
+	down_read(&ipc_ids->rw_mutex);
+	ret = idr_for_each(&ipc_ids->ipcs_idr, ckpt_collect_ipc_shm, ctx);
+	up_read(&ipc_ids->rw_mutex);
+
+	return ret;
+}
+
+/**************************************************************************
  * Restart
  */
 
@@ -286,9 +307,9 @@ static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
 	get_ipc_ns(ipc_ns);
 #endif
 
-#if 0 /* NEXT FEW PATCHES */
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
diff --git a/ipc/checkpoint_shm.c b/ipc/checkpoint_shm.c
new file mode 100644
index 0000000..826e430
--- /dev/null
+++ b/ipc/checkpoint_shm.c
@@ -0,0 +1,277 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc shm
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/shm.h>
+#include <linux/shmem_fs.h>
+#include <linux/hugetlb.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+#include <linux/deferqueue.h>
+
+#include <linux/msg.h>	/* needed for util.h that uses 'struct msg_msg' */
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_shm_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_shm *h,
+			    struct shmid_kernel *shp)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&shp->shm_perm);
+
+	ret = checkpoint_fill_ipc_perms(&h->perms, &shp->shm_perm);
+	if (ret < 0)
+		goto unlock;
+
+	h->shm_segsz = shp->shm_segsz;
+	h->shm_atim = shp->shm_atim;
+	h->shm_dtim = shp->shm_dtim;
+	h->shm_ctim = shp->shm_ctim;
+	h->shm_cprid = shp->shm_cprid;
+	h->shm_lprid = shp->shm_lprid;
+
+	if (shp->mlock_user)
+		h->mlock_uid = shp->mlock_user->uid;
+	else
+		h->mlock_uid = (unsigned int) -1;
+
+	h->flags = 0;
+	/* check if shm was set up with SHM_NORESERVE */
+	if (SHMEM_I(shp->shm_file->f_dentry->d_inode)->flags & VM_NORESERVE)
+		h->flags |= SHM_NORESERVE;
+	/* check if shm was set up with SHM_HUGETLB (not yet supported) */
+	if (is_file_hugepages(shp->shm_file)) {
+		pr_warning("c/r: unsupported SHM_HUGETLB\n");
+		ret = -ENOSYS;
+	}
+
+ unlock:
+	ipc_unlock(&shp->shm_perm);
+	ckpt_debug("shm: cprid %d lprid %d segsz %lld mlock %d\n",
+		 h->shm_cprid, h->shm_lprid, h->shm_segsz, h->mlock_uid);
+
+	return ret;
+}
+
+int checkpoint_ipc_shm(int id, void *p, void *data)
+{
+	struct ckpt_hdr_ipc_shm *h;
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct shmid_kernel *shp;
+	struct inode *inode;
+	int first, objref;
+	int ret;
+
+	shp = container_of(perm, struct shmid_kernel, shm_perm);
+	inode = shp->shm_file->f_dentry->d_inode;
+
+	/* we collected the file but we don't checkpoint it per-se */
+	ret = ckpt_obj_visit(ctx, shp->shm_file, CKPT_OBJ_FILE);
+	if (ret < 0)
+		return ret;
+
+	objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+	if (objref < 0)
+		return objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_SHM);
+	if (!h)
+		return -ENOMEM;
+
+	ret = fill_ipc_shm_hdr(ctx, h, shp);
+	if (ret < 0)
+		goto out;
+
+	h->objref = objref;
+	ckpt_debug("shm: objref %d\n", h->objref);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	ret = checkpoint_memory_contents(ctx, NULL, inode);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/************************************************************************
+ * ipc collect
+ */
+int ckpt_collect_ipc_shm(int id, void *p, void *data)
+{
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct shmid_kernel *shp;
+
+	shp = container_of(perm, struct shmid_kernel, shm_perm);
+	return ckpt_collect_file(ctx, shp->shm_file);
+}
+
+/************************************************************************
+ * ipc restart
+ */
+
+struct dq_ipcshm_del {
+	/*
+	 * XXX: always keep ->ipcns first so that put_ipc_ns() can
+	 * be safely provided as the dtor for this deferqueue object
+	 */
+	struct ipc_namespace *ipcns;
+	int id;
+};
+
+static int ipc_shm_delete(void *data)
+{
+	struct dq_ipcshm_del *dq = (struct dq_ipcshm_del *) data;
+	mm_segment_t old_fs;
+	int ret;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	ret = shmctl_down(dq->ipcns, dq->id, IPC_RMID, NULL, 0);
+	set_fs(old_fs);
+
+	put_ipc_ns(dq->ipcns);
+	return ret;
+}
+
+static int load_ipc_shm_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_shm *h,
+			    struct shmid_kernel *shp)
+{
+	int ret;
+
+	ret = restore_load_ipc_perms(&h->perms, &shp->shm_perm);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("shm: cprid %d lprid %d segsz %lld mlock %d\n",
+		 h->shm_cprid, h->shm_lprid, h->shm_segsz, h->mlock_uid);
+
+	if (h->shm_cprid < 0 || h->shm_lprid < 0)
+		return -EINVAL;
+
+	shp->shm_segsz = h->shm_segsz;
+	shp->shm_atim = h->shm_atim;
+	shp->shm_dtim = h->shm_dtim;
+	shp->shm_ctim = h->shm_ctim;
+	shp->shm_cprid = h->shm_cprid;
+	shp->shm_lprid = h->shm_lprid;
+
+	return 0;
+}
+
+int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
+{
+	struct ckpt_hdr_ipc_shm *h;
+	struct kern_ipc_perm *perms;
+	struct shmid_kernel *shp;
+	struct ipc_ids *shm_ids = &ns->ids[IPC_SHM_IDS];
+	struct file *file;
+	int shmflag;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_SHM);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->perms.id < 0)
+		goto out;
+
+#define CKPT_SHMFL_MASK  (SHM_NORESERVE | SHM_HUGETLB)
+	if (h->flags & ~CKPT_SHMFL_MASK)
+		goto out;
+
+	ret = -ENOSYS;
+	if (h->mlock_uid != (unsigned int) -1)	/* FIXME: support SHM_LOCK */
+		goto out;
+	if (h->flags & SHM_HUGETLB)	/* FIXME: support SHM_HUGETLB */
+		goto out;
+
+	/*
+	 * SHM_DEST means that the shm is to be deleted after creation.
+	 * However, deleting before it's actually attached is quite silly.
+	 * Instead, we defer this task until restart has succeeded.
+	 */
+	if (h->perms.mode & SHM_DEST) {
+		struct dq_ipcshm_del dq;
+
+		/* to not confuse the rest of the code */
+		h->perms.mode &= ~SHM_DEST;
+
+		dq.id = h->perms.id;
+		dq.ipcns = ns;
+		get_ipc_ns(dq.ipcns);
+
+		/* XXX can safely use put_ipc_ns() as dtor, see above */
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     (deferqueue_func_t) ipc_shm_delete,
+				     (deferqueue_func_t) put_ipc_ns);
+		if (ret < 0)
+			goto out;
+	}
+
+	shmflag = h->flags | h->perms.mode | IPC_CREAT | IPC_EXCL;
+	ckpt_debug("shm: do_shmget size %lld flag %#x id %d\n",
+		 h->shm_segsz, shmflag, h->perms.id);
+	ret = do_shmget(ns, h->perms.key, h->shm_segsz, shmflag, h->perms.id);
+	ckpt_debug("shm: do_shmget ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	down_write(&shm_ids->rw_mutex);
+
+	/* we are the sole owners/users of this ipc_ns, it can't go away */
+	perms = ipc_lock(shm_ids, h->perms.id);
+	BUG_ON(IS_ERR(perms));  /* ipc_ns is private to us */
+
+	shp = container_of(perms, struct shmid_kernel, shm_perm);
+	file = shp->shm_file;
+	get_file(file);
+
+	ret = load_ipc_shm_hdr(ctx, h, shp);
+	if (ret < 0)
+		goto mutex;
+
+	/* deposit in objhash and read contents in */
+	ret = ckpt_obj_insert(ctx, file, h->objref, CKPT_OBJ_FILE);
+	if (ret < 0)
+		goto mutex;
+	ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
+ mutex:
+	fput(file);
+	if (ret < 0) {
+		ckpt_debug("shm: need to remove (%d)\n", ret);
+		do_shm_rmid(ns, perms);
+	} else
+		ipc_unlock(perms);
+	up_write(&shm_ids->rw_mutex);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/shm.c b/ipc/shm.c
index 749dd57..26f9253 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -40,6 +40,7 @@
 #include <linux/mount.h>
 #include <linux/ipc_namespace.h>
 #include <linux/ima.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 
@@ -305,6 +306,74 @@ int is_file_shm_hugepages(struct file *file)
 	return ret;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int ipcshm_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	int ino_objref;
+	int first;
+
+	ino_objref = ckpt_obj_lookup_add(ctx, vma->vm_file->f_dentry->d_inode,
+				       CKPT_OBJ_INODE, &first);
+	if (ino_objref < 0)
+		return ino_objref;
+
+	/*
+	 * This shouldn't happen, because all IPC regions should have
+	 * already been dumped by now via the ipc namespaces. If it does,
+	 * the ipc_ns was modified during the checkpoint.
+	 */
+	if (first)
+		return -EBUSY;
+
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_SHM_IPC_SKIP,
+				      0, ino_objref);
+}
+
+int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+		   struct ckpt_hdr_vma *h)
+{
+	struct file *file;
+	int shmid, shmflg = 0;
+	mm_segment_t old_fs;
+	unsigned long start;
+	unsigned long addr;
+	int ret;
+
+	if (!h->ino_objref)
+		return -EINVAL;
+	/* FIX: verify the vm_flags too */
+
+	file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	shmid = file->f_dentry->d_inode->i_ino;
+
+	if (!(h->vm_flags & VM_WRITE))
+		shmflg |= SHM_RDONLY;
+
+	/*
+	 * FIX: do_shmat() has limited interface: all-or-nothing
+	 * mapping. If the vma, however, reflects a partial mapping
+	 * then we need to modify that function to accomplish the
+	 * desired outcome.  A partial mapping can exist if the user
+	 * calls shmat() and then unmaps part of the region.
+	 * Currently, we at least detect this and fail with -ENOSYS.
+	 */
+	if (((h->vm_end - h->vm_start) != h->ino_size) || h->vm_pgoff)
+		return -ENOSYS;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	start = h->vm_start;
+	ret = do_shmat(shmid, (char __user *) start, shmflg, &addr);
+	set_fs(old_fs);
+
+	BUG_ON(ret >= 0 && addr != h->vm_start);
+	return ret;
+}
+#endif
+
 static const struct file_operations shm_file_operations = {
 	.mmap		= shm_mmap,
 	.fsync		= shm_fsync,
@@ -320,6 +389,9 @@ static struct vm_operations_struct shm_vm_ops = {
 	.set_policy = shm_set_policy,
 	.get_policy = shm_get_policy,
 #endif
+#if defined(CONFIG_CHECKPOINT)
+	.checkpoint = ipcshm_checkpoint,
+#endif
 };
 
 /**
@@ -447,14 +519,12 @@ static inline int shm_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-int do_shmget(key_t key, size_t size, int shmflg, int req_id)
+int do_shmget(struct ipc_namespace *ns, key_t key, size_t size,
+	      int shmflg, int req_id)
 {
-	struct ipc_namespace *ns;
 	struct ipc_ops shm_ops;
 	struct ipc_params shm_params;
 
-	ns = current->nsproxy->ipc_ns;
-
 	shm_ops.getnew = newseg;
 	shm_ops.associate = shm_security;
 	shm_ops.more_checks = shm_more_checks;
@@ -468,7 +538,7 @@ int do_shmget(key_t key, size_t size, int shmflg, int req_id)
 
 SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
 {
-	return do_shmget(key, size, shmflg, -1);
+	return do_shmget(current->nsproxy->ipc_ns, key, size, shmflg, -1);
 }
 
 static inline unsigned long copy_shmid_to_user(void __user *buf, struct shmid64_ds *in, int version)
@@ -599,8 +669,8 @@ static void shm_get_stat(struct ipc_namespace *ns, unsigned long *rss,
  * to be held in write mode.
  * NOTE: no locks must be held, the rw_mutex is taken inside this function.
  */
-static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
-		       struct shmid_ds __user *buf, int version)
+int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+		struct shmid_ds __user *buf, int version)
 {
 	struct kern_ipc_perm *ipcp;
 	struct shmid64_ds shmid64;
diff --git a/ipc/util.h b/ipc/util.h
index 8ae1f8e..e0007dc 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -178,11 +178,20 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 
 struct ipc_namespace *create_ipc_ns(void);
 
+int do_shmget(struct ipc_namespace *ns, key_t key, size_t size, int shmflg,
+	      int req_id);
+void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
+
+
 #ifdef CONFIG_CHECKPOINT
 extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 				     struct kern_ipc_perm *perm);
 extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 				  struct kern_ipc_perm *perm);
+
+extern int ckpt_collect_ipc_shm(int id, void *p, void *data);
+extern int checkpoint_ipc_shm(int id, void *p, void *data);
+extern int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
 #endif
 
 #endif
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index fddc724..e7aaa00 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -249,6 +249,14 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (ret < 0)
 		goto out;
 	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
+	if (ret < 0)
+		goto out;
+	/*
+	 * ipc_ns (shm) may keep references to files: if this is the
+	 * first time we see this ipc_ns (ret > 0), proceed inside.
+	 */
+	if (ret)
+		ret = ckpt_collect_ipc_ns(ctx, nsproxy->ipc_ns);
 
 	/* TODO: collect other namespaces here */
  out:
diff --git a/mm/shmem.c b/mm/shmem.c
index d1e348f..2cfff8d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2387,7 +2387,7 @@ static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 {
 	enum vma_type vma_type;
 	int ino_objref;
-	int first;
+	int first, ret;
 
 	/* should be private anonymous ... verify that this is the case */
 	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 52/80] c/r: support share-memory sysv-ipc
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Checkpoint of sysvipc shared memory is performed in two steps: first,
the entire ipc namespace is dumped as a whole by iterating through all
shm objects and dumping the contents of each one. The shmem inode is
registered in the objhash. Second, for each vma that refers to ipc
shared memory we find the inode in the objhash, and save the objref.

(If we find a new inode in step 2, the ipc namespace was not entirely
frozen and someone must have manipulated it since step 1; the
checkpoint then fails with -EBUSY.)
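
The two-step scheme above can be modelled in user space as follows.
This is a minimal sketch under stated assumptions, not the code in this
patch: struct objhash, obj_lookup_add(), dump_namespace() and dump_vma()
are hypothetical stand-ins for the objhash, ckpt_obj_lookup_add(), the
walk over the namespace's shm objects, and the per-vma checkpoint
callback, respectively.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_OBJS 16

/* Toy stand-in for the objhash: maps an object pointer to an objref. */
struct objhash {
	const void *objs[MAX_OBJS];
	int count;
};

/* Hypothetical types, only here to make the sketch self-contained. */
struct toy_inode { int ino; };
struct toy_vma { const struct toy_inode *inode; };

/*
 * Return the 1-based objref of ptr, adding it to the hash if absent.
 * *first is set to 1 when the object was newly added, 0 otherwise.
 */
int obj_lookup_add(struct objhash *hash, const void *ptr, int *first)
{
	int i;

	assert(hash && first);
	for (i = 0; i < hash->count; i++) {
		if (hash->objs[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (hash->count == MAX_OBJS)
		return -ENOMEM;
	hash->objs[hash->count++] = ptr;
	*first = 1;
	return hash->count;
}

/* Step 1: walk every segment of the namespace and register its inode. */
int dump_namespace(struct objhash *hash, const struct toy_inode *segs, int n)
{
	int i, first, ref;

	for (i = 0; i < n; i++) {
		ref = obj_lookup_add(hash, &segs[i], &first);
		if (ref < 0)
			return ref;
		/* real code would write the header and contents here */
	}
	return 0;
}

/*
 * Step 2: a vma only records the objref of an inode seen in step 1.
 * A newly added inode means the namespace changed after step 1.
 */
int dump_vma(struct objhash *hash, const struct toy_vma *vma)
{
	int first;
	int ref = obj_lookup_add(hash, vma->inode, &first);

	if (ref < 0)
		return ref;
	if (first)
		return -EBUSY;
	return ref;
}
```

In this model, a vma whose inode was registered by dump_namespace()
gets back that inode's objref, while a vma backed by an inode created
after step 1 makes dump_vma() return -EBUSY.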

Handling of shm objects that have been deleted (via IPC_RMID) is left
to a later patch in this series.

Changelog[v18]:
  - Collect files used by shm objects
  - Use file instead of inode as shared object during checkpoint
Changelog[v17]:
  - Restore objects in the right namespace
  - Properly initialize ctx->deferqueue
  - Fix compilation with CONFIG_CHECKPOINT=n

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c          |    5 +
 checkpoint/memory.c              |   28 ++++-
 checkpoint/restart.c             |    5 +
 checkpoint/sys.c                 |    7 +
 include/linux/checkpoint.h       |   10 ++
 include/linux/checkpoint_hdr.h   |   19 +++-
 include/linux/checkpoint_types.h |    1 +
 include/linux/shm.h              |   15 ++
 ipc/Makefile                     |    2 +-
 ipc/checkpoint.c                 |   25 +++-
 ipc/checkpoint_shm.c             |  277 ++++++++++++++++++++++++++++++++++++++
 ipc/shm.c                        |   84 +++++++++++-
 ipc/util.h                       |    9 ++
 kernel/nsproxy.c                 |    8 +
 mm/shmem.c                       |    2 +-
 15 files changed, 482 insertions(+), 15 deletions(-)
 create mode 100644 ipc/checkpoint_shm.c

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e35c5a8..27eec5c 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -24,6 +24,7 @@
 #include <linux/utsname.h>
 #include <linux/magic.h>
 #include <linux/hrtimer.h>
+#include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -793,6 +794,10 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		goto out;
 
+	ret = deferqueue_run(ctx->deferqueue);  /* run deferred work */
+	if (ret < 0)
+		goto out;
+
 	/* verify that all objects were indeed visited */
 	if (!ckpt_obj_visited(ctx)) {
 		ckpt_write_err(ctx, "E", "leak: unvisited", -EBUSY);
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index f765993..0da948f 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -20,6 +20,7 @@
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
+#include <linux/shm.h>
 #include <linux/proc_fs.h>
 #include <linux/swap.h>
 #include <linux/checkpoint.h>
@@ -459,9 +460,9 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
  * virtual addresses into ctx->pgarr_list page-array chain. Then dump
  * the addresses, followed by the page contents.
  */
-static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
-				      struct vm_area_struct *vma,
-				      struct inode *inode)
+int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+			       struct vm_area_struct *vma,
+			       struct inode *inode)
 {
 	struct ckpt_hdr_pgarr *h;
 	unsigned long addr, end;
@@ -1084,6 +1085,13 @@ static int anon_private_restore(struct ckpt_ctx *ctx,
 	return private_vma_restore(ctx, mm, NULL, h);
 }
 
+static int bad_vma_restore(struct ckpt_ctx *ctx,
+			   struct mm_struct *mm,
+			   struct ckpt_hdr_vma *h)
+{
+	return -EINVAL;
+}
+
 /* callbacks to restore vma per its type: */
 struct restore_vma_ops {
 	char *vma_name;
@@ -1136,6 +1144,20 @@ static struct restore_vma_ops restore_vma_ops[] = {
 		.vma_type = CKPT_VMA_SHM_FILE,
 		.restore = filemap_restore,
 	},
+	/* sysvipc shared */
+	{
+		.vma_name = "IPC SHARED",
+		.vma_type = CKPT_VMA_SHM_IPC,
+		/* ipc inode itself is restored by restore_ipc_ns()... */
+		.restore = bad_vma_restore,
+
+	},
+	/* sysvipc shared (skip) */
+	{
+		.vma_name = "IPC SHARED (skip)",
+		.vma_type = CKPT_VMA_SHM_IPC_SKIP,
+		.restore = ipcshm_restore,
+	},
 };
 
 /**
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index e48ad68..53a759a 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -21,6 +21,7 @@
 #include <linux/utsname.h>
 #include <asm/syscall.h>
 #include <linux/elf.h>
+#include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -881,6 +882,10 @@ static int do_restore_coord(struct ckpt_ctx *ctx, pid_t pid)
 			goto out;
 	}
 
+	ret = deferqueue_run(ctx->deferqueue);  /* run deferred work */
+	if (ret < 0)
+		goto out;
+
 	ret = restore_read_tail(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 1c98eee..d6a1650 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -21,6 +21,7 @@
 #include <linux/uaccess.h>
 #include <linux/capability.h>
 #include <linux/checkpoint.h>
+#include <linux/deferqueue.h>
 
 /*
  * ckpt_unpriv_allowed - sysctl controlled, do not allow checkpoints or
@@ -191,6 +192,9 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
 	BUG_ON(atomic_read(&ctx->refcount));
 
+	if (ctx->deferqueue)
+		deferqueue_destroy(ctx->deferqueue);
+
 	if (ctx->files_deferq)
 		deferqueue_destroy(ctx->files_deferq);
 
@@ -243,6 +247,9 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	err = -ENOMEM;
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
+	ctx->deferqueue = deferqueue_create();
+	if (!ctx->deferqueue)
+		goto err;
 
 	ctx->files_deferq = deferqueue_create();
 	if (!ctx->files_deferq)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index c8501b8..4c1c13e 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -158,9 +158,16 @@ extern void *restore_uts_ns(struct ckpt_ctx *ctx);
 #ifdef CONFIG_SYSVIPC
 extern int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_ipc_ns(struct ckpt_ctx *ctx);
+extern int ckpt_collect_ipc_ns(struct ckpt_ctx *ctx,
+			       struct ipc_namespace *ipc_ns);
 #else
 #define checkpoint_ipc_ns  checkpoint_bad
 #define restore_ipc_ns  restore_bad
+static inline int ckpt_collect_ipc_ns(struct ckpt_ctx *ctx,
+				      struct ipc_namespace *ipc_ns)
+{
+	return 0;
+}
 #endif /* CONFIG_SYSVIPC */
 
 /* file table */
@@ -215,6 +222,9 @@ extern unsigned long generic_vma_restore(struct mm_struct *mm,
 extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
 
+extern int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+				      struct vm_area_struct *vma,
+				      struct inode *inode);
 extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
 
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 1ecec60..a4cf7a9 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -314,7 +314,9 @@ enum vma_type {
 	CKPT_VMA_SHM_ANON,	/* shared anonymous */
 	CKPT_VMA_SHM_ANON_SKIP,	/* shared anonymous (skip contents) */
 	CKPT_VMA_SHM_FILE,	/* shared mapped file, only msync */
-	CKPT_VMA_MAX
+	CKPT_VMA_SHM_IPC,	/* shared sysvipc */
+	CKPT_VMA_SHM_IPC_SKIP,	/* shared sysvipc (skip contents) */
+	CKPT_VMA_MAX,
 };
 
 /* vma descriptor */
@@ -364,6 +366,7 @@ struct ckpt_hdr_ipc {
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_ipc_perms {
+	struct ckpt_hdr h;
 	__s32 id;
 	__u32 key;
 	__u32 uid;
@@ -375,6 +378,20 @@ struct ckpt_hdr_ipc_perms {
 	__u64 seq;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_shm {
+	struct ckpt_hdr h;
+	struct ckpt_hdr_ipc_perms perms;
+	__u64 shm_segsz;
+	__u64 shm_atim;
+	__u64 shm_dtim;
+	__u64 shm_ctim;
+	__s32 shm_cprid;
+	__s32 shm_lprid;
+	__u32 mlock_uid;
+	__u32 flags;
+	__u32 objref;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 7e6051f..9632aa4 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -48,6 +48,7 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *deferqueue;	/* deferred c/r work */
 	struct deferqueue_head *files_deferq;	/* deferred file-table work */
 
 	struct path fs_mnt;     /* container root (FIXME) */
diff --git a/include/linux/shm.h b/include/linux/shm.h
index eca6235..94ac1a7 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -118,6 +118,21 @@ static inline int is_file_shm_hugepages(struct file *file)
 }
 #endif
 
+struct ipc_namespace;
+extern int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+		       struct shmid_ds __user *buf, int version);
+
+#ifdef CONFIG_CHECKPOINT
+#ifdef CONFIG_SYSVIPC
+struct ckpt_ctx;
+struct ckpt_hdr_vma;
+extern int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			  struct ckpt_hdr_vma *h);
+#else
+#define ipcshm_restore NULL
+#endif
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_SHM_H_ */
diff --git a/ipc/Makefile b/ipc/Makefile
index b747127..db4b076 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,4 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o checkpoint_shm.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 4eb1a97..b360861 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -113,9 +113,9 @@ static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
 	if (ret < 0)
 		return ret;
 
-#if 0 /* NEXT FEW PATCHES */
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
@@ -134,6 +134,27 @@ int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr)
 }
 
 /**************************************************************************
+ * Collect
+ */
+
+int ckpt_collect_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
+{
+	struct ipc_ids *ipc_ids;
+	int ret;
+
+	/*
+	 * Each shm object holds a reference to a file pointer, so
+	 * collect them. Nothing to do for msg and sem.
+	 */
+	ipc_ids = &ipc_ns->ids[IPC_SHM_IDS];
+	down_read(&ipc_ids->rw_mutex);
+	ret = idr_for_each(&ipc_ids->ipcs_idr, ckpt_collect_ipc_shm, ctx);
+	up_read(&ipc_ids->rw_mutex);
+
+	return ret;
+}
+
+/**************************************************************************
  * Restart
  */
 
@@ -286,9 +307,9 @@ static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
 	get_ipc_ns(ipc_ns);
 #endif
 
-#if 0 /* NEXT FEW PATCHES */
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
diff --git a/ipc/checkpoint_shm.c b/ipc/checkpoint_shm.c
new file mode 100644
index 0000000..826e430
--- /dev/null
+++ b/ipc/checkpoint_shm.c
@@ -0,0 +1,277 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc shm
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/shm.h>
+#include <linux/shmem_fs.h>
+#include <linux/hugetlb.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+#include <linux/deferqueue.h>
+
+#include <linux/msg.h>	/* needed for util.h that uses 'struct msg_msg' */
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_shm_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_shm *h,
+			    struct shmid_kernel *shp)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&shp->shm_perm);
+
+	ret = checkpoint_fill_ipc_perms(&h->perms, &shp->shm_perm);
+	if (ret < 0)
+		goto unlock;
+
+	h->shm_segsz = shp->shm_segsz;
+	h->shm_atim = shp->shm_atim;
+	h->shm_dtim = shp->shm_dtim;
+	h->shm_ctim = shp->shm_ctim;
+	h->shm_cprid = shp->shm_cprid;
+	h->shm_lprid = shp->shm_lprid;
+
+	if (shp->mlock_user)
+		h->mlock_uid = shp->mlock_user->uid;
+	else
+		h->mlock_uid = (unsigned int) -1;
+
+	h->flags = 0;
+	/* check if shm was setup with SHM_NORESERVE */
+	if (SHMEM_I(shp->shm_file->f_dentry->d_inode)->flags & VM_NORESERVE)
+		h->flags |= SHM_NORESERVE;
+	/* check if shm was setup with SHM_HUGETLB (not yet supported) */
+	if (is_file_hugepages(shp->shm_file)) {
+		pr_warning("c/r: unsupported SHM_HUGETLB\n");
+		ret = -ENOSYS;
+	}
+
+ unlock:
+	ipc_unlock(&shp->shm_perm);
+	ckpt_debug("shm: cprid %d lprid %d segsz %lld mlock %d\n",
+		 h->shm_cprid, h->shm_lprid, h->shm_segsz, h->mlock_uid);
+
+	return ret;
+}
+
+int checkpoint_ipc_shm(int id, void *p, void *data)
+{
+	struct ckpt_hdr_ipc_shm *h;
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct shmid_kernel *shp;
+	struct inode *inode;
+	int first, objref;
+	int ret;
+
+	shp = container_of(perm, struct shmid_kernel, shm_perm);
+	inode = shp->shm_file->f_dentry->d_inode;
+
+	/* we collected the file but we don't checkpoint it per-se */
+	ret = ckpt_obj_visit(ctx, shp->shm_file, CKPT_OBJ_FILE);
+	if (ret < 0)
+		return ret;
+
+	objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+	if (objref < 0)
+		return objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_SHM);
+	if (!h)
+		return -ENOMEM;
+
+	ret = fill_ipc_shm_hdr(ctx, h, shp);
+	if (ret < 0)
+		goto out;
+
+	h->objref = objref;
+	ckpt_debug("shm: objref %d\n", h->objref);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	ret = checkpoint_memory_contents(ctx, NULL, inode);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/************************************************************************
+ * ipc collect
+ */
+int ckpt_collect_ipc_shm(int id, void *p, void *data)
+{
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct shmid_kernel *shp;
+
+	shp = container_of(perm, struct shmid_kernel, shm_perm);
+	return ckpt_collect_file(ctx, shp->shm_file);
+}
+
+/************************************************************************
+ * ipc restart
+ */
+
+struct dq_ipcshm_del {
+	/*
+	 * XXX: always keep ->ipcns first so that put_ipc_ns() can
+	 * be safely provided as the dtor for this deferqueue object
+	 */
+	struct ipc_namespace *ipcns;
+	int id;
+};
+
+static int ipc_shm_delete(void *data)
+{
+	struct dq_ipcshm_del *dq = (struct dq_ipcshm_del *) data;
+	mm_segment_t old_fs;
+	int ret;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	ret = shmctl_down(dq->ipcns, dq->id, IPC_RMID, NULL, 0);
+	set_fs(old_fs);
+
+	put_ipc_ns(dq->ipcns);
+	return ret;
+}
+
+static int load_ipc_shm_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_shm *h,
+			    struct shmid_kernel *shp)
+{
+	int ret;
+
+	ret = restore_load_ipc_perms(&h->perms, &shp->shm_perm);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("shm: cprid %d lprid %d segsz %lld mlock %d\n",
+		 h->shm_cprid, h->shm_lprid, h->shm_segsz, h->mlock_uid);
+
+	if (h->shm_cprid < 0 || h->shm_lprid < 0)
+		return -EINVAL;
+
+	shp->shm_segsz = h->shm_segsz;
+	shp->shm_atim = h->shm_atim;
+	shp->shm_dtim = h->shm_dtim;
+	shp->shm_ctim = h->shm_ctim;
+	shp->shm_cprid = h->shm_cprid;
+	shp->shm_lprid = h->shm_lprid;
+
+	return 0;
+}
+
+int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
+{
+	struct ckpt_hdr_ipc_shm *h;
+	struct kern_ipc_perm *perms;
+	struct shmid_kernel *shp;
+	struct ipc_ids *shm_ids = &ns->ids[IPC_SHM_IDS];
+	struct file *file;
+	int shmflag;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_SHM);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->perms.id < 0)
+		goto out;
+
+#define CKPT_SHMFL_MASK  (SHM_NORESERVE | SHM_HUGETLB)
+	if (h->flags & ~CKPT_SHMFL_MASK)
+		goto out;
+
+	ret = -ENOSYS;
+	if (h->mlock_uid != (unsigned int) -1)	/* FIXME: support SHM_LOCK */
+		goto out;
+	if (h->flags & SHM_HUGETLB)	/* FIXME: support SHM_HUGETLB */
+		goto out;
+
+	/*
+	 * SHM_DEST means that the shm is to be deleted after creation.
+	 * However, deleting before it's actually attached is quite silly.
+	 * Instead, we defer this task until restart has succeeded.
+	 */
+	if (h->perms.mode & SHM_DEST) {
+		struct dq_ipcshm_del dq;
+
+		/* to not confuse the rest of the code */
+		h->perms.mode &= ~SHM_DEST;
+
+		dq.id = h->perms.id;
+		dq.ipcns = ns;
+		get_ipc_ns(dq.ipcns);
+
+		/* XXX can safely use put_ipc_ns() as dtor, see above */
+		ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+				     (deferqueue_func_t) ipc_shm_delete,
+				     (deferqueue_func_t) put_ipc_ns);
+		if (ret < 0)
+			goto out;
+	}
+
+	shmflag = h->flags | h->perms.mode | IPC_CREAT | IPC_EXCL;
+	ckpt_debug("shm: do_shmget size %lld flag %#x id %d\n",
+		 h->shm_segsz, shmflag, h->perms.id);
+	ret = do_shmget(ns, h->perms.key, h->shm_segsz, shmflag, h->perms.id);
+	ckpt_debug("shm: do_shmget ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	down_write(&shm_ids->rw_mutex);
+
+	/* we are the sole owners/users of this ipc_ns, it can't go away */
+	perms = ipc_lock(shm_ids, h->perms.id);
+	BUG_ON(IS_ERR(perms));  /* ipc_ns is private to us */
+
+	shp = container_of(perms, struct shmid_kernel, shm_perm);
+	file = shp->shm_file;
+	get_file(file);
+
+	ret = load_ipc_shm_hdr(ctx, h, shp);
+	if (ret < 0)
+		goto mutex;
+
+	/* deposit in objhash and read contents in */
+	ret = ckpt_obj_insert(ctx, file, h->objref, CKPT_OBJ_FILE);
+	if (ret < 0)
+		goto mutex;
+	ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
+ mutex:
+	fput(file);
+	if (ret < 0) {
+		ckpt_debug("shm: need to remove (%d)\n", ret);
+		do_shm_rmid(ns, perms);
+	} else
+		ipc_unlock(perms);
+	up_write(&shm_ids->rw_mutex);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/shm.c b/ipc/shm.c
index 749dd57..26f9253 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -40,6 +40,7 @@
 #include <linux/mount.h>
 #include <linux/ipc_namespace.h>
 #include <linux/ima.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 
@@ -305,6 +306,74 @@ int is_file_shm_hugepages(struct file *file)
 	return ret;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int ipcshm_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	int ino_objref;
+	int first;
+
+	ino_objref = ckpt_obj_lookup_add(ctx, vma->vm_file->f_dentry->d_inode,
+				       CKPT_OBJ_INODE, &first);
+	if (ino_objref < 0)
+		return ino_objref;
+
+	/*
+	 * This shouldn't happen, because all IPC regions should have
+	 * already been dumped by now via the ipc namespaces. If it does,
+	 * the ipc_ns was modified during the checkpoint.
+	 */
+	if (first)
+		return -EBUSY;
+
+	return generic_vma_checkpoint(ctx, vma, CKPT_VMA_SHM_IPC_SKIP,
+				      0, ino_objref);
+}
+
+int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+		   struct ckpt_hdr_vma *h)
+{
+	struct file *file;
+	int shmid, shmflg = 0;
+	mm_segment_t old_fs;
+	unsigned long start;
+	unsigned long addr;
+	int ret;
+
+	if (!h->ino_objref)
+		return -EINVAL;
+	/* FIX: verify the vm_flags too */
+
+	file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	shmid = file->f_dentry->d_inode->i_ino;
+
+	if (!(h->vm_flags & VM_WRITE))
+		shmflg |= SHM_RDONLY;
+
+	/*
+	 * FIX: do_shmat() has limited interface: all-or-nothing
+	 * mapping. If the vma, however, reflects a partial mapping
+	 * then we need to modify that function to accomplish the
+	 * desired outcome.  A partial mapping can exist if the user
+	 * calls shmat() and then unmaps part of the region.
+	 * Currently, we at least detect this and fail with -ENOSYS.
+	 */
+	if (((h->vm_end - h->vm_start) != h->ino_size) || h->vm_pgoff)
+		return -ENOSYS;
+
+	old_fs = get_fs();
+	set_fs(get_ds());
+	start = h->vm_start;
+	ret = do_shmat(shmid, (char __user *) start, shmflg, &addr);
+	set_fs(old_fs);
+
+	BUG_ON(ret >= 0 && addr != h->vm_start);
+	return ret;
+}
+#endif
+
 static const struct file_operations shm_file_operations = {
 	.mmap		= shm_mmap,
 	.fsync		= shm_fsync,
@@ -320,6 +389,9 @@ static struct vm_operations_struct shm_vm_ops = {
 	.set_policy = shm_set_policy,
 	.get_policy = shm_get_policy,
 #endif
+#if defined(CONFIG_CHECKPOINT)
+	.checkpoint = ipcshm_checkpoint,
+#endif
 };
 
 /**
@@ -447,14 +519,12 @@ static inline int shm_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-int do_shmget(key_t key, size_t size, int shmflg, int req_id)
+int do_shmget(struct ipc_namespace *ns, key_t key, size_t size,
+	      int shmflg, int req_id)
 {
-	struct ipc_namespace *ns;
 	struct ipc_ops shm_ops;
 	struct ipc_params shm_params;
 
-	ns = current->nsproxy->ipc_ns;
-
 	shm_ops.getnew = newseg;
 	shm_ops.associate = shm_security;
 	shm_ops.more_checks = shm_more_checks;
@@ -468,7 +538,7 @@ int do_shmget(key_t key, size_t size, int shmflg, int req_id)
 
 SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
 {
-	return do_shmget(key, size, shmflg, -1);
+	return do_shmget(current->nsproxy->ipc_ns, key, size, shmflg, -1);
 }
 
 static inline unsigned long copy_shmid_to_user(void __user *buf, struct shmid64_ds *in, int version)
@@ -599,8 +669,8 @@ static void shm_get_stat(struct ipc_namespace *ns, unsigned long *rss,
  * to be held in write mode.
  * NOTE: no locks must be held, the rw_mutex is taken inside this function.
  */
-static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
-		       struct shmid_ds __user *buf, int version)
+int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+		struct shmid_ds __user *buf, int version)
 {
 	struct kern_ipc_perm *ipcp;
 	struct shmid64_ds shmid64;
diff --git a/ipc/util.h b/ipc/util.h
index 8ae1f8e..e0007dc 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -178,11 +178,20 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 
 struct ipc_namespace *create_ipc_ns(void);
 
+int do_shmget(struct ipc_namespace *ns, key_t key, size_t size, int shmflg,
+	      int req_id);
+void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
+
+
 #ifdef CONFIG_CHECKPOINT
 extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 				     struct kern_ipc_perm *perm);
 extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 				  struct kern_ipc_perm *perm);
+
+extern int ckpt_collect_ipc_shm(int id, void *p, void *data);
+extern int checkpoint_ipc_shm(int id, void *p, void *data);
+extern int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
 #endif
 
 #endif
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index fddc724..e7aaa00 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -249,6 +249,14 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (ret < 0)
 		goto out;
 	ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
+	if (ret < 0)
+		goto out;
+	/*
+	 * ipc_ns (shm) may keep references to files: if this is the
+	 * first time we see this ipc_ns (ret > 0), proceed inside.
+	 */
+	if (ret)
+		ret = ckpt_collect_ipc_ns(ctx, nsproxy->ipc_ns);
 
 	/* TODO: collect other namespaces here */
  out:
diff --git a/mm/shmem.c b/mm/shmem.c
index d1e348f..2cfff8d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2387,7 +2387,7 @@ static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 {
 	enum vma_type vma_type;
 	int ino_objref;
-	int first;
+	int first, ret;
 
 	/* should be private anonymous ... verify that this is the case */
 	if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 53/80] c/r: support message-queues sysv-ipc
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (51 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 52/80] c/r: support share-memory sysv-ipc Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 54/80] c/r: support semaphore sysv-ipc Oren Laadan
                     ` (28 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Checkpoint of sysvipc message-queues is performed by iterating through
all 'msq' objects and dumping the contents of each one. The messages
queued on each 'msq' are dumped with that object.

The messages of a specific queue are written one by one. The queue lock
cannot be held while dumping them, but the loop must be protected from
anyone else writing to or reading from the queue. To do that we grab
the lock, then hijack the entire chain of messages from the queue, drop
the lock, and then safely dump them in a loop. Finally, with the lock
held, we re-attach the chain while verifying that there isn't other
(new) data on that queue.
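
The detach/dump/re-attach sequence described above can be sketched in
user space as follows. This is only an illustration under stated
assumptions: 'struct node', 'struct queue', queue_init(), queue_push(),
queue_dump() and sum_payload() are hypothetical names, a pthread mutex
stands in for the ipc lock, and a plain singly linked list stands in for
the kernel's list_head. It is not the code in ipc/checkpoint_msg.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for msg_msg and msg_queue. */
struct node {
	struct node *next;
	int payload;
};

struct queue {
	pthread_mutex_t lock;
	struct node *head;
	struct node **tail;	/* address of the last 'next' pointer */
};

static void queue_init(struct queue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->head = NULL;
	q->tail = &q->head;
}

static void queue_push(struct queue *q, struct node *n)
{
	n->next = NULL;
	pthread_mutex_lock(&q->lock);
	*q->tail = n;
	q->tail = &n->next;
	pthread_mutex_unlock(&q->lock);
}

/* Example visitor: adds each payload to the int that 'arg' points to. */
static void sum_payload(const struct node *n, void *arg)
{
	*(int *)arg += n->payload;
}

/*
 * Detach the whole chain under the lock, walk it with the lock dropped
 * (so the visitor is free to block), then put it back in front of
 * anything queued in the meantime. Returns the number of nodes visited.
 */
static int queue_dump(struct queue *q,
		      void (*visit)(const struct node *, void *), void *arg)
{
	struct node *chain, **chain_tail, *n;
	int count = 0;

	pthread_mutex_lock(&q->lock);
	chain = q->head;
	chain_tail = q->tail;
	q->head = NULL;
	q->tail = &q->head;
	pthread_mutex_unlock(&q->lock);

	if (!chain)
		return 0;

	for (n = chain; n; n = n->next) {
		visit(n, arg);
		count++;
	}

	pthread_mutex_lock(&q->lock);
	assert(*chain_tail == NULL);	/* detached chain is still terminated */
	*chain_tail = q->head;		/* newcomers go after the old chain */
	if (!q->head)
		q->tail = chain_tail;
	q->head = chain;
	pthread_mutex_unlock(&q->lock);

	return count;
}
```

The patch does the equivalent with list_splice_init() to take the
messages out and list_splice() to put them back, both under the ipc
lock, so the saved messages end up ahead of any that arrived while the
lock was dropped.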

Writing the message contents themselves is straightforward. The code
is similar to that in ipc/msgutil.c, the main difference being that
we deal with kernel memory and not user memory.
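
As a rough illustration of how a message body is walked, the sketch
below splits a payload into a head chunk followed by fixed-size
segments, the way the dump and restore loops step through DATALEN_MSG
and DATALEN_SEG. chunk_lengths() is a hypothetical helper written for
this note, and the limits passed to it are the caller's assumptions;
the real values depend on PAGE_SIZE and on the sizes of struct msg_msg
and struct msg_msgseg.

```c
#include <stddef.h>

/*
 * Split 'total' payload bytes into a head chunk of at most 'first_max'
 * bytes followed by segments of at most 'seg_max' bytes each. Up to
 * 'cap' chunk lengths are stored in 'out'. Returns the number of chunks
 * needed, or 0 if segments are required but 'seg_max' is zero.
 */
static size_t chunk_lengths(size_t total, size_t first_max, size_t seg_max,
			    size_t *out, size_t cap)
{
	size_t n = 0;
	size_t len = total < first_max ? total : first_max;

	if (total > first_max && seg_max == 0)
		return 0;

	/* the head chunk is always present, even for an empty message */
	if (n < cap)
		out[n] = len;
	n++;
	total -= len;

	while (total) {
		len = total < seg_max ? total : seg_max;
		if (n < cap)
			out[n] = len;
		n++;
		total -= len;
	}

	return n;
}
```

With assumed limits of 4048 bytes for the head and 4088 bytes per
segment (illustrative numbers only), a 10000-byte message would be
written as three buffers of 4048, 4088 and 1864 bytes.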

Changelog[v17]:
  - Allocate security context for msg_msg
  - Restore objects in the right namespace
  - Don't unlock ipc before freeing

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/checkpoint_hdr.h |   20 +++
 ipc/Makefile                   |    3 +-
 ipc/checkpoint.c               |    2 +-
 ipc/checkpoint_msg.c           |  364 ++++++++++++++++++++++++++++++++++++++++
 ipc/msg.c                      |   10 +-
 ipc/msgutil.c                  |    8 -
 ipc/util.h                     |   13 ++
 7 files changed, 403 insertions(+), 17 deletions(-)
 create mode 100644 ipc/checkpoint_msg.c

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index a4cf7a9..93b6aed 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -78,6 +78,7 @@ enum {
 	CKPT_HDR_IPC = 501,
 	CKPT_HDR_IPC_SHM,
 	CKPT_HDR_IPC_MSG,
+	CKPT_HDR_IPC_MSG_MSG,
 	CKPT_HDR_IPC_SEM,
 
 	CKPT_HDR_TAIL = 9001,
@@ -392,6 +393,25 @@ struct ckpt_hdr_ipc_shm {
 	__u32 objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_msg {
+	struct ckpt_hdr h;
+	struct ckpt_hdr_ipc_perms perms;
+	__u64 q_stime;
+	__u64 q_rtime;
+	__u64 q_ctime;
+	__u64 q_cbytes;
+	__u64 q_qnum;
+	__u64 q_qbytes;
+	__s32 q_lspid;
+	__s32 q_lrpid;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc_msg_msg {
+	struct ckpt_hdr h;
+	__s32 m_type;
+	__u32 m_ts;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/ipc/Makefile b/ipc/Makefile
index db4b076..71a257f 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o checkpoint_shm.o
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o \
+		checkpoint_shm.o checkpoint_msg.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index b360861..588ed37 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -115,11 +115,11 @@ static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
 
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
 				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
diff --git a/ipc/checkpoint_msg.c b/ipc/checkpoint_msg.c
new file mode 100644
index 0000000..b933c19
--- /dev/null
+++ b/ipc/checkpoint_msg.c
@@ -0,0 +1,364 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc msg
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/msg.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/security.h>
+#include <linux/ipc_namespace.h>
+
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_msg_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_msg *h,
+			    struct msg_queue *msq)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&msq->q_perm);
+
+	ret = checkpoint_fill_ipc_perms(&h->perms, &msq->q_perm);
+	if (ret < 0)
+		goto unlock;
+
+	h->q_stime = msq->q_stime;
+	h->q_rtime = msq->q_rtime;
+	h->q_ctime = msq->q_ctime;
+	h->q_cbytes = msq->q_cbytes;
+	h->q_qnum = msq->q_qnum;
+	h->q_qbytes = msq->q_qbytes;
+	h->q_lspid = msq->q_lspid;
+	h->q_lrpid = msq->q_lrpid;
+
+ unlock:
+	ipc_unlock(&msq->q_perm);
+	ckpt_debug("msg: lspid %d lrpid %d qnum %lld qbytes %lld\n",
+		 h->q_lspid, h->q_lrpid, h->q_qnum, h->q_qbytes);
+
+	return ret;
+}
+
+static int checkpoint_msg_contents(struct ckpt_ctx *ctx, struct msg_msg *msg)
+{
+	struct ckpt_hdr_ipc_msg_msg *h;
+	struct msg_msgseg *seg;
+	int total, len;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG_MSG);
+	if (!h)
+		return -ENOMEM;
+
+	h->m_type = msg->m_type;
+	h->m_ts = msg->m_ts;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	total = msg->m_ts;
+	len = min(total, (int) DATALEN_MSG);
+	ret = ckpt_write_buffer(ctx, (msg + 1), len);
+	if (ret < 0)
+		return ret;
+
+	seg = msg->next;
+	total -= len;
+
+	while (total) {
+		len = min(total, (int) DATALEN_SEG);
+		ret = ckpt_write_buffer(ctx, (seg + 1), len);
+		if (ret < 0)
+			break;
+		seg = seg->next;
+		total -= len;
+	}
+
+	return ret;
+}
+
+static int checkpoint_msg_queue(struct ckpt_ctx *ctx, struct msg_queue *msq)
+{
+	struct list_head messages;
+	struct msg_msg *msg;
+	int ret = -EBUSY;
+
+	/*
+	 * Scanning the msq requires the lock, but then we can't write
+	 * data out from inside. Instead, we grab the lock, remove all
+	 * messages to our own list, drop the lock, write the messages,
+	 * and finally re-attach them to the msq with the lock taken.
+	 */
+	INIT_LIST_HEAD(&messages);	/* must be valid on every exit path */
+	ipc_lock_by_ptr(&msq->q_perm);
+	if (!list_empty(&msq->q_receivers))
+		goto unlock;
+	if (!list_empty(&msq->q_senders))
+		goto unlock;
+	if (list_empty(&msq->q_messages))
+		goto unlock;
+	/* temporarily take out all messages */
+	list_splice_init(&msq->q_messages, &messages);
+ unlock:
+	ipc_unlock(&msq->q_perm);
+
+	list_for_each_entry(msg, &messages, m_list) {
+		ret = checkpoint_msg_contents(ctx, msg);
+		if (ret < 0)
+			break;
+	}
+
+	/* put all the messages back in */
+	ipc_lock_by_ptr(&msq->q_perm);
+	list_splice(&messages, &msq->q_messages);
+	ipc_unlock(&msq->q_perm);
+
+	return ret;
+}
+
+int checkpoint_ipc_msg(int id, void *p, void *data)
+{
+	struct ckpt_hdr_ipc_msg *h;
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct msg_queue *msq;
+	int ret;
+
+	msq = container_of(perm, struct msg_queue, q_perm);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG);
+	if (!h)
+		return -ENOMEM;
+
+	ret = fill_ipc_msg_hdr(ctx, h, msq);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	if (h->q_qnum)
+		ret = checkpoint_msg_queue(ctx, msq);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+
+/************************************************************************
+ * ipc restart
+ */
+
+static int load_ipc_msg_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_msg *h,
+			    struct msg_queue *msq)
+{
+	int ret = 0;
+
+	ret = restore_load_ipc_perms(&h->perms, &msq->q_perm);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("msq: lspid %d lrpid %d qnum %lld qbytes %lld\n",
+		 h->q_lspid, h->q_lrpid, h->q_qnum, h->q_qbytes);
+
+	if (h->q_lspid < 0 || h->q_lrpid < 0)
+		return -EINVAL;
+
+	msq->q_stime = h->q_stime;
+	msq->q_rtime = h->q_rtime;
+	msq->q_ctime = h->q_ctime;
+	msq->q_lspid = h->q_lspid;
+	msq->q_lrpid = h->q_lrpid;
+
+	return 0;
+}
+
+static struct msg_msg *restore_msg_contents_one(struct ckpt_ctx *ctx, int *clen)
+{
+	struct ckpt_hdr_ipc_msg_msg *h;
+	struct msg_msg *msg = NULL;
+	struct msg_msgseg *seg, **pseg;
+	int total, len;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG_MSG);
+	if (IS_ERR(h))
+		return (struct msg_msg *) h;
+
+	ret = -EINVAL;
+	if (h->m_type < 1)
+		goto out;
+	if (h->m_ts > current->nsproxy->ipc_ns->msg_ctlmax)
+		goto out;
+
+	total = h->m_ts;
+	len = min(total, (int) DATALEN_MSG);
+	msg = kmalloc(sizeof(*msg) + len, GFP_KERNEL);
+	if (!msg) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	msg->next = NULL;
+	pseg = &msg->next;
+
+	ret = _ckpt_read_buffer(ctx, (msg + 1), len);
+	if (ret < 0)
+		goto out;
+
+	total -= len;
+	while (total) {
+		len = min(total, (int) DATALEN_SEG);
+		seg = kmalloc(sizeof(*seg) + len, GFP_KERNEL);
+		if (!seg) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		seg->next = NULL;
+		*pseg = seg;
+		pseg = &seg->next;
+
+		ret = _ckpt_read_buffer(ctx, (seg + 1), len);
+		if (ret < 0)
+			goto out;
+		total -= len;
+	}
+
+	msg->m_type = h->m_type;
+	msg->m_ts = h->m_ts;
+	*clen = h->m_ts;
+	ret = security_msg_msg_alloc(msg);
+ out:
+	if (ret < 0 && msg)
+		free_msg(msg);
+	if (ret < 0)
+		msg = ERR_PTR(ret);
+	ckpt_hdr_put(ctx, h);
+	return msg;
+}
+
+static inline void free_msg_list(struct list_head *queue)
+{
+	struct msg_msg *msg, *tmp;
+
+	list_for_each_entry_safe(msg, tmp, queue, m_list)
+		free_msg(msg);
+}
+
+static int restore_msg_contents(struct ckpt_ctx *ctx, struct list_head *queue,
+				unsigned long qnum, unsigned long *cbytes)
+{
+	struct msg_msg *msg;
+	int clen = 0;
+	int ret = 0;
+
+	INIT_LIST_HEAD(queue);
+
+	*cbytes = 0;
+	while (qnum--) {
+		msg = restore_msg_contents_one(ctx, &clen);
+		if (IS_ERR(msg))
+			goto fail;
+		list_add_tail(&msg->m_list, queue);
+		*cbytes += clen;
+	}
+	return 0;
+ fail:
+	ret = PTR_ERR(msg);
+	/* the caller frees whatever is already on @queue */
+	return ret;
+}
+
+int restore_ipc_msg(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
+{
+	struct ckpt_hdr_ipc_msg *h;
+	struct kern_ipc_perm *perms;
+	struct msg_queue *msq;
+	struct ipc_ids *msg_ids = &ns->ids[IPC_MSG_IDS];
+	struct list_head messages;
+	unsigned long cbytes;
+	int msgflag;
+	int ret;
+
+	INIT_LIST_HEAD(&messages);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->perms.id < 0)
+		goto out;
+
+	/* read queued messages into temporary queue */
+	ret = restore_msg_contents(ctx, &messages, h->q_qnum, &cbytes);
+	if (ret < 0)
+		goto out;
+
+	ret = -EINVAL;
+	if (h->q_cbytes != cbytes)
+		goto out;
+
+	/* restore the message queue */
+	msgflag = h->perms.mode | IPC_CREAT | IPC_EXCL;
+	ckpt_debug("msg: do_msgget key %d flag %#x id %d\n",
+		 h->perms.key, msgflag, h->perms.id);
+	ret = do_msgget(ns, h->perms.key, msgflag, h->perms.id);
+	ckpt_debug("msg: do_msgget ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	down_write(&msg_ids->rw_mutex);
+
+	/* we are the sole owners/users of this ipc_ns, it can't go away */
+	perms = ipc_lock(msg_ids, h->perms.id);
+	BUG_ON(IS_ERR(perms));	/* ipc_ns is private to us */
+
+	msq = container_of(perms, struct msg_queue, q_perm);
+	BUG_ON(!list_empty(&msq->q_messages));	/* ipc_ns is private to us */
+
+	/* attach queued messages we read before */
+	list_splice_init(&messages, &msq->q_messages);
+
+	/* adjust msq and namespace statistics */
+	atomic_add(h->q_cbytes, &ns->msg_bytes);
+	atomic_add(h->q_qnum, &ns->msg_hdrs);
+	msq->q_cbytes = h->q_cbytes;
+	msq->q_qbytes = h->q_qbytes;
+	msq->q_qnum = h->q_qnum;
+
+	ret = load_ipc_msg_hdr(ctx, h, msq);
+
+	if (ret < 0) {
+		ckpt_debug("msq: need to remove (%d)\n", ret);
+		freeque(ns, perms);
+	} else
+		ipc_unlock(perms);
+	up_write(&msg_ids->rw_mutex);
+ out:
+	free_msg_list(&messages);  /* no-op if all ok, else cleanup msgs */
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/msg.c b/ipc/msg.c
index 1db7c45..3559d53 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -72,7 +72,6 @@ struct msg_sender {
 
 #define msg_unlock(msq)		ipc_unlock(&(msq)->q_perm)
 
-static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
 static int newque(struct ipc_namespace *, struct ipc_params *, int);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
@@ -278,7 +277,7 @@ static void expunge_all(struct msg_queue *msq, int res)
  * msg_ids.rw_mutex (writer) and the spinlock for this message queue are held
  * before freeque() is called. msg_ids.rw_mutex remains locked on exit.
  */
-static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct list_head *tmp;
 	struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
@@ -311,14 +310,11 @@ static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
 	return security_msg_queue_associate(msq, msgflg);
 }
 
-int do_msgget(key_t key, int msgflg, int req_id)
+int do_msgget(struct ipc_namespace *ns, key_t key, int msgflg, int req_id)
 {
-	struct ipc_namespace *ns;
 	struct ipc_ops msg_ops;
 	struct ipc_params msg_params;
 
-	ns = current->nsproxy->ipc_ns;
-
 	msg_ops.getnew = newque;
 	msg_ops.associate = msg_security;
 	msg_ops.more_checks = NULL;
@@ -331,7 +327,7 @@ int do_msgget(key_t key, int msgflg, int req_id)
 
 SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
 {
-	return do_msgget(key, msgflg, -1);
+	return do_msgget(current->nsproxy->ipc_ns, key, msgflg, -1);
 }
 
 static inline unsigned long
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index f095ee2..e119243 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -36,14 +36,6 @@ struct ipc_namespace init_ipc_ns = {
 
 atomic_t nr_ipc_ns = ATOMIC_INIT(1);
 
-struct msg_msgseg {
-	struct msg_msgseg* next;
-	/* the next part of the message follows immediately */
-};
-
-#define DATALEN_MSG	(PAGE_SIZE-sizeof(struct msg_msg))
-#define DATALEN_SEG	(PAGE_SIZE-sizeof(struct msg_msgseg))
-
 struct msg_msg *load_msg(const void __user *src, int len)
 {
 	struct msg_msg *msg;
diff --git a/ipc/util.h b/ipc/util.h
index e0007dc..8a223f0 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -141,6 +141,14 @@ extern void free_msg(struct msg_msg *msg);
 extern struct msg_msg *load_msg(const void __user *src, int len);
 extern int store_msg(void __user *dest, struct msg_msg *msg, int len);
 
+struct msg_msgseg {
+	struct msg_msgseg *next;
+	/* the next part of the message follows immediately */
+};
+
+#define DATALEN_MSG	(PAGE_SIZE-sizeof(struct msg_msg))
+#define DATALEN_SEG	(PAGE_SIZE-sizeof(struct msg_msgseg))
+
 extern void recompute_msgmni(struct ipc_namespace *);
 
 static inline int ipc_buildid(int id, int seq)
@@ -182,6 +190,8 @@ int do_shmget(struct ipc_namespace *ns, key_t key, size_t size, int shmflg,
 	      int req_id);
 void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
+int do_msgget(struct ipc_namespace *ns, key_t key, int msgflg, int req_id);
+void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
 #ifdef CONFIG_CHECKPOINT
 extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
@@ -192,6 +202,9 @@ extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 extern int ckpt_collect_ipc_shm(int id, void *p, void *data);
 extern int checkpoint_ipc_shm(int id, void *p, void *data);
 extern int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
+
+extern int checkpoint_ipc_msg(int id, void *p, void *data);
+extern int restore_ipc_msg(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
 #endif
 
 #endif
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 53/80] c/r: support message-queues sysv-ipc
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Checkpoint of sysvipc message-queues is performed by iterating through
all 'msq' objects and dumping the contents of each one. The messages
queued on each 'msq' are dumped with that object.

The messages of a specific queue are written one by one. The queue lock
cannot be held while dumping them, but the loop must be protected from
anyone else writing to or reading from the queue. To do that we grab
the lock, then hijack the entire chain of messages from the queue, drop
the lock, and then safely dump them in a loop. Finally, with the lock
held, we re-attach the chain while verifying that there isn't other
(new) data on that queue.

Writing the message contents themselves is straightforward. The code
is similar to that in ipc/msgutil.c, the main difference being that
we deal with kernel memory and not user memory.

Changelog[v17]:
  - Allocate security context for msg_msg
  - Restore objects in the right namespace
  - Don't unlock ipc before freeing

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/checkpoint_hdr.h |   20 +++
 ipc/Makefile                   |    3 +-
 ipc/checkpoint.c               |    2 +-
 ipc/checkpoint_msg.c           |  364 ++++++++++++++++++++++++++++++++++++++++
 ipc/msg.c                      |   10 +-
 ipc/msgutil.c                  |    8 -
 ipc/util.h                     |   13 ++
 7 files changed, 403 insertions(+), 17 deletions(-)
 create mode 100644 ipc/checkpoint_msg.c

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index a4cf7a9..93b6aed 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -78,6 +78,7 @@ enum {
 	CKPT_HDR_IPC = 501,
 	CKPT_HDR_IPC_SHM,
 	CKPT_HDR_IPC_MSG,
+	CKPT_HDR_IPC_MSG_MSG,
 	CKPT_HDR_IPC_SEM,
 
 	CKPT_HDR_TAIL = 9001,
@@ -392,6 +393,25 @@ struct ckpt_hdr_ipc_shm {
 	__u32 objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_msg {
+	struct ckpt_hdr h;
+	struct ckpt_hdr_ipc_perms perms;
+	__u64 q_stime;
+	__u64 q_rtime;
+	__u64 q_ctime;
+	__u64 q_cbytes;
+	__u64 q_qnum;
+	__u64 q_qbytes;
+	__s32 q_lspid;
+	__s32 q_lrpid;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc_msg_msg {
+	struct ckpt_hdr h;
+	__s32 m_type;
+	__u32 m_ts;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/ipc/Makefile b/ipc/Makefile
index db4b076..71a257f 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o checkpoint_shm.o
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o \
+		checkpoint_shm.o checkpoint_msg.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index b360861..588ed37 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -115,11 +115,11 @@ static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
 
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
 				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
diff --git a/ipc/checkpoint_msg.c b/ipc/checkpoint_msg.c
new file mode 100644
index 0000000..b933c19
--- /dev/null
+++ b/ipc/checkpoint_msg.c
@@ -0,0 +1,364 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc msg
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/msg.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/security.h>
+#include <linux/ipc_namespace.h>
+
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_msg_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_msg *h,
+			    struct msg_queue *msq)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&msq->q_perm);
+
+	ret = checkpoint_fill_ipc_perms(&h->perms, &msq->q_perm);
+	if (ret < 0)
+		goto unlock;
+
+	h->q_stime = msq->q_stime;
+	h->q_rtime = msq->q_rtime;
+	h->q_ctime = msq->q_ctime;
+	h->q_cbytes = msq->q_cbytes;
+	h->q_qnum = msq->q_qnum;
+	h->q_qbytes = msq->q_qbytes;
+	h->q_lspid = msq->q_lspid;
+	h->q_lrpid = msq->q_lrpid;
+
+ unlock:
+	ipc_unlock(&msq->q_perm);
+	ckpt_debug("msg: lspid %d lrpid %d qnum %lld qbytes %lld\n",
+		 h->q_lspid, h->q_lrpid, h->q_qnum, h->q_qbytes);
+
+	return ret;
+}
+
+static int checkpoint_msg_contents(struct ckpt_ctx *ctx, struct msg_msg *msg)
+{
+	struct ckpt_hdr_ipc_msg_msg *h;
+	struct msg_msgseg *seg;
+	int total, len;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG_MSG);
+	if (!h)
+		return -ENOMEM;
+
+	h->m_type = msg->m_type;
+	h->m_ts = msg->m_ts;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	total = msg->m_ts;
+	len = min(total, (int) DATALEN_MSG);
+	ret = ckpt_write_buffer(ctx, (msg + 1), len);
+	if (ret < 0)
+		return ret;
+
+	seg = msg->next;
+	total -= len;
+
+	while (total) {
+		len = min(total, (int) DATALEN_SEG);
+		ret = ckpt_write_buffer(ctx, (seg + 1), len);
+		if (ret < 0)
+			break;
+		seg = seg->next;
+		total -= len;
+	}
+
+	return ret;
+}
+
+static int checkpoint_msg_queue(struct ckpt_ctx *ctx, struct msg_queue *msq)
+{
+	struct list_head messages;
+	struct msg_msg *msg;
+	int ret = -EBUSY;
+
+	/*
+	 * Scanning the msq requires the lock, but then we can't write
+	 * data out from inside. Instead, we grab the lock, remove all
+	 * messages to our own list, drop the lock, write the messages,
+	 * and finally re-attach them to the msq with the lock taken.
+	 */
+	INIT_LIST_HEAD(&messages);	/* must be valid on every exit path */
+	ipc_lock_by_ptr(&msq->q_perm);
+	if (!list_empty(&msq->q_receivers))
+		goto unlock;
+	if (!list_empty(&msq->q_senders))
+		goto unlock;
+	if (list_empty(&msq->q_messages))
+		goto unlock;
+	/* temporarily take out all messages */
+	list_splice_init(&msq->q_messages, &messages);
+ unlock:
+	ipc_unlock(&msq->q_perm);
+
+	list_for_each_entry(msg, &messages, m_list) {
+		ret = checkpoint_msg_contents(ctx, msg);
+		if (ret < 0)
+			break;
+	}
+
+	/* put all the messages back in */
+	ipc_lock_by_ptr(&msq->q_perm);
+	list_splice(&messages, &msq->q_messages);
+	ipc_unlock(&msq->q_perm);
+
+	return ret;
+}
+
+int checkpoint_ipc_msg(int id, void *p, void *data)
+{
+	struct ckpt_hdr_ipc_msg *h;
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct msg_queue *msq;
+	int ret;
+
+	msq = container_of(perm, struct msg_queue, q_perm);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG);
+	if (!h)
+		return -ENOMEM;
+
+	ret = fill_ipc_msg_hdr(ctx, h, msq);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	if (h->q_qnum)
+		ret = checkpoint_msg_queue(ctx, msq);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+
+/************************************************************************
+ * ipc restart
+ */
+
+static int load_ipc_msg_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_msg *h,
+			    struct msg_queue *msq)
+{
+	int ret = 0;
+
+	ret = restore_load_ipc_perms(&h->perms, &msq->q_perm);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("msq: lspid %d lrpid %d qnum %lld qbytes %lld\n",
+		 h->q_lspid, h->q_lrpid, h->q_qnum, h->q_qbytes);
+
+	if (h->q_lspid < 0 || h->q_lrpid < 0)
+		return -EINVAL;
+
+	msq->q_stime = h->q_stime;
+	msq->q_rtime = h->q_rtime;
+	msq->q_ctime = h->q_ctime;
+	msq->q_lspid = h->q_lspid;
+	msq->q_lrpid = h->q_lrpid;
+
+	return 0;
+}
+
+static struct msg_msg *restore_msg_contents_one(struct ckpt_ctx *ctx, int *clen)
+{
+	struct ckpt_hdr_ipc_msg_msg *h;
+	struct msg_msg *msg = NULL;
+	struct msg_msgseg *seg, **pseg;
+	int total, len;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG_MSG);
+	if (IS_ERR(h))
+		return (struct msg_msg *) h;
+
+	ret = -EINVAL;
+	if (h->m_type < 1)
+		goto out;
+	if (h->m_ts > current->nsproxy->ipc_ns->msg_ctlmax)
+		goto out;
+
+	total = h->m_ts;
+	len = min(total, (int) DATALEN_MSG);
+	msg = kmalloc(sizeof(*msg) + len, GFP_KERNEL);
+	if (!msg) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	msg->next = NULL;
+	pseg = &msg->next;
+
+	ret = _ckpt_read_buffer(ctx, (msg + 1), len);
+	if (ret < 0)
+		goto out;
+
+	total -= len;
+	while (total) {
+		len = min(total, (int) DATALEN_SEG);
+		seg = kmalloc(sizeof(*seg) + len, GFP_KERNEL);
+		if (!seg) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		seg->next = NULL;
+		*pseg = seg;
+		pseg = &seg->next;
+
+		ret = _ckpt_read_buffer(ctx, (seg + 1), len);
+		if (ret < 0)
+			goto out;
+		total -= len;
+	}
+
+	msg->m_type = h->m_type;
+	msg->m_ts = h->m_ts;
+	*clen = h->m_ts;
+	ret = security_msg_msg_alloc(msg);
+ out:
+	if (ret < 0 && msg)
+		free_msg(msg);
+	if (ret < 0)
+		msg = ERR_PTR(ret);
+	ckpt_hdr_put(ctx, h);
+	return msg;
+}
+
+static inline void free_msg_list(struct list_head *queue)
+{
+	struct msg_msg *msg, *tmp;
+
+	list_for_each_entry_safe(msg, tmp, queue, m_list)
+		free_msg(msg);
+}
+
+static int restore_msg_contents(struct ckpt_ctx *ctx, struct list_head *queue,
+				unsigned long qnum, unsigned long *cbytes)
+{
+	struct msg_msg *msg;
+	int clen = 0;
+	int ret = 0;
+
+	INIT_LIST_HEAD(queue);
+
+	*cbytes = 0;
+	while (qnum--) {
+		msg = restore_msg_contents_one(ctx, &clen);
+		if (IS_ERR(msg))
+			goto fail;
+		list_add_tail(&msg->m_list, queue);
+		*cbytes += clen;
+	}
+	return 0;
+ fail:
+	ret = PTR_ERR(msg);
+	/* the caller frees whatever is already on @queue */
+	return ret;
+}
+
+int restore_ipc_msg(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
+{
+	struct ckpt_hdr_ipc_msg *h;
+	struct kern_ipc_perm *perms;
+	struct msg_queue *msq;
+	struct ipc_ids *msg_ids = &ns->ids[IPC_MSG_IDS];
+	struct list_head messages;
+	unsigned long cbytes;
+	int msgflag;
+	int ret;
+
+	INIT_LIST_HEAD(&messages);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->perms.id < 0)
+		goto out;
+
+	/* read queued messages into temporary queue */
+	ret = restore_msg_contents(ctx, &messages, h->q_qnum, &cbytes);
+	if (ret < 0)
+		goto out;
+
+	ret = -EINVAL;
+	if (h->q_cbytes != cbytes)
+		goto out;
+
+	/* restore the message queue */
+	msgflag = h->perms.mode | IPC_CREAT | IPC_EXCL;
+	ckpt_debug("msg: do_msgget key %d flag %#x id %d\n",
+		 h->perms.key, msgflag, h->perms.id);
+	ret = do_msgget(ns, h->perms.key, msgflag, h->perms.id);
+	ckpt_debug("msg: do_msgget ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	down_write(&msg_ids->rw_mutex);
+
+	/* we are the sole owners/users of this ipc_ns, it can't go away */
+	perms = ipc_lock(msg_ids, h->perms.id);
+	BUG_ON(IS_ERR(perms));	/* ipc_ns is private to us */
+
+	msq = container_of(perms, struct msg_queue, q_perm);
+	BUG_ON(!list_empty(&msq->q_messages));	/* ipc_ns is private to us */
+
+	/* attach queued messages we read before */
+	list_splice_init(&messages, &msq->q_messages);
+
+	/* adjust msq and namespace statistics */
+	atomic_add(h->q_cbytes, &ns->msg_bytes);
+	atomic_add(h->q_qnum, &ns->msg_hdrs);
+	msq->q_cbytes = h->q_cbytes;
+	msq->q_qbytes = h->q_qbytes;
+	msq->q_qnum = h->q_qnum;
+
+	ret = load_ipc_msg_hdr(ctx, h, msq);
+
+	if (ret < 0) {
+		ckpt_debug("msq: need to remove (%d)\n", ret);
+		freeque(ns, perms);
+	} else
+		ipc_unlock(perms);
+	up_write(&msg_ids->rw_mutex);
+ out:
+	free_msg_list(&messages);  /* no-op if all ok, else cleanup msgs */
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/msg.c b/ipc/msg.c
index 1db7c45..3559d53 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -72,7 +72,6 @@ struct msg_sender {
 
 #define msg_unlock(msq)		ipc_unlock(&(msq)->q_perm)
 
-static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
 static int newque(struct ipc_namespace *, struct ipc_params *, int);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
@@ -278,7 +277,7 @@ static void expunge_all(struct msg_queue *msq, int res)
  * msg_ids.rw_mutex (writer) and the spinlock for this message queue are held
  * before freeque() is called. msg_ids.rw_mutex remains locked on exit.
  */
-static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct list_head *tmp;
 	struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
@@ -311,14 +310,11 @@ static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
 	return security_msg_queue_associate(msq, msgflg);
 }
 
-int do_msgget(key_t key, int msgflg, int req_id)
+int do_msgget(struct ipc_namespace *ns, key_t key, int msgflg, int req_id)
 {
-	struct ipc_namespace *ns;
 	struct ipc_ops msg_ops;
 	struct ipc_params msg_params;
 
-	ns = current->nsproxy->ipc_ns;
-
 	msg_ops.getnew = newque;
 	msg_ops.associate = msg_security;
 	msg_ops.more_checks = NULL;
@@ -331,7 +327,7 @@ int do_msgget(key_t key, int msgflg, int req_id)
 
 SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
 {
-	return do_msgget(key, msgflg, -1);
+	return do_msgget(current->nsproxy->ipc_ns, key, msgflg, -1);
 }
 
 static inline unsigned long
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index f095ee2..e119243 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -36,14 +36,6 @@ struct ipc_namespace init_ipc_ns = {
 
 atomic_t nr_ipc_ns = ATOMIC_INIT(1);
 
-struct msg_msgseg {
-	struct msg_msgseg* next;
-	/* the next part of the message follows immediately */
-};
-
-#define DATALEN_MSG	(PAGE_SIZE-sizeof(struct msg_msg))
-#define DATALEN_SEG	(PAGE_SIZE-sizeof(struct msg_msgseg))
-
 struct msg_msg *load_msg(const void __user *src, int len)
 {
 	struct msg_msg *msg;
diff --git a/ipc/util.h b/ipc/util.h
index e0007dc..8a223f0 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -141,6 +141,14 @@ extern void free_msg(struct msg_msg *msg);
 extern struct msg_msg *load_msg(const void __user *src, int len);
 extern int store_msg(void __user *dest, struct msg_msg *msg, int len);
 
+struct msg_msgseg {
+	struct msg_msgseg *next;
+	/* the next part of the message follows immediately */
+};
+
+#define DATALEN_MSG	(PAGE_SIZE-sizeof(struct msg_msg))
+#define DATALEN_SEG	(PAGE_SIZE-sizeof(struct msg_msgseg))
+
 extern void recompute_msgmni(struct ipc_namespace *);
 
 static inline int ipc_buildid(int id, int seq)
@@ -182,6 +190,8 @@ int do_shmget(struct ipc_namespace *ns, key_t key, size_t size, int shmflg,
 	      int req_id);
 void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
+int do_msgget(struct ipc_namespace *ns, key_t key, int msgflg, int req_id);
+void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
 #ifdef CONFIG_CHECKPOINT
 extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
@@ -192,6 +202,9 @@ extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 extern int ckpt_collect_ipc_shm(int id, void *p, void *data);
 extern int checkpoint_ipc_shm(int id, void *p, void *data);
 extern int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
+
+extern int checkpoint_ipc_msg(int id, void *p, void *data);
+extern int restore_ipc_msg(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
 #endif
 
 #endif
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 53/80] c/r: support message-queues sysv-ipc
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Checkpoint of sysvipc message-queues is performed by iterating through
all 'msq' objects and dumping the contents of each one. The messages
queued on each 'msq' are dumped with that object.

Messages of a specific queue are written one by one. The queue lock
cannot be held while dumping them, but the loop must be protected from
someone (who?) writing or reading. To do that we grab the lock, then
hijack the entire chain of messages from the queue, drop the lock,
and then safely dump them in a loop. Finally, with the lock held, we
re-attach the chain while verifying that there isn't other (new) data
on that queue.
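
The detach/dump/re-attach sequence described above can be sketched in
userspace as follows. This is only an illustration under assumptions:
a pthread mutex stands in for the ipc lock, a singly linked list for
the msq message list, and the names queue_detach(), queue_reattach()
and queue_dump() are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins for msg_msg and msg_queue. */
struct msg {
	struct msg *next;
	long type;
	const char *text;
};

struct queue {
	pthread_mutex_t lock;
	struct msg *head;
};

/* Take the whole chain out under the lock; the queue is left empty. */
static struct msg *queue_detach(struct queue *q)
{
	struct msg *chain;

	pthread_mutex_lock(&q->lock);
	chain = q->head;
	q->head = NULL;
	pthread_mutex_unlock(&q->lock);
	return chain;
}

/* Put the chain back; fail if new data showed up in the meantime. */
static int queue_reattach(struct queue *q, struct msg *chain)
{
	int ret = 0;

	pthread_mutex_lock(&q->lock);
	if (q->head)
		ret = -1;
	else
		q->head = chain;
	pthread_mutex_unlock(&q->lock);
	return ret;
}

/* Dump all messages without holding the lock across the (slow) writes. */
static int queue_dump(struct queue *q, FILE *out)
{
	struct msg *chain, *m;
	int count = 0;

	assert(q && out);
	chain = queue_detach(q);
	for (m = chain; m; m = m->next) {
		fprintf(out, "type %ld: %s\n", m->type, m->text);
		count++;
	}
	if (queue_reattach(q, chain) < 0)
		return -1;
	return count;
}
```

The point of the design is that the slow writes run on a private
chain, so the lock is only held for two short pointer swaps.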

Writing the message contents themselves is straightforward. The code
is similar to that in ipc/msgutil.c, the main difference being that
we deal with kernel memory and not user memory.
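
As a rough userspace sketch of the layout being walked: a message
keeps its first chunk right after the header and the remainder in a
chain of segments, each with its payload right after the segment
header. HEAD_MAX and SEG_MAX are small stand-ins for DATALEN_MSG and
DATALEN_SEG, and msg_build(), msg_flatten() and msg_free() are
hypothetical helpers, not functions from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HEAD_MAX 8	/* stand-in for DATALEN_MSG */
#define SEG_MAX  12	/* stand-in for DATALEN_SEG */

struct seg {
	struct seg *next;
	/* payload follows immediately */
};

struct msg {
	struct seg *next;
	size_t m_ts;	/* total payload size */
	/* first chunk of payload follows immediately */
};

static size_t min_sz(size_t a, size_t b)
{
	return a < b ? a : b;
}

static void msg_free(struct msg *msg)
{
	struct seg *seg, *tmp;

	if (!msg)
		return;
	for (seg = msg->next; seg; seg = tmp) {
		tmp = seg->next;
		free(seg);
	}
	free(msg);
}

/* Split a flat buffer into a head chunk plus a chain of segments. */
static struct msg *msg_build(const void *src, size_t total)
{
	const char *p = src;
	struct seg *seg, **pseg;
	struct msg *msg;
	size_t len = min_sz(total, HEAD_MAX);

	msg = malloc(sizeof(*msg) + len);
	if (!msg)
		return NULL;
	msg->next = NULL;
	msg->m_ts = total;
	memcpy(msg + 1, p, len);
	pseg = &msg->next;
	p += len;
	total -= len;

	while (total) {
		len = min_sz(total, SEG_MAX);
		seg = malloc(sizeof(*seg) + len);
		if (!seg) {
			msg_free(msg);
			return NULL;
		}
		seg->next = NULL;
		memcpy(seg + 1, p, len);
		*pseg = seg;
		pseg = &seg->next;
		p += len;
		total -= len;
	}
	return msg;
}

/* Walk head and segments in order, copying payload to a flat buffer. */
static size_t msg_flatten(const struct msg *msg, void *dst)
{
	const struct seg *seg;
	char *p = dst;
	size_t total, len;

	assert(msg != NULL);
	total = msg->m_ts;
	len = min_sz(total, HEAD_MAX);
	memcpy(p, msg + 1, len);
	p += len;
	total -= len;

	for (seg = msg->next; total; seg = seg->next) {
		len = min_sz(total, SEG_MAX);
		memcpy(p, seg + 1, len);
		p += len;
		total -= len;
	}
	return (size_t)(p - (char *)dst);
}
```

With these stand-in sizes a 30-byte payload ends up as 8 bytes in the
head and two segments of 12 and 10 bytes.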

Changelog[v17]:
  - Allocate security context for msg_msg
  - Restore objects in the right namespace
  - Don't unlock ipc before freeing

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/checkpoint_hdr.h |   20 +++
 ipc/Makefile                   |    3 +-
 ipc/checkpoint.c               |    2 +-
 ipc/checkpoint_msg.c           |  364 ++++++++++++++++++++++++++++++++++++++++
 ipc/msg.c                      |   10 +-
 ipc/msgutil.c                  |    8 -
 ipc/util.h                     |   13 ++
 7 files changed, 403 insertions(+), 17 deletions(-)
 create mode 100644 ipc/checkpoint_msg.c

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index a4cf7a9..93b6aed 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -78,6 +78,7 @@ enum {
 	CKPT_HDR_IPC = 501,
 	CKPT_HDR_IPC_SHM,
 	CKPT_HDR_IPC_MSG,
+	CKPT_HDR_IPC_MSG_MSG,
 	CKPT_HDR_IPC_SEM,
 
 	CKPT_HDR_TAIL = 9001,
@@ -392,6 +393,25 @@ struct ckpt_hdr_ipc_shm {
 	__u32 objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_msg {
+	struct ckpt_hdr h;
+	struct ckpt_hdr_ipc_perms perms;
+	__u64 q_stime;
+	__u64 q_rtime;
+	__u64 q_ctime;
+	__u64 q_cbytes;
+	__u64 q_qnum;
+	__u64 q_qbytes;
+	__s32 q_lspid;
+	__s32 q_lrpid;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc_msg_msg {
+	struct ckpt_hdr h;
+	__s32 m_type;
+	__u32 m_ts;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/ipc/Makefile b/ipc/Makefile
index db4b076..71a257f 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o checkpoint_shm.o
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o \
+		checkpoint_shm.o checkpoint_msg.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index b360861..588ed37 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -115,11 +115,11 @@ static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
 
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
 				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
diff --git a/ipc/checkpoint_msg.c b/ipc/checkpoint_msg.c
new file mode 100644
index 0000000..b933c19
--- /dev/null
+++ b/ipc/checkpoint_msg.c
@@ -0,0 +1,364 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc msg
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/msg.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/security.h>
+#include <linux/ipc_namespace.h>
+
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_msg_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_msg *h,
+			    struct msg_queue *msq)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&msq->q_perm);
+
+	ret = checkpoint_fill_ipc_perms(&h->perms, &msq->q_perm);
+	if (ret < 0)
+		goto unlock;
+
+	h->q_stime = msq->q_stime;
+	h->q_rtime = msq->q_rtime;
+	h->q_ctime = msq->q_ctime;
+	h->q_cbytes = msq->q_cbytes;
+	h->q_qnum = msq->q_qnum;
+	h->q_qbytes = msq->q_qbytes;
+	h->q_lspid = msq->q_lspid;
+	h->q_lrpid = msq->q_lrpid;
+
+ unlock:
+	ipc_unlock(&msq->q_perm);
+	ckpt_debug("msg: lspid %d lrpid %d qnum %lld qbytes %lld\n",
+		 h->q_lspid, h->q_lrpid, h->q_qnum, h->q_qbytes);
+
+	return ret;
+}
+
+static int checkpoint_msg_contents(struct ckpt_ctx *ctx, struct msg_msg *msg)
+{
+	struct ckpt_hdr_ipc_msg_msg *h;
+	struct msg_msgseg *seg;
+	int total, len;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG_MSG);
+	if (!h)
+		return -ENOMEM;
+
+	h->m_type = msg->m_type;
+	h->m_ts = msg->m_ts;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	total = msg->m_ts;
+	len = min(total, (int) DATALEN_MSG);
+	ret = ckpt_write_buffer(ctx, (msg + 1), len);
+	if (ret < 0)
+		return ret;
+
+	seg = msg->next;
+	total -= len;
+
+	while (total) {
+		len = min(total, (int) DATALEN_SEG);
+		ret = ckpt_write_buffer(ctx, (seg + 1), len);
+		if (ret < 0)
+			break;
+		seg = seg->next;
+		total -= len;
+	}
+
+	return ret;
+}
+
+static int checkpoint_msg_queue(struct ckpt_ctx *ctx, struct msg_queue *msq)
+{
+	struct list_head messages;
+	struct msg_msg *msg;
+	int ret = -EBUSY;
+
+	/*
+	 * Scanning the msq requires the lock, but then we can't write
+	 * data out from inside. Instead, we grab the lock, remove all
+	 * messages to our own list, drop the lock, write the messages,
+	 * and finally re-attach them to the msq with the lock taken.
+	 */
+	INIT_LIST_HEAD(&messages);
+	ipc_lock_by_ptr(&msq->q_perm);
+	if (!list_empty(&msq->q_receivers))
+		goto unlock;
+	if (!list_empty(&msq->q_senders))
+		goto unlock;
+	if (list_empty(&msq->q_messages))
+		goto unlock;
+	/* temporarily take out all messages */
+	list_splice_init(&msq->q_messages, &messages);
+ unlock:
+	ipc_unlock(&msq->q_perm);
+
+	list_for_each_entry(msg, &messages, m_list) {
+		ret = checkpoint_msg_contents(ctx, msg);
+		if (ret < 0)
+			break;
+	}
+
+	/* put all the messages back in */
+	ipc_lock_by_ptr(&msq->q_perm);
+	list_splice(&messages, &msq->q_messages);
+	ipc_unlock(&msq->q_perm);
+
+	return ret;
+}
+
+int checkpoint_ipc_msg(int id, void *p, void *data)
+{
+	struct ckpt_hdr_ipc_msg *h;
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct msg_queue *msq;
+	int ret;
+
+	msq = container_of(perm, struct msg_queue, q_perm);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG);
+	if (!h)
+		return -ENOMEM;
+
+	ret = fill_ipc_msg_hdr(ctx, h, msq);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	if (h->q_qnum)
+		ret = checkpoint_msg_queue(ctx, msq);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+
+/************************************************************************
+ * ipc restart
+ */
+
+static int load_ipc_msg_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_msg *h,
+			    struct msg_queue *msq)
+{
+	int ret = 0;
+
+	ret = restore_load_ipc_perms(&h->perms, &msq->q_perm);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("msq: lspid %d lrpid %d qnum %lld qbytes %lld\n",
+		 h->q_lspid, h->q_lrpid, h->q_qnum, h->q_qbytes);
+
+	if (h->q_lspid < 0 || h->q_lrpid < 0)
+		return -EINVAL;
+
+	msq->q_stime = h->q_stime;
+	msq->q_rtime = h->q_rtime;
+	msq->q_ctime = h->q_ctime;
+	msq->q_lspid = h->q_lspid;
+	msq->q_lrpid = h->q_lrpid;
+
+	return 0;
+}
+
+static struct msg_msg *restore_msg_contents_one(struct ckpt_ctx *ctx, int *clen)
+{
+	struct ckpt_hdr_ipc_msg_msg *h;
+	struct msg_msg *msg = NULL;
+	struct msg_msgseg *seg, **pseg;
+	int total, len;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG_MSG);
+	if (IS_ERR(h))
+		return (struct msg_msg *) h;
+
+	ret = -EINVAL;
+	if (h->m_type < 1)
+		goto out;
+	if (h->m_ts > current->nsproxy->ipc_ns->msg_ctlmax)
+		goto out;
+
+	total = h->m_ts;
+	len = min(total, (int) DATALEN_MSG);
+	msg = kmalloc(sizeof(*msg) + len, GFP_KERNEL);
+	if (!msg) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	msg->next = NULL;
+	pseg = &msg->next;
+
+	ret = _ckpt_read_buffer(ctx, (msg + 1), len);
+	if (ret < 0)
+		goto out;
+
+	total -= len;
+	while (total) {
+		len = min(total, (int) DATALEN_SEG);
+		seg = kmalloc(sizeof(*seg) + len, GFP_KERNEL);
+		if (!seg) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		seg->next = NULL;
+		*pseg = seg;
+		pseg = &seg->next;
+
+		ret = _ckpt_read_buffer(ctx, (seg + 1), len);
+		if (ret < 0)
+			goto out;
+		total -= len;
+	}
+
+	msg->m_type = h->m_type;
+	msg->m_ts = h->m_ts;
+	*clen = h->m_ts;
+	ret = security_msg_msg_alloc(msg);
+ out:
+	if (ret < 0 && msg)
+		free_msg(msg);
+	if (ret < 0)
+		msg = ERR_PTR(ret);
+	ckpt_hdr_put(ctx, h);
+	return msg;
+}
+
+static inline void free_msg_list(struct list_head *queue)
+{
+	struct msg_msg *msg, *tmp;
+
+	list_for_each_entry_safe(msg, tmp, queue, m_list)
+		free_msg(msg);
+}
+
+static int restore_msg_contents(struct ckpt_ctx *ctx, struct list_head *queue,
+				unsigned long qnum, unsigned long *cbytes)
+{
+	struct msg_msg *msg;
+	int clen = 0;
+	int ret = 0;
+
+	INIT_LIST_HEAD(queue);
+
+	*cbytes = 0;
+	while (qnum--) {
+		msg = restore_msg_contents_one(ctx, &clen);
+		if (IS_ERR(msg))
+			goto fail;
+		list_add_tail(&msg->m_list, queue);
+		*cbytes += clen;
+	}
+	return 0;
+ fail:
+	ret = PTR_ERR(msg);
+	free_msg_list(queue);
+	return ret;
+}
+
+int restore_ipc_msg(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
+{
+	struct ckpt_hdr_ipc_msg *h;
+	struct kern_ipc_perm *perms;
+	struct msg_queue *msq;
+	struct ipc_ids *msg_ids = &ns->ids[IPC_MSG_IDS];
+	struct list_head messages;
+	unsigned long cbytes;
+	int msgflag;
+	int ret;
+
+	INIT_LIST_HEAD(&messages);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->perms.id < 0)
+		goto out;
+
+	/* read queued messages into temporary queue */
+	ret = restore_msg_contents(ctx, &messages, h->q_qnum, &cbytes);
+	if (ret < 0)
+		goto out;
+
+	ret = -EINVAL;
+	if (h->q_cbytes != cbytes)
+		goto out;
+
+	/* restore the message queue */
+	msgflag = h->perms.mode | IPC_CREAT | IPC_EXCL;
+	ckpt_debug("msg: do_msgget key %d flag %#x id %d\n",
+		 h->perms.key, msgflag, h->perms.id);
+	ret = do_msgget(ns, h->perms.key, msgflag, h->perms.id);
+	ckpt_debug("msg: do_msgget ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	down_write(&msg_ids->rw_mutex);
+
+	/* we are the sole owners/users of this ipc_ns, it can't go away */
+	perms = ipc_lock(msg_ids, h->perms.id);
+	BUG_ON(IS_ERR(perms));	/* ipc_ns is private to us */
+
+	msq = container_of(perms, struct msg_queue, q_perm);
+	BUG_ON(!list_empty(&msq->q_messages));	/* ipc_ns is private to us */
+
+	/* attach queued messages we read before */
+	list_splice_init(&messages, &msq->q_messages);
+
+	/* adjust msq and namespace statistics */
+	atomic_add(h->q_cbytes, &ns->msg_bytes);
+	atomic_add(h->q_qnum, &ns->msg_hdrs);
+	msq->q_cbytes = h->q_cbytes;
+	msq->q_qbytes = h->q_qbytes;
+	msq->q_qnum = h->q_qnum;
+
+	ret = load_ipc_msg_hdr(ctx, h, msq);
+
+	if (ret < 0) {
+		ckpt_debug("msq: need to remove (%d)\n", ret);
+		freeque(ns, perms);
+	} else
+		ipc_unlock(perms);
+	up_write(&msg_ids->rw_mutex);
+ out:
+	free_msg_list(&messages);  /* no-op if all ok, else cleanup msgs */
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/msg.c b/ipc/msg.c
index 1db7c45..3559d53 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -72,7 +72,6 @@ struct msg_sender {
 
 #define msg_unlock(msq)		ipc_unlock(&(msq)->q_perm)
 
-static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
 static int newque(struct ipc_namespace *, struct ipc_params *, int);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
@@ -278,7 +277,7 @@ static void expunge_all(struct msg_queue *msq, int res)
  * msg_ids.rw_mutex (writer) and the spinlock for this message queue are held
  * before freeque() is called. msg_ids.rw_mutex remains locked on exit.
  */
-static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct list_head *tmp;
 	struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
@@ -311,14 +310,11 @@ static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
 	return security_msg_queue_associate(msq, msgflg);
 }
 
-int do_msgget(key_t key, int msgflg, int req_id)
+int do_msgget(struct ipc_namespace *ns, key_t key, int msgflg, int req_id)
 {
-	struct ipc_namespace *ns;
 	struct ipc_ops msg_ops;
 	struct ipc_params msg_params;
 
-	ns = current->nsproxy->ipc_ns;
-
 	msg_ops.getnew = newque;
 	msg_ops.associate = msg_security;
 	msg_ops.more_checks = NULL;
@@ -331,7 +327,7 @@ int do_msgget(key_t key, int msgflg, int req_id)
 
 SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
 {
-	return do_msgget(key, msgflg, -1);
+	return do_msgget(current->nsproxy->ipc_ns, key, msgflg, -1);
 }
 
 static inline unsigned long
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index f095ee2..e119243 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -36,14 +36,6 @@ struct ipc_namespace init_ipc_ns = {
 
 atomic_t nr_ipc_ns = ATOMIC_INIT(1);
 
-struct msg_msgseg {
-	struct msg_msgseg* next;
-	/* the next part of the message follows immediately */
-};
-
-#define DATALEN_MSG	(PAGE_SIZE-sizeof(struct msg_msg))
-#define DATALEN_SEG	(PAGE_SIZE-sizeof(struct msg_msgseg))
-
 struct msg_msg *load_msg(const void __user *src, int len)
 {
 	struct msg_msg *msg;
diff --git a/ipc/util.h b/ipc/util.h
index e0007dc..8a223f0 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -141,6 +141,14 @@ extern void free_msg(struct msg_msg *msg);
 extern struct msg_msg *load_msg(const void __user *src, int len);
 extern int store_msg(void __user *dest, struct msg_msg *msg, int len);
 
+struct msg_msgseg {
+	struct msg_msgseg *next;
+	/* the next part of the message follows immediately */
+};
+
+#define DATALEN_MSG	(PAGE_SIZE-sizeof(struct msg_msg))
+#define DATALEN_SEG	(PAGE_SIZE-sizeof(struct msg_msgseg))
+
 extern void recompute_msgmni(struct ipc_namespace *);
 
 static inline int ipc_buildid(int id, int seq)
@@ -182,6 +190,8 @@ int do_shmget(struct ipc_namespace *ns, key_t key, size_t size, int shmflg,
 	      int req_id);
 void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
+int do_msgget(struct ipc_namespace *ns, key_t key, int msgflg, int req_id);
+void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
 #ifdef CONFIG_CHECKPOINT
 extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
@@ -192,6 +202,9 @@ extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 extern int ckpt_collect_ipc_shm(int id, void *p, void *data);
 extern int checkpoint_ipc_shm(int id, void *p, void *data);
 extern int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
+
+extern int checkpoint_ipc_msg(int id, void *p, void *data);
+extern int restore_ipc_msg(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
 #endif
 
 #endif
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 54/80] c/r: support semaphore sysv-ipc
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (52 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 53/80] c/r: support message-queues sysv-ipc Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 55/80] c/r: (s390): expose a constant for the number of words (CRs) Oren Laadan
                     ` (27 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Checkpoint of sysvipc semaphores is performed by iterating through all
sem objects and dumping the contents of each one. The semaphore array
of each sem is dumped with that object.

The semaphore array (sem->sem_base) holds an array of 'struct sem',
which is a {int, int}. Because this translates into the same format
on 32- and 64-bit architectures, the checkpoint format is simply the
dump of this array as is.
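
To illustrate why the array can be dumped as is, here is a small
userspace sketch. It assumes a fixed-width mirror of 'struct sem'
(struct sem_image is a hypothetical name) and two hypothetical
helpers, sem_dump() and sem_validate(); it says nothing about byte
order, which a raw dump does not normalize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-width mirror of 'struct sem' ({int, int}). */
struct sem_image {
	int32_t semval;
	int32_t sempid;
};

/* Two 32-bit fields with no padding: same size on 32- and 64-bit. */
_Static_assert(sizeof(struct sem_image) == 8,
	       "sem_image must be exactly 8 bytes");

/* Copy the array verbatim; returns bytes written, 0 if it won't fit. */
static size_t sem_dump(const struct sem_image *sems, size_t n,
		       void *buf, size_t buflen)
{
	assert(sems || n == 0);
	if (n > buflen / sizeof(*sems))
		return 0;
	memcpy(buf, sems, n * sizeof(*sems));
	return n * sizeof(*sems);
}

/* Reject images with negative values, as the restore path does. */
static int sem_validate(const struct sem_image *sems, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (sems[i].semval < 0 || sems[i].sempid < 0)
			return -1;
	}
	return 0;
}
```

Checking the image on the way back in matters because the checkpoint
file is untrusted input at restart time.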

TODO: this patch does not handle semaphore-undo -- this data should be
saved per-task while iterating through the tasks.

Changelog[v18]:
  - Handle kmalloc failure in restore_sem_array()
Changelog[v17]:
  - Restore objects in the right namespace
  - Forward declare struct msg_msg (instead of include linux/msg.h)
  - Fix typo in comment
  - Don't unlock ipc before calling freeary in error path

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/checkpoint_hdr.h |    8 ++
 ipc/Makefile                   |    2 +-
 ipc/checkpoint.c               |    4 -
 ipc/checkpoint_sem.c           |  221 ++++++++++++++++++++++++++++++++++++++++
 ipc/sem.c                      |   11 +--
 ipc/util.h                     |    8 ++
 6 files changed, 242 insertions(+), 12 deletions(-)
 create mode 100644 ipc/checkpoint_sem.c

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 93b6aed..cb7dfc8 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -412,6 +412,14 @@ struct ckpt_hdr_ipc_msg_msg {
 	__u32 m_ts;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_sem {
+	struct ckpt_hdr h;
+	struct ckpt_hdr_ipc_perms perms;
+	__u64 sem_otime;
+	__u64 sem_ctime;
+	__u32 sem_nsems;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/ipc/Makefile b/ipc/Makefile
index 71a257f..3ecba9e 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -10,4 +10,4 @@ obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
 obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o \
-		checkpoint_shm.o checkpoint_msg.o
+		checkpoint_shm.o checkpoint_msg.o checkpoint_sem.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 588ed37..8e6e9ba 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -119,12 +119,10 @@ static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
 				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
 				 CKPT_HDR_IPC_SEM, checkpoint_ipc_sem);
-#endif
 	return ret;
 }
 
@@ -309,7 +307,6 @@ static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
 
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
@@ -318,7 +315,6 @@ static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
 			      CKPT_HDR_IPC_SEM, restore_ipc_sem);
-#endif
 	if (ret < 0)
 		goto out;
 
diff --git a/ipc/checkpoint_sem.c b/ipc/checkpoint_sem.c
new file mode 100644
index 0000000..76eb2b9
--- /dev/null
+++ b/ipc/checkpoint_sem.c
@@ -0,0 +1,221 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc sem
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/sem.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+
+struct msg_msg;
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_sem_hdr(struct ckpt_ctx *ctx,
+			       struct ckpt_hdr_ipc_sem *h,
+			       struct sem_array *sem)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&sem->sem_perm);
+
+	ret = checkpoint_fill_ipc_perms(&h->perms, &sem->sem_perm);
+	if (ret < 0)
+		goto unlock;
+
+	h->sem_otime = sem->sem_otime;
+	h->sem_ctime = sem->sem_ctime;
+	h->sem_nsems = sem->sem_nsems;
+
+ unlock:
+	ipc_unlock(&sem->sem_perm);
+	ckpt_debug("sem: nsems %u\n", h->sem_nsems);
+
+	return ret;
+}
+
+/**
+ * checkpoint_sem_array - dump the state of a semaphore array
+ * @ctx: checkpoint context
+ * @sem: semaphore array
+ *
+ * The state of a semaphore is an array of 'struct sem'. This structure
+ * is {int, int}, which translates to the same format {32 bits, 32 bits}
+ * on both 32- and 64-bit architectures. So we simply dump the array.
+ *
+ * The sem-undo information is not saved per ipc_ns, but rather per task.
+ */
+static int checkpoint_sem_array(struct ckpt_ctx *ctx, struct sem_array *sem)
+{
+	/* this is a "best-effort" test, so lock not needed */
+	if (!list_empty(&sem->sem_pending))
+		return -EBUSY;
+
+	/* our caller holds the mutex, so this is safe */
+	return ckpt_write_buffer(ctx, sem->sem_base,
+			       sem->sem_nsems * sizeof(*sem->sem_base));
+}
+
+int checkpoint_ipc_sem(int id, void *p, void *data)
+{
+	struct ckpt_hdr_ipc_sem *h;
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct sem_array *sem;
+	int ret;
+
+	sem = container_of(perm, struct sem_array, sem_perm);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_SEM);
+	if (!h)
+		return -ENOMEM;
+
+	ret = fill_ipc_sem_hdr(ctx, h, sem);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	if (h->sem_nsems)
+		ret = checkpoint_sem_array(ctx, sem);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/************************************************************************
+ * ipc restart
+ */
+
+static int load_ipc_sem_hdr(struct ckpt_ctx *ctx,
+			       struct ckpt_hdr_ipc_sem *h,
+			       struct sem_array *sem)
+{
+	int ret = 0;
+
+	ret = restore_load_ipc_perms(&h->perms, &sem->sem_perm);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("sem: nsems %u\n", h->sem_nsems);
+
+	sem->sem_otime = h->sem_otime;
+	sem->sem_ctime = h->sem_ctime;
+	sem->sem_nsems = h->sem_nsems;
+
+	return 0;
+}
+
+/**
+ * restore_sem_array - read the state of a semaphore array
+ * @ctx: checkpoint context
+ * @nsems: number of semaphores in the array
+ *
+ * Expect the data in an array of 'struct sem': {32 bit, 32 bit}.
+ * See comment in checkpoint_sem_array().
+ *
+ * The sem-undo information is not restored per ipc_ns, but rather per task.
+ */
+static struct sem *restore_sem_array(struct ckpt_ctx *ctx, int nsems)
+{
+	struct sem *sma;
+	int i, ret;
+
+	sma = kmalloc(nsems * sizeof(*sma), GFP_KERNEL);
+	if (!sma)
+		return ERR_PTR(-ENOMEM);
+	ret = _ckpt_read_buffer(ctx, sma, nsems * sizeof(*sma));
+	if (ret < 0)
+		goto out;
+
+	/* validate sem array contents */
+	for (i = 0; i < nsems; i++) {
+		if (sma[i].semval < 0 || sma[i].sempid < 0) {
+			ret = -EINVAL;
+			break;
+		}
+	}
+ out:
+	if (ret < 0) {
+		kfree(sma);
+		sma = ERR_PTR(ret);
+	}
+	return sma;
+}
+
+int restore_ipc_sem(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
+{
+	struct ckpt_hdr_ipc_sem *h;
+	struct kern_ipc_perm *perms;
+	struct sem_array *sem;
+	struct sem *sma = NULL;
+	struct ipc_ids *sem_ids = &ns->ids[IPC_SEM_IDS];
+	int semflag, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_SEM);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->perms.id < 0)
+		goto out;
+	if (h->sem_nsems < 0)
+		goto out;
+
+	/* read semaphore array state */
+	sma = restore_sem_array(ctx, h->sem_nsems);
+	if (IS_ERR(sma)) {
+		ret = PTR_ERR(sma);
+		goto out;
+	}
+
+	/* restore the semaphore array now */
+	semflag = h->perms.mode | IPC_CREAT | IPC_EXCL;
+	ckpt_debug("sem: do_semget key %d flag %#x id %d\n",
+		 h->perms.key, semflag, h->perms.id);
+	ret = do_semget(ns, h->perms.key, h->sem_nsems, semflag, h->perms.id);
+	ckpt_debug("sem: do_semget ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	down_write(&sem_ids->rw_mutex);
+
+	/* we are the sole owners/users of this ipc_ns, it can't go away */
+	perms = ipc_lock(sem_ids, h->perms.id);
+	BUG_ON(IS_ERR(perms));  /* ipc_ns is private to us */
+
+	sem = container_of(perms, struct sem_array, sem_perm);
+	memcpy(sem->sem_base, sma, sem->sem_nsems * sizeof(*sma));
+
+	ret = load_ipc_sem_hdr(ctx, h, sem);
+	if (ret < 0) {
+		ckpt_debug("sem: need to remove (%d)\n", ret);
+		freeary(ns, perms);
+	} else
+		ipc_unlock(perms);
+	up_write(&sem_ids->rw_mutex);
+ out:
+	kfree(sma);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/sem.c b/ipc/sem.c
index a2b2135..7361041 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -93,7 +93,6 @@
 #define sem_checkid(sma, semid)	ipc_checkid(&sma->sem_perm, semid)
 
 static int newary(struct ipc_namespace *, struct ipc_params *, int);
-static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #endif
@@ -310,14 +309,12 @@ static inline int sem_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-int do_semget(key_t key, int nsems, int semflg, int req_id)
+int do_semget(struct ipc_namespace *ns, key_t key, int nsems,
+	      int semflg, int req_id)
 {
-	struct ipc_namespace *ns;
 	struct ipc_ops sem_ops;
 	struct ipc_params sem_params;
 
-	ns = current->nsproxy->ipc_ns;
-
 	if (nsems < 0 || nsems > ns->sc_semmsl)
 		return -EINVAL;
 
@@ -334,7 +331,7 @@ int do_semget(key_t key, int nsems, int semflg, int req_id)
 
 SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
 {
-	return do_semget(key, nsems, semflg, -1);
+	return do_semget(current->nsproxy->ipc_ns, key, nsems, semflg, -1);
 }
 
 /*
@@ -521,7 +518,7 @@ static void free_un(struct rcu_head *head)
  * as a writer and the spinlock for this semaphore set hold. sem_ids.rw_mutex
  * remains locked on exit.
  */
-static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct sem_undo *un, *tu;
 	struct sem_queue *q, *tq;
diff --git a/ipc/util.h b/ipc/util.h
index 8a223f0..ba080de 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -193,6 +193,11 @@ void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 int do_msgget(struct ipc_namespace *ns, key_t key, int msgflg, int req_id);
 void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
+int do_semget(struct ipc_namespace *ns, key_t key, int nsems, int semflg,
+	      int req_id);
+void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
+
+
 #ifdef CONFIG_CHECKPOINT
 extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 				     struct kern_ipc_perm *perm);
@@ -205,6 +210,9 @@ extern int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
 
 extern int checkpoint_ipc_msg(int id, void *p, void *data);
 extern int restore_ipc_msg(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
+
+extern int checkpoint_ipc_sem(int id, void *p, void *data);
+extern int restore_ipc_sem(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
 #endif
 
 #endif
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 54/80] c/r: support semaphore sysv-ipc
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Checkpoint of sysvipc semaphores is performed by iterating through all
sem objects and dumping the contents of each one. The semaphore array
of each sem is dumped with that object.

The semaphore array (sem->sem_base) holds an array of 'struct sem',
each of which is an {int, int} pair. Because this has the same layout
on 32- and 64-bit architectures, the checkpoint format is simply a
dump of this array as is.
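
The dump format described above can be sketched in user space as
follows. This is only an illustration under stated assumptions, not the
code from the patch: 'struct sem_image', sem_dump() and sem_load() are
hypothetical names, and the validation simply mirrors the check done in
restore_sem_array(). The sketch does not address byte order.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space stand-in for the kernel's 'struct sem'. */
struct sem_image {
	int32_t semval;		/* current semaphore value */
	int32_t sempid;		/* pid of the last operation */
};

/* Build-time check: a raw dump is only stable if there is no padding. */
typedef char sem_image_is_8_bytes[sizeof(struct sem_image) == 8 ? 1 : -1];

/* Checkpoint side: copy nsems entries into buf, return bytes written. */
static size_t sem_dump(void *buf, const struct sem_image *base, size_t nsems)
{
	size_t len = nsems * sizeof(*base);

	assert(buf && base);
	memcpy(buf, base, len);
	return len;
}

/*
 * Restart side: read nsems entries from buf and validate them.
 * Returns a malloc'ed array, or NULL with *err set to a negative errno.
 */
static struct sem_image *sem_load(const void *buf, size_t nsems, int *err)
{
	struct sem_image *sma;
	size_t i;

	*err = 0;
	/* calloc checks the multiplication; never ask for zero bytes */
	sma = calloc(nsems ? nsems : 1, sizeof(*sma));
	if (!sma) {
		*err = -ENOMEM;
		return NULL;
	}
	memcpy(sma, buf, nsems * sizeof(*sma));

	/* same sanity rule as the patch: no negative values or pids */
	for (i = 0; i < nsems; i++) {
		if (sma[i].semval < 0 || sma[i].sempid < 0) {
			free(sma);
			*err = -EINVAL;
			return NULL;
		}
	}
	return sma;
}
```

Validating the whole array before any object is created keeps the
error path simple: nothing has to be torn down if the image is bad.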

TODO: this patch does not handle semaphore-undo -- this data should be
saved per-task while iterating through the tasks.

Changelog[v18]:
  - Handle kmalloc failure in restore_sem_array()
Changelog[v17]:
  - Restore objects in the right namespace
  - Forward declare struct msg_msg (instead of include linux/msg.h)
  - Fix typo in comment
  - Don't unlock ipc before calling freeary in error path

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/checkpoint_hdr.h |    8 ++
 ipc/Makefile                   |    2 +-
 ipc/checkpoint.c               |    4 -
 ipc/checkpoint_sem.c           |  221 ++++++++++++++++++++++++++++++++++++++++
 ipc/sem.c                      |   11 +--
 ipc/util.h                     |    8 ++
 6 files changed, 242 insertions(+), 12 deletions(-)
 create mode 100644 ipc/checkpoint_sem.c

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 93b6aed..cb7dfc8 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -412,6 +412,14 @@ struct ckpt_hdr_ipc_msg_msg {
 	__u32 m_ts;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_sem {
+	struct ckpt_hdr h;
+	struct ckpt_hdr_ipc_perms perms;
+	__u64 sem_otime;
+	__u64 sem_ctime;
+	__u32 sem_nsems;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/ipc/Makefile b/ipc/Makefile
index 71a257f..3ecba9e 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -10,4 +10,4 @@ obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
 obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o \
-		checkpoint_shm.o checkpoint_msg.o
+		checkpoint_shm.o checkpoint_msg.o checkpoint_sem.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 588ed37..8e6e9ba 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -119,12 +119,10 @@ static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
 				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
 				 CKPT_HDR_IPC_SEM, checkpoint_ipc_sem);
-#endif
 	return ret;
 }
 
@@ -309,7 +307,6 @@ static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
 
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
@@ -318,7 +315,6 @@ static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
 			      CKPT_HDR_IPC_SEM, restore_ipc_sem);
-#endif
 	if (ret < 0)
 		goto out;
 
diff --git a/ipc/checkpoint_sem.c b/ipc/checkpoint_sem.c
new file mode 100644
index 0000000..76eb2b9
--- /dev/null
+++ b/ipc/checkpoint_sem.c
@@ -0,0 +1,221 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc sem
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/sem.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+
+struct msg_msg;
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_sem_hdr(struct ckpt_ctx *ctx,
+			       struct ckpt_hdr_ipc_sem *h,
+			       struct sem_array *sem)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&sem->sem_perm);
+
+	ret = checkpoint_fill_ipc_perms(&h->perms, &sem->sem_perm);
+	if (ret < 0)
+		goto unlock;
+
+	h->sem_otime = sem->sem_otime;
+	h->sem_ctime = sem->sem_ctime;
+	h->sem_nsems = sem->sem_nsems;
+
+ unlock:
+	ipc_unlock(&sem->sem_perm);
+	ckpt_debug("sem: nsems %u\n", h->sem_nsems);
+
+	return ret;
+}
+
+/**
+ * checkpoint_sem_array - dump the state of a semaphore array
+ * @ctx: checkpoint context
+ * @sem: semaphore array
+ *
+ * The state of a semaphore set is an array of 'struct sem'. This structure
+ * is {int, int}, which translates to the same format {32 bits, 32 bits}
+ * on both 32- and 64-bit architectures. So we simply dump the array.
+ *
+ * The sem-undo information is not saved per ipc_ns, but rather per task.
+ */
+static int checkpoint_sem_array(struct ckpt_ctx *ctx, struct sem_array *sem)
+{
+	/* this is a "best-effort" test, so lock not needed */
+	if (!list_empty(&sem->sem_pending))
+		return -EBUSY;
+
+	/* our caller holds the mutex, so this is safe */
+	return ckpt_write_buffer(ctx, sem->sem_base,
+			       sem->sem_nsems * sizeof(*sem->sem_base));
+}
+
+int checkpoint_ipc_sem(int id, void *p, void *data)
+{
+	struct ckpt_hdr_ipc_sem *h;
+	struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+	struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+	struct sem_array *sem;
+	int ret;
+
+	sem = container_of(perm, struct sem_array, sem_perm);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_SEM);
+	if (!h)
+		return -ENOMEM;
+
+	ret = fill_ipc_sem_hdr(ctx, h, sem);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	if (h->sem_nsems)
+		ret = checkpoint_sem_array(ctx, sem);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/************************************************************************
+ * ipc restart
+ */
+
+static int load_ipc_sem_hdr(struct ckpt_ctx *ctx,
+			       struct ckpt_hdr_ipc_sem *h,
+			       struct sem_array *sem)
+{
+	int ret = 0;
+
+	ret = restore_load_ipc_perms(&h->perms, &sem->sem_perm);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("sem: nsems %u\n", h->sem_nsems);
+
+	sem->sem_otime = h->sem_otime;
+	sem->sem_ctime = h->sem_ctime;
+	sem->sem_nsems = h->sem_nsems;
+
+	return 0;
+}
+
+/**
+ * restore_sem_array - read the state of a semaphore array
+ * @ctx: checkpoint context
+ * @nsems: number of semaphores in the array
+ *
+ * Expect the data in an array of 'struct sem': {32 bits, 32 bits}.
+ * See the comment in checkpoint_sem_array().
+ *
+ * The sem-undo information is not restored per ipc_ns, but rather per task.
+ */
+static struct sem *restore_sem_array(struct ckpt_ctx *ctx, int nsems)
+{
+	struct sem *sma;
+	int i, ret;
+
+	sma = kmalloc(nsems * sizeof(*sma), GFP_KERNEL);
+	if (!sma)
+		return ERR_PTR(-ENOMEM);
+	ret = _ckpt_read_buffer(ctx, sma, nsems * sizeof(*sma));
+	if (ret < 0)
+		goto out;
+
+	/* validate sem array contents */
+	for (i = 0; i < nsems; i++) {
+		if (sma[i].semval < 0 || sma[i].sempid < 0) {
+			ret = -EINVAL;
+			break;
+		}
+	}
+ out:
+	if (ret < 0) {
+		kfree(sma);
+		sma = ERR_PTR(ret);
+	}
+	return sma;
+}
+
+int restore_ipc_sem(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
+{
+	struct ckpt_hdr_ipc_sem *h;
+	struct kern_ipc_perm *perms;
+	struct sem_array *sem;
+	struct sem *sma = NULL;
+	struct ipc_ids *sem_ids = &ns->ids[IPC_SEM_IDS];
+	int semflag, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_SEM);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = -EINVAL;
+	if (h->perms.id < 0)
+		goto out;
+	if (h->sem_nsems < 0)
+		goto out;
+
+	/* read semaphore array state */
+	sma = restore_sem_array(ctx, h->sem_nsems);
+	if (IS_ERR(sma)) {
+		ret = PTR_ERR(sma);
+		goto out;
+	}
+
+	/* create the semaphore set now */
+	semflag = h->perms.mode | IPC_CREAT | IPC_EXCL;
+	ckpt_debug("sem: do_semget key %d flag %#x id %d\n",
+		 h->perms.key, semflag, h->perms.id);
+	ret = do_semget(ns, h->perms.key, h->sem_nsems, semflag, h->perms.id);
+	ckpt_debug("sem: do_semget ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	down_write(&sem_ids->rw_mutex);
+
+	/* we are the sole owners/users of this ipc_ns, it can't go away */
+	perms = ipc_lock(sem_ids, h->perms.id);
+	BUG_ON(IS_ERR(perms));  /* ipc_ns is private to us */
+
+	sem = container_of(perms, struct sem_array, sem_perm);
+	memcpy(sem->sem_base, sma, sem->sem_nsems * sizeof(*sma));
+
+	ret = load_ipc_sem_hdr(ctx, h, sem);
+	if (ret < 0) {
+		ckpt_debug("sem: need to remove (%d)\n", ret);
+		freeary(ns, perms);
+	} else
+		ipc_unlock(perms);
+	up_write(&sem_ids->rw_mutex);
+ out:
+	kfree(sma);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
diff --git a/ipc/sem.c b/ipc/sem.c
index a2b2135..7361041 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -93,7 +93,6 @@
 #define sem_checkid(sma, semid)	ipc_checkid(&sma->sem_perm, semid)
 
 static int newary(struct ipc_namespace *, struct ipc_params *, int);
-static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #endif
@@ -310,14 +309,12 @@ static inline int sem_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-int do_semget(key_t key, int nsems, int semflg, int req_id)
+int do_semget(struct ipc_namespace *ns, key_t key, int nsems,
+	      int semflg, int req_id)
 {
-	struct ipc_namespace *ns;
 	struct ipc_ops sem_ops;
 	struct ipc_params sem_params;
 
-	ns = current->nsproxy->ipc_ns;
-
 	if (nsems < 0 || nsems > ns->sc_semmsl)
 		return -EINVAL;
 
@@ -334,7 +331,7 @@ int do_semget(key_t key, int nsems, int semflg, int req_id)
 
 SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
 {
-	return do_semget(key, nsems, semflg, -1);
+	return do_semget(current->nsproxy->ipc_ns, key, nsems, semflg, -1);
 }
 
 /*
@@ -521,7 +518,7 @@ static void free_un(struct rcu_head *head)
  * as a writer and the spinlock for this semaphore set hold. sem_ids.rw_mutex
  * remains locked on exit.
  */
-static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct sem_undo *un, *tu;
 	struct sem_queue *q, *tq;
diff --git a/ipc/util.h b/ipc/util.h
index 8a223f0..ba080de 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -193,6 +193,11 @@ void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 int do_msgget(struct ipc_namespace *ns, key_t key, int msgflg, int req_id);
 void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
+int do_semget(struct ipc_namespace *ns, key_t key, int nsems, int semflg,
+	      int req_id);
+void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
+
+
 #ifdef CONFIG_CHECKPOINT
 extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 				     struct kern_ipc_perm *perm);
@@ -205,6 +210,9 @@ extern int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
 
 extern int checkpoint_ipc_msg(int id, void *p, void *data);
 extern int restore_ipc_msg(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
+
+extern int checkpoint_ipc_sem(int id, void *p, void *data);
+extern int restore_ipc_sem(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
 #endif
 
 #endif
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 55/80] c/r: (s390): expose a constant for the number of words (CRs)
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (53 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 54/80] c/r: support semaphore sysv-ipc Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 56/80] c/r: add CKPT_COPY() macro Oren Laadan
                     ` (26 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Dan Smith, Pavel Emelyanov

We need to use this value in the checkpoint/restart code and would like to
have a constant instead of a magic '3'.

Changelog:
    Mar 30:
            . Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
    Mar 03:
            . Picked up additional use of magic '3' in ptrace.h

Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 arch/s390/Kconfig |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2ae5d72..6f143ab 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -49,6 +49,10 @@ config GENERIC_TIME_VSYSCALL
 config GENERIC_CLOCKEVENTS
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if 64BIT
+
 config GENERIC_BUG
 	bool
 	depends on BUG
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 55/80] c/r: (s390): expose a constant for the number of words (CRs)
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Dan Smith

We need to use this value in the checkpoint/restart code and would like to
have a constant instead of a magic '3'.

Changelog:
    Mar 30:
            . Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
    Mar 03:
            . Picked up additional use of magic '3' in ptrace.h

Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 arch/s390/Kconfig |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2ae5d72..6f143ab 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -49,6 +49,10 @@ config GENERIC_TIME_VSYSCALL
 config GENERIC_CLOCKEVENTS
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if 64BIT
+
 config GENERIC_BUG
 	bool
 	depends on BUG
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 56/80] c/r: add CKPT_COPY() macro
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (54 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 55/80] c/r: (s390): expose a constant for the number of words (CRs) Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 57/80] c/r: define s390-specific checkpoint-restart code Oren Laadan
                     ` (25 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Dan Smith, Pavel Emelyanov

From: Dan Smith <danms@us.ibm.com>

As suggested by Dave[1], CKPT_COPY() lets a single statement serve both
copy-out (checkpoint) and copy-in (restart), which keeps the two paths
symmetric.  CKPT_COPY_ARRAY() does the same thing for arrays.  It's not
critical, but it helps us unify the checkpoint and restart paths in places.
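
A minimal user-space sketch of the idea, assuming simplified macros and
made-up structures: 'struct cpu_live', 'struct cpu_saved' and cpu_copy()
are hypothetical, and the kernel's __must_be_array()/BUILD_BUG_ON()
checks are replaced by a plain sizeof trick. One function then serves
both directions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CKPT_CPT 1	/* checkpoint: live state -> saved image */
#define CKPT_RST 2	/* restart: saved image -> live state */

/* Simplified variant of the macro in the patch (arguments parenthesized). */
#define CKPT_COPY(op, SAVE, LIVE)			\
	do {						\
		if ((op) == CKPT_CPT)			\
			(SAVE) = (LIVE);		\
		else					\
			(LIVE) = (SAVE);		\
	} while (0)

/* Array variant; the sizeof trick stands in for BUILD_BUG_ON(). */
#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count)				\
	do {								\
		(void)sizeof(char[sizeof(*(SAVE)) == sizeof(*(LIVE))	\
				  ? 1 : -1]);				\
		if ((op) == CKPT_CPT)					\
			memcpy((SAVE), (LIVE),				\
			       (count) * sizeof(*(SAVE)));		\
		else							\
			memcpy((LIVE), (SAVE),				\
			       (count) * sizeof(*(SAVE)));		\
	} while (0)

/* Hypothetical live register state and its checkpoint image. */
struct cpu_live {
	uint64_t pc;
	uint64_t gprs[4];
};

struct cpu_saved {
	uint64_t pc;
	uint64_t gprs[4];
};

/* One body handles both checkpoint and restart. */
static void cpu_copy(int op, struct cpu_saved *h, struct cpu_live *regs)
{
	assert(op == CKPT_CPT || op == CKPT_RST);
	CKPT_COPY(op, h->pc, regs->pc);
	CKPT_COPY_ARRAY(op, h->gprs, regs->gprs, 4);
}
```

Because each field is listed once, a field added for checkpoint cannot
be forgotten on restart.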

Changelog:
    Mar 04:
            . Removed semicolons
            . Added build-time check for __must_be_array in CKPT_COPY_ARRAY
    Feb 27:
            . Changed CKPT_COPY() to use assignment, eliminating the need
              for the CKPT_COPY_BIT() macro
            . Add CKPT_COPY_ARRAY() macro to help copying register arrays,
              etc
            . Move the macro definitions inside the CR #ifdef
    Feb 25:
            . Changed WARN_ON() to BUILD_BUG_ON()

Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>

1: https://lists.linux-foundation.org/pipermail/containers/2009-February/015821.html (all the way at the bottom)
---
 include/linux/checkpoint.h |   29 +++++++++++++++++++++++++++++
 1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 4c1c13e..561232d 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -238,6 +238,34 @@ static inline int ckpt_validate_errno(int errno)
 	return (errno >= 0) && (errno < MAX_ERRNO);
 }
 
+/* useful macros to copy fields and buffers to/from ckpt_hdr_xxx structures */
+#define CKPT_CPT 1
+#define CKPT_RST 2
+
+#define CKPT_COPY(op, SAVE, LIVE)				        \
+	do {							\
+		if (op == CKPT_CPT)				\
+			SAVE = LIVE;				\
+		else						\
+			LIVE = SAVE;				\
+	} while (0)
+
+/*
+ * Copy @count items from @LIVE to @SAVE if op is CKPT_CPT (otherwise,
+ * copy in the reverse direction)
+ */
+#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count)				\
+	do {								\
+		(void)__must_be_array(SAVE);				\
+		(void)__must_be_array(LIVE);				\
+		BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));		\
+		if (op == CKPT_CPT)					\
+			memcpy(SAVE, LIVE, count * sizeof(*SAVE));	\
+		else							\
+			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
+	} while (0)
+
+
 /* debugging flags */
 #define CKPT_DBASE	0x1		/* anything */
 #define CKPT_DSYS	0x2		/* generic (system) */
@@ -270,6 +298,7 @@ extern unsigned long ckpt_debug_level;
  * CKPT_DBASE is the base flags, doesn't change
  * CKPT_DFLAG is to be redfined in each source file
  */
+
 #define ckpt_debug(fmt, args...)  \
 	_ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 56/80] c/r: add CKPT_COPY() macro
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Dan Smith,
	Oren Laadan

From: Dan Smith <danms@us.ibm.com>

As suggested by Dave[1], this provides us a way to make the copy-in and
copy-out processes symmetric.  CKPT_COPY_ARRAY() provides us a way to do
the same thing but for arrays.  It's not critical, but it helps us unify
the checkpoint and restart paths for some things.
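
To illustrate the symmetry described above, here is a minimal userspace
sketch of how one routine can serve both directions.  The demo_hdr and
demo_live structs and demo_copy() are hypothetical stand-ins, not part of
this patch, and the kernel-only __must_be_array() and BUILD_BUG_ON() checks
are left out so the example builds outside the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CKPT_CPT 1
#define CKPT_RST 2

/* Same shape as the macro in the patch: plain assignment, either direction. */
#define CKPT_COPY(op, SAVE, LIVE)		\
	do {					\
		if ((op) == CKPT_CPT)		\
			SAVE = LIVE;		\
		else				\
			LIVE = SAVE;		\
	} while (0)

/* Array variant; the kernel's build-time array checks are omitted here. */
#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count)				\
	do {								\
		if ((op) == CKPT_CPT)					\
			memcpy(SAVE, LIVE, (count) * sizeof(*(SAVE)));	\
		else							\
			memcpy(LIVE, SAVE, (count) * sizeof(*(SAVE)));	\
	} while (0)

/* Hypothetical stand-ins for a ckpt_hdr_xxx record and the live state. */
struct demo_hdr  { uint64_t pc; uint64_t regs[4]; };
struct demo_live { uint64_t pc; uint64_t regs[4]; };

/* One routine covers checkpoint and restart; only 'op' differs. */
void demo_copy(int op, struct demo_hdr *h, struct demo_live *l)
{
	assert(op == CKPT_CPT || op == CKPT_RST);
	CKPT_COPY(op, h->pc, l->pc);
	CKPT_COPY_ARRAY(op, h->regs, l->regs, 4);
}
```

Calling demo_copy(CKPT_CPT, ...) fills the record from the live state, and
demo_copy(CKPT_RST, ...) with the same record writes it back, so the field
list only has to be maintained in one place.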

Changelog:
    Mar 04:
            . Removed semicolons
            . Added build-time check for __must_be_array in CKPT_COPY_ARRAY
    Feb 27:
            . Changed CKPT_COPY() to use assignment, eliminating the need
              for the CKPT_COPY_BIT() macro
            . Add CKPT_COPY_ARRAY() macro to help copying register arrays,
              etc
            . Move the macro definitions inside the CR #ifdef
    Feb 25:
            . Changed WARN_ON() to BUILD_BUG_ON()

Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>

1: https://lists.linux-foundation.org/pipermail/containers/2009-February/015821.html (all the way at the bottom)
---
 include/linux/checkpoint.h |   29 +++++++++++++++++++++++++++++
 1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 4c1c13e..561232d 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -238,6 +238,34 @@ static inline int ckpt_validate_errno(int errno)
 	return (errno >= 0) && (errno < MAX_ERRNO);
 }
 
+/* useful macros to copy fields and buffers to/from ckpt_hdr_xxx structures */
+#define CKPT_CPT 1
+#define CKPT_RST 2
+
+#define CKPT_COPY(op, SAVE, LIVE)				        \
+	do {							\
+		if (op == CKPT_CPT)				\
+			SAVE = LIVE;				\
+		else						\
+			LIVE = SAVE;				\
+	} while (0)
+
+/*
+ * Copy @count items from @LIVE to @SAVE if op is CKPT_CPT (otherwise,
+ * copy in the reverse direction)
+ */
+#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count)				\
+	do {								\
+		(void)__must_be_array(SAVE);				\
+		(void)__must_be_array(LIVE);				\
+		BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));		\
+		if (op == CKPT_CPT)					\
+			memcpy(SAVE, LIVE, count * sizeof(*SAVE));	\
+		else							\
+			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
+	} while (0)
+
+
 /* debugging flags */
 #define CKPT_DBASE	0x1		/* anything */
 #define CKPT_DSYS	0x2		/* generic (system) */
@@ -270,6 +298,7 @@ extern unsigned long ckpt_debug_level;
  * CKPT_DBASE is the base flags, doesn't change
  * CKPT_DFLAG is to be redfined in each source file
  */
+
 #define ckpt_debug(fmt, args...)  \
 	_ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 57/80] c/r: define s390-specific checkpoint-restart code
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (55 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 56/80] c/r: add CKPT_COPY() macro Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 58/80] c/r: clone_with_pids: define the s390 syscall Oren Laadan
                     ` (24 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Dan Smith, Pavel Emelyanov

From: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Implement the s390 arch-specific checkpoint/restart helpers.  This
is on top of Oren Laadan's c/r code.

With these, I am able to checkpoint and restart simple programs as per
Oren's patch intro.  While on x86 I never had to freeze a single task
to checkpoint it, on s390 I do need to.  That is a prereq for consistent
snapshots (esp with multiple processes) anyway so I don't see that as
a problem.

Changelog:
    Jun 15:
            . Fix checkpoint and restart compat wrappers
    May 28:
            . Export asm/checkpoint_hdr.h to userspace
            . Define CKPT_ARCH_ID for S390
    Apr 11:
            . Introduce ckpt_arch_vdso()
    Feb 27:
            . Add checkpoint_s390.h
            . Fixed up save and restore of PSW, with the non-address bits
              properly masked out
    Feb 25:
            . Make checkpoint_hdr.h safe for inclusion in userspace
            . Replace comment about vdso code
            . Add comment about restoring access registers
            . Write and read an empty ckpt_hdr_header_arch record to appease
              code (mktree) that expects it to be there
            . Utilize NUM_CR_WORDS in checkpoint_hdr.h
    Feb 24:
            . Use CKPT_COPY() to unify the un/loading of cpu and mm state
            . Fix fprs definition in ckpt_hdr_cpu
            . Remove debug WARN_ON() from checkpoint.c
    Feb 23:
            . Macro-ize the un/packing of trace flags
            . Fix the crash when externally-linked
            . Break out the restart functions into restart.c
            . Remove unneeded s390_enable_sie() call
    Jan 30:
            . Switched types in ckpt_hdr_cpu to __u64 etc.
              (per Oren's suggestion)
            . Replaced direct inclusion of structs in
              ckpt_hdr_cpu with the struct members
              (per Oren's suggestion)
            . Also ended up adding a bunch of new things
              into restart (mm_segment, ksp, etc.) in a vain
              attempt to get code using the FPU to not segfault
              after restart.
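
The PSW fix-up noted under Feb 27 comes down to taking only selected bits
from the checkpoint image and keeping the rest from the live task.  A minimal
userspace sketch of that idea follows.  The DEMO_* mask values and the demo_*
names are assumptions made for illustration (the real masks are defined in
<asm/ptrace.h>), and this is not the code in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed, illustrative masks; the real definitions are in <asm/ptrace.h>. */
#define DEMO_PSW_MASK_USER	0x00003F0000000000ULL	/* bits a task may set */
#define DEMO_PSW_ADDR_INSN	0xFFFFFFFFFFFFFFFFULL	/* instruction address bits */

struct demo_psw {
	uint64_t mask;
	uint64_t addr;
};

/* Take the bits selected by 'sel' from 'saved'; keep all other bits of 'live'. */
uint64_t demo_merge_masked(uint64_t live, uint64_t saved, uint64_t sel)
{
	return (live & ~sel) | (saved & sel);
}

/* Restore a PSW from the image without overwriting bits outside the masks. */
void demo_restore_psw(struct demo_psw *live, const struct demo_psw *saved)
{
	assert(live != NULL && saved != NULL);
	live->mask = demo_merge_masked(live->mask, saved->mask, DEMO_PSW_MASK_USER);
	live->addr = demo_merge_masked(live->addr, saved->addr, DEMO_PSW_ADDR_INSN);
}
```

Masking on the restart side means a tampered image can change only the bits
that the selection mask allows, whatever value was stored at checkpoint time.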

Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/include/asm/Kbuild           |    1 +
 arch/s390/include/asm/checkpoint_hdr.h |   89 +++++++++++++++
 arch/s390/include/asm/unistd.h         |    4 +-
 arch/s390/kernel/compat_wrapper.S      |   14 +++
 arch/s390/kernel/syscalls.S            |    2 +
 arch/s390/mm/Makefile                  |    1 +
 arch/s390/mm/checkpoint.c              |  194 ++++++++++++++++++++++++++++++++
 arch/s390/mm/checkpoint_s390.h         |   23 ++++
 include/linux/checkpoint_hdr.h         |    2 +
 9 files changed, 329 insertions(+), 1 deletions(-)
 create mode 100644 arch/s390/include/asm/checkpoint_hdr.h
 create mode 100644 arch/s390/mm/checkpoint.c
 create mode 100644 arch/s390/mm/checkpoint_s390.h

diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 63a2341..3282a6e 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -8,6 +8,7 @@ header-y += ucontext.h
 header-y += vtoc.h
 header-y += zcrypt.h
 header-y += chsc.h
+header-y += checkpoint_hdr.h
 
 unifdef-y += cmb.h
 unifdef-y += debug.h
diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..ad9449e
--- /dev/null
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -0,0 +1,89 @@
+#ifndef __ASM_S390_CKPT_HDR_H
+#define __ASM_S390_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers s/390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#error asm/checkpoint_hdr.h included directly
+#endif
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+#ifdef CONFIG_64BIT
+#define CKPT_ARCH_ID	CKPT_ARCH_S390X
+/* else - if we ever support 32bit - CKPT_ARCH_S390 */
+#endif
+
+/*
+ * Notes
+ * NUM_GPRS defined in <asm/ptrace.h> to be 16
+ * NUM_FPRS defined in <asm/ptrace.h> to be 16
+ * NUM_ACRS defined in <asm/ptrace.h> to be 16
+ * NUM_CR_WORDS defined in <asm/ptrace.h> to be 3
+ */
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	__u64 args[1];
+	__u64 gprs[NUM_GPRS];
+	__u64 orig_gpr2;
+	__u16 svcnr;
+	__u16 ilc;
+	__u32 acrs[NUM_ACRS];
+	__u64 ieee_instruction_pointer;
+
+	/* psw_t */
+	__u64 psw_t_mask;
+	__u64 psw_t_addr;
+
+	/* s390_fp_regs_t */
+	__u32 fpc;
+	union {
+		float f;
+		double d;
+		__u64 ui;
+		struct {
+			__u32 fp_hi;
+			__u32 fp_lo;
+		} fp;
+	} fprs[NUM_FPRS];
+
+	/* per_struct */
+	__u64 per_control_regs[NUM_CR_WORDS];
+	__u64 starting_addr;
+	__u64 ending_addr;
+	__u64 address;
+	__u16 perc_atmid;
+	__u8 access_id;
+	__u8 single_step;
+	__u8 instruction_fetch;
+};
+
+struct ckpt_hdr_mm_context {
+	struct ckpt_hdr h;
+	unsigned long vdso_base;
+	int noexec;
+	int has_pgste;
+	int alloc_pgste;
+	unsigned long asce_bits;
+	unsigned long asce_limit;
+};
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+};
+
+#endif /* __ASM_S390_CKPT_HDR_H */
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index c80602d..5d1678a 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,9 @@
 #define	__NR_pwritev		329
 #define __NR_rt_tgsigqueueinfo	330
 #define __NR_perf_counter_open	331
-#define NR_syscalls 332
+#define __NR_checkpoint		332
+#define __NR_restart		333
+#define NR_syscalls 334
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S
index 88a8336..e882f99 100644
--- a/arch/s390/kernel/compat_wrapper.S
+++ b/arch/s390/kernel/compat_wrapper.S
@@ -1840,3 +1840,17 @@ sys_perf_counter_open_wrapper:
 	lgfr	%r5,%r5			# int
 	llgfr	%r6,%r6			# unsigned long
 	jg	sys_perf_counter_open	# branch to system call
+
+	.globl sys_checkpoint_wrapper
+sys_checkpoint_wrapper:
+	lgfr	%r2,%r2			# pid_t
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
+	jg	compat_sys_checkpoint
+
+	.globl sys_restore_wrapper
+sys_restore_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
+	jg	compat_sys_restore
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index ad1acd2..67518e2 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -340,3 +340,5 @@ SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper)
 SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper)
 SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */
 SYSCALL(sys_perf_counter_open,sys_perf_counter_open,sys_perf_counter_open_wrapper)
+SYSCALL(sys_checkpoint,sys_checkpoint,sys_checkpoint_wrapper)
+SYSCALL(sys_restart,sys_restart,sys_restore_wrapper)
diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
index db05661..e3d356d 100644
--- a/arch/s390/mm/Makefile
+++ b/arch/s390/mm/Makefile
@@ -6,3 +6,4 @@ obj-y	 := init.o fault.o extmem.o mmap.o vmem.o pgtable.o maccess.o
 obj-$(CONFIG_CMM) += cmm.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_PAGE_STATES) += page-states.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o
diff --git a/arch/s390/mm/checkpoint.c b/arch/s390/mm/checkpoint.c
new file mode 100644
index 0000000..40dd417
--- /dev/null
+++ b/arch/s390/mm/checkpoint.c
@@ -0,0 +1,194 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/elf.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static void s390_copy_regs(int op, struct ckpt_hdr_cpu *h,
+			   struct task_struct *t)
+{
+	struct pt_regs *regs = task_pt_regs(t);
+	struct thread_struct *thr = &t->thread;
+
+	/* Save the whole PSW to facilitate forensic debugging, but only
+	 * restore the address portion to avoid letting userspace do
+	 * bad things by manipulating its value.
+	 */
+	if (op == CKPT_CPT) {
+		CKPT_COPY(op, h->psw_t_addr, regs->psw.addr);
+	} else {
+		regs->psw.addr &= ~PSW_ADDR_INSN;
+		regs->psw.addr |= h->psw_t_addr;
+	}
+
+	CKPT_COPY(op, h->args[0], regs->args[0]);
+	CKPT_COPY(op, h->orig_gpr2, regs->orig_gpr2);
+	CKPT_COPY(op, h->svcnr, regs->svcnr);
+	CKPT_COPY(op, h->ilc, regs->ilc);
+	CKPT_COPY(op, h->ieee_instruction_pointer,
+		thr->ieee_instruction_pointer);
+	CKPT_COPY(op, h->psw_t_mask, regs->psw.mask);
+	CKPT_COPY(op, h->fpc, thr->fp_regs.fpc);
+	CKPT_COPY(op, h->starting_addr, thr->per_info.starting_addr);
+	CKPT_COPY(op, h->ending_addr, thr->per_info.ending_addr);
+	CKPT_COPY(op, h->address, thr->per_info.lowcore.words.address);
+	CKPT_COPY(op, h->perc_atmid, thr->per_info.lowcore.words.perc_atmid);
+	CKPT_COPY(op, h->access_id, thr->per_info.lowcore.words.access_id);
+	CKPT_COPY(op, h->single_step, thr->per_info.single_step);
+	CKPT_COPY(op, h->instruction_fetch, thr->per_info.instruction_fetch);
+
+	CKPT_COPY_ARRAY(op, h->gprs, regs->gprs, NUM_GPRS);
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into gprs[2], so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (op == CKPT_CPT && t == current) {
+		BUG_ON((long) h->gprs[2] < 0);
+		h->gprs[2] = 0;
+	}
+	CKPT_COPY_ARRAY(op, h->fprs, thr->fp_regs.fprs, NUM_FPRS);
+	CKPT_COPY_ARRAY(op, h->acrs, thr->acrs, NUM_ACRS);
+	CKPT_COPY_ARRAY(op, h->per_control_regs,
+		      thr->per_info.control_regs.words.cr, NUM_CR_WORDS);
+}
+
+static void s390_mm(int op, struct ckpt_hdr_mm_context *h,
+		    struct mm_struct *mm)
+{
+	CKPT_COPY(op, h->noexec, mm->context.noexec);
+	CKPT_COPY(op, h->has_pgste, mm->context.has_pgste);
+	CKPT_COPY(op, h->alloc_pgste, mm->context.alloc_pgste);
+	CKPT_COPY(op, h->asce_bits, mm->context.asce_bits);
+	CKPT_COPY(op, h->asce_limit, mm->context.asce_limit);
+}
+
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (!h)
+		return -ENOMEM;
+
+	s390_copy_regs(CKPT_CPT, h, t);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/* Write an empty header since it is assumed to be there */
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (!h)
+		return -ENOMEM;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (!h)
+		return -ENOMEM;
+
+	s390_mm(CKPT_CPT, h, mm);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	s390_copy_regs(CKPT_RST, h, current);
+
+	/* s390 does not restore the access registers after a syscall,
+	 * but does on a task switch.  Since we're switching tasks (in
+	 * a way), we need to replicate that behavior here.
+	 */
+	restore_access_regs(h->acrs);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	s390_mm(CKPT_RST, h, mm);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
diff --git a/arch/s390/mm/checkpoint_s390.h b/arch/s390/mm/checkpoint_s390.h
new file mode 100644
index 0000000..c3bf24d
--- /dev/null
+++ b/arch/s390/mm/checkpoint_s390.h
@@ -0,0 +1,23 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _S390_CHECKPOINT_H
+#define _S390_CHECKPOINT_H
+
+#include <linux/checkpoint_hdr.h>
+#include <linux/sched.h>
+#include <linux/mm_types.h>
+
+extern void checkpoint_s390_regs(int op, struct ckpt_hdr_cpu *h,
+				 struct task_struct *t);
+extern void checkpoint_s390_mm(int op, struct ckpt_hdr_mm_context *h,
+			       struct mm_struct *mm);
+
+#endif /* _S390_CHECKPOINT_H */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cb7dfc8..ef260a0 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -88,7 +88,9 @@ enum {
 
 /* architecture */
 enum {
+	/* do not change order (will break ABI) */
 	CKPT_ARCH_X86_32 = 1,
+	CKPT_ARCH_S390X,
 };
 
 /* shared objrects (objref) */
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 57/80] c/r: define s390-specific checkpoint-restart code
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Dan Smith

From: Dan Smith <danms@us.ibm.com>

Implement the s390 arch-specific checkpoint/restart helpers.  This
is on top of Oren Laadan's c/r code.

With these, I am able to checkpoint and restart simple programs as per
Oren's patch intro.  While on x86 I never had to freeze a single task
to checkpoint it, on s390 I do need to.  That is a prereq for consistent
snapshots (esp with multiple processes) anyway so I don't see that as
a problem.

Changelog:
    Jun 15:
            . Fix checkpoint and restart compat wrappers
    May 28:
            . Export asm/checkpoint_hdr.h to userspace
            . Define CKPT_ARCH_ID for S390
    Apr 11:
            . Introduce ckpt_arch_vdso()
    Feb 27:
            . Add checkpoint_s390.h
            . Fixed up save and restore of PSW, with the non-address bits
              properly masked out
    Feb 25:
            . Make checkpoint_hdr.h safe for inclusion in userspace
            . Replace comment about vdso code
            . Add comment about restoring access registers
            . Write and read an empty ckpt_hdr_header_arch record to appease
              code (mktree) that expects it to be there
            . Utilize NUM_CR_WORDS in checkpoint_hdr.h
    Feb 24:
            . Use CKPT_COPY() to unify the un/loading of cpu and mm state
            . Fix fprs definition in ckpt_hdr_cpu
            . Remove debug WARN_ON() from checkpoint.c
    Feb 23:
            . Macro-ize the un/packing of trace flags
            . Fix the crash when externally-linked
            . Break out the restart functions into restart.c
            . Remove unneeded s390_enable_sie() call
    Jan 30:
            . Switched types in ckpt_hdr_cpu to __u64 etc.
              (per Oren's suggestion)
            . Replaced direct inclusion of structs in
              ckpt_hdr_cpu with the struct members
              (per Oren's suggestion)
            . Also ended up adding a bunch of new things
              into restart (mm_segment, ksp, etc.) in a vain
              attempt to get code using the FPU to not segfault
              after restart.
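
The self-checkpoint case handled in s390_copy_regs() (a task checkpointing
itself is still inside the syscall when its registers are saved) can be
sketched in userspace as follows.  On s390 the syscall return value travels
in gpr 2.  The demo_* names are hypothetical, and the signed cast is there
because a "less than zero" test on an unsigned 64-bit field would otherwise
never be true.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_GPRS	16
#define DEMO_RET_REG	2	/* s390 returns syscall results in gpr 2 */

/*
 * Record 0 as the saved return value so that the restarted task sees the
 * checkpoint syscall as having succeeded.  Returns -1, leaving the registers
 * untouched, if the saved value already looks like a negative errno.
 */
int demo_fixup_self_checkpoint(uint64_t gprs[DEMO_NUM_GPRS])
{
	assert(gprs != NULL);
	if ((int64_t) gprs[DEMO_RET_REG] < 0)
		return -1;
	gprs[DEMO_RET_REG] = 0;
	return 0;
}
```

Without the substitution, the image would hold whatever happened to be in the
return register mid-syscall, and the restarted task could misread it as the
result of its own checkpoint call.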

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 arch/s390/include/asm/Kbuild           |    1 +
 arch/s390/include/asm/checkpoint_hdr.h |   89 +++++++++++++++
 arch/s390/include/asm/unistd.h         |    4 +-
 arch/s390/kernel/compat_wrapper.S      |   14 +++
 arch/s390/kernel/syscalls.S            |    2 +
 arch/s390/mm/Makefile                  |    1 +
 arch/s390/mm/checkpoint.c              |  194 ++++++++++++++++++++++++++++++++
 arch/s390/mm/checkpoint_s390.h         |   23 ++++
 include/linux/checkpoint_hdr.h         |    2 +
 9 files changed, 329 insertions(+), 1 deletions(-)
 create mode 100644 arch/s390/include/asm/checkpoint_hdr.h
 create mode 100644 arch/s390/mm/checkpoint.c
 create mode 100644 arch/s390/mm/checkpoint_s390.h

diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 63a2341..3282a6e 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -8,6 +8,7 @@ header-y += ucontext.h
 header-y += vtoc.h
 header-y += zcrypt.h
 header-y += chsc.h
+header-y += checkpoint_hdr.h
 
 unifdef-y += cmb.h
 unifdef-y += debug.h
diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..ad9449e
--- /dev/null
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -0,0 +1,89 @@
+#ifndef __ASM_S390_CKPT_HDR_H
+#define __ASM_S390_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers s/390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#error asm/checkpoint_hdr.h included directly
+#endif
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+#ifdef CONFIG_64BIT
+#define CKPT_ARCH_ID	CKPT_ARCH_S390X
+/* else - if we ever support 32bit - CKPT_ARCH_S390 */
+#endif
+
+/*
+ * Notes
+ * NUM_GPRS defined in <asm/ptrace.h> to be 16
+ * NUM_FPRS defined in <asm/ptrace.h> to be 16
+ * NUM_ACRS defined in <asm/ptrace.h> to be 16
+ * NUM_CR_WORDS defined in <asm/ptrace.h> to be 3
+ */
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	__u64 args[1];
+	__u64 gprs[NUM_GPRS];
+	__u64 orig_gpr2;
+	__u16 svcnr;
+	__u16 ilc;
+	__u32 acrs[NUM_ACRS];
+	__u64 ieee_instruction_pointer;
+
+	/* psw_t */
+	__u64 psw_t_mask;
+	__u64 psw_t_addr;
+
+	/* s390_fp_regs_t */
+	__u32 fpc;
+	union {
+		float f;
+		double d;
+		__u64 ui;
+		struct {
+			__u32 fp_hi;
+			__u32 fp_lo;
+		} fp;
+	} fprs[NUM_FPRS];
+
+	/* per_struct */
+	__u64 per_control_regs[NUM_CR_WORDS];
+	__u64 starting_addr;
+	__u64 ending_addr;
+	__u64 address;
+	__u16 perc_atmid;
+	__u8 access_id;
+	__u8 single_step;
+	__u8 instruction_fetch;
+};
+
+struct ckpt_hdr_mm_context {
+	struct ckpt_hdr h;
+	unsigned long vdso_base;
+	int noexec;
+	int has_pgste;
+	int alloc_pgste;
+	unsigned long asce_bits;
+	unsigned long asce_limit;
+};
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+};
+
+#endif /* __ASM_S390_CKPT_HDR_H */
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index c80602d..5d1678a 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,9 @@
 #define	__NR_pwritev		329
 #define __NR_rt_tgsigqueueinfo	330
 #define __NR_perf_counter_open	331
-#define NR_syscalls 332
+#define __NR_checkpoint		332
+#define __NR_restart		333
+#define NR_syscalls 334
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S
index 88a8336..e882f99 100644
--- a/arch/s390/kernel/compat_wrapper.S
+++ b/arch/s390/kernel/compat_wrapper.S
@@ -1840,3 +1840,17 @@ sys_perf_counter_open_wrapper:
 	lgfr	%r5,%r5			# int
 	llgfr	%r6,%r6			# unsigned long
 	jg	sys_perf_counter_open	# branch to system call
+
+	.globl sys_checkpoint_wrapper
+sys_checkpoint_wrapper:
+	lgfr	%r2,%r2			# pid_t
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
+	jg	compat_sys_checkpoint
+
+	.globl sys_restore_wrapper
+sys_restore_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
+	jg	compat_sys_restore
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index ad1acd2..67518e2 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -340,3 +340,5 @@ SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper)
 SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper)
 SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */
 SYSCALL(sys_perf_counter_open,sys_perf_counter_open,sys_perf_counter_open_wrapper)
+SYSCALL(sys_checkpoint,sys_checkpoint,sys_checkpoint_wrapper)
+SYSCALL(sys_restart,sys_restart,sys_restore_wrapper)
diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
index db05661..e3d356d 100644
--- a/arch/s390/mm/Makefile
+++ b/arch/s390/mm/Makefile
@@ -6,3 +6,4 @@ obj-y	 := init.o fault.o extmem.o mmap.o vmem.o pgtable.o maccess.o
 obj-$(CONFIG_CMM) += cmm.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_PAGE_STATES) += page-states.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o
diff --git a/arch/s390/mm/checkpoint.c b/arch/s390/mm/checkpoint.c
new file mode 100644
index 0000000..40dd417
--- /dev/null
+++ b/arch/s390/mm/checkpoint.c
@@ -0,0 +1,194 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/elf.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static void s390_copy_regs(int op, struct ckpt_hdr_cpu *h,
+			   struct task_struct *t)
+{
+	struct pt_regs *regs = task_pt_regs(t);
+	struct thread_struct *thr = &t->thread;
+
+	/* Save the whole PSW to facilitate forensic debugging, but only
+	 * restore the address portion to avoid letting userspace do
+	 * bad things by manipulating its value.
+	 */
+	if (op == CKPT_CPT) {
+		CKPT_COPY(op, h->psw_t_addr, regs->psw.addr);
+	} else {
+		regs->psw.addr &= ~PSW_ADDR_INSN;
+		regs->psw.addr |= h->psw_t_addr;
+	}
+
+	CKPT_COPY(op, h->args[0], regs->args[0]);
+	CKPT_COPY(op, h->orig_gpr2, regs->orig_gpr2);
+	CKPT_COPY(op, h->svcnr, regs->svcnr);
+	CKPT_COPY(op, h->ilc, regs->ilc);
+	CKPT_COPY(op, h->ieee_instruction_pointer,
+		thr->ieee_instruction_pointer);
+	CKPT_COPY(op, h->psw_t_mask, regs->psw.mask);
+	CKPT_COPY(op, h->fpc, thr->fp_regs.fpc);
+	CKPT_COPY(op, h->starting_addr, thr->per_info.starting_addr);
+	CKPT_COPY(op, h->ending_addr, thr->per_info.ending_addr);
+	CKPT_COPY(op, h->address, thr->per_info.lowcore.words.address);
+	CKPT_COPY(op, h->perc_atmid, thr->per_info.lowcore.words.perc_atmid);
+	CKPT_COPY(op, h->access_id, thr->per_info.lowcore.words.access_id);
+	CKPT_COPY(op, h->single_step, thr->per_info.single_step);
+	CKPT_COPY(op, h->instruction_fetch, thr->per_info.instruction_fetch);
+
+	CKPT_COPY_ARRAY(op, h->gprs, regs->gprs, NUM_GPRS);
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into gprs[2], so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (op == CKPT_CPT && t == current) {
+		BUG_ON((long) h->gprs[2] < 0);
+		h->gprs[2] = 0;
+	}
+	CKPT_COPY_ARRAY(op, h->fprs, thr->fp_regs.fprs, NUM_FPRS);
+	CKPT_COPY_ARRAY(op, h->acrs, thr->acrs, NUM_ACRS);
+	CKPT_COPY_ARRAY(op, h->per_control_regs,
+		      thr->per_info.control_regs.words.cr, NUM_CR_WORDS);
+}
+
+static void s390_mm(int op, struct ckpt_hdr_mm_context *h,
+		    struct mm_struct *mm)
+{
+	CKPT_COPY(op, h->noexec, mm->context.noexec);
+	CKPT_COPY(op, h->has_pgste, mm->context.has_pgste);
+	CKPT_COPY(op, h->alloc_pgste, mm->context.alloc_pgste);
+	CKPT_COPY(op, h->asce_bits, mm->context.asce_bits);
+	CKPT_COPY(op, h->asce_limit, mm->context.asce_limit);
+}
+
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (!h)
+		return -ENOMEM;
+
+	s390_copy_regs(CKPT_CPT, h, t);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/* Write an empty header since it is assumed to be there */
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (!h)
+		return -ENOMEM;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (!h)
+		return -ENOMEM;
+
+	s390_mm(CKPT_CPT, h, mm);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	s390_copy_regs(CKPT_RST, h, current);
+
+	/* s390 does not restore the access registers after a syscall,
+	 * but does on a task switch.  Since we're switching tasks (in
+	 * a way), we need to replicate that behavior here.
+	 */
+	restore_access_regs(h->acrs);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	s390_mm(CKPT_RST, h, mm);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
diff --git a/arch/s390/mm/checkpoint_s390.h b/arch/s390/mm/checkpoint_s390.h
new file mode 100644
index 0000000..c3bf24d
--- /dev/null
+++ b/arch/s390/mm/checkpoint_s390.h
@@ -0,0 +1,23 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _S390_CHECKPOINT_H
+#define _S390_CHECKPOINT_H
+
+#include <linux/checkpoint_hdr.h>
+#include <linux/sched.h>
+#include <linux/mm_types.h>
+
+extern void checkpoint_s390_regs(int op, struct ckpt_hdr_cpu *h,
+				 struct task_struct *t);
+extern void checkpoint_s390_mm(int op, struct ckpt_hdr_mm_context *h,
+			       struct mm_struct *mm);
+
+#endif /* _S390_CHECKPOINT_H */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cb7dfc8..ef260a0 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -88,7 +88,9 @@ enum {
 
 /* architecture */
 enum {
+	/* do not change order (will break ABI) */
 	CKPT_ARCH_X86_32 = 1,
+	CKPT_ARCH_S390X,
 };
 
 /* shared objrects (objref) */
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 57/80] c/r: define s390-specific checkpoint-restart code
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Dan Smith

From: Dan Smith <danms@us.ibm.com>

Implement the s390 arch-specific checkpoint/restart helpers.  This
is on top of Oren Laadan's c/r code.

With these, I am able to checkpoint and restart simple programs as per
Oren's patch intro.  While on x86 I never had to freeze a single task
to checkpoint it, on s390 I do need to.  That is a prerequisite for
consistent snapshots (especially with multiple processes) anyway, so I
don't see that as a problem.

Changelog:
    Jun 15:
            . Fix checkpoint and restart compat wrappers
    May 28:
            . Export asm/checkpoint_hdr.h to userspace
            . Define CKPT_ARCH_ID for S390
    Apr 11:
            . Introduce ckpt_arch_vdso()
    Feb 27:
            . Add checkpoint_s390.h
            . Fixed up save and restore of PSW, with the non-address bits
              properly masked out
    Feb 25:
            . Make checkpoint_hdr.h safe for inclusion in userspace
            . Replace comment about vdso code
            . Add comment about restoring access registers
            . Write and read an empty ckpt_hdr_header_arch record to appease
              code (mktree) that expects it to be there
            . Utilize NUM_CKPT_WORDS in checkpoint_hdr.h
    Feb 24:
            . Use CKPT_COPY() to unify the un/loading of cpu and mm state
            . Fix fprs definition in ckpt_hdr_cpu
            . Remove debug WARN_ON() from checkpoint.c
    Feb 23:
            . Macro-ize the un/packing of trace flags
            . Fix the crash when externally-linked
            . Break out the restart functions into restart.c
            . Remove unneeded s390_enable_sie() call
    Jan 30:
            . Switched types in ckpt_hdr_cpu to __u64 etc.
              (Per Oren suggestion)
            . Replaced direct inclusion of structs in
              ckpt_hdr_cpu with the struct members.
              (Per Oren suggestion)
            . Also ended up adding a bunch of new things
              into restart (mm_segment, ksp, etc) in vain
              attempt to get code using fpu to not segfault
              after restart.
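
For readers skimming the changelog, here is a rough user-space sketch of
two ideas it mentions: a single direction-aware copy helper (in the
spirit of CKPT_COPY()) and a restore path that takes only the address
bits of the saved PSW, as the comment in s390_copy_regs() describes.
Everything prefixed sketch_/SKETCH_ is hypothetical and not part of the
patch; the mask value is picked for illustration and is not the
kernel's PSW_ADDR_INSN, and the real CKPT_COPY() lives in the generic
c/r headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the CKPT_CPT / CKPT_RST operation codes. */
enum sketch_op { SKETCH_CPT = 1, SKETCH_RST = 2 };

/*
 * Direction-aware copy, modelled on how CKPT_COPY() is used in this patch:
 * on checkpoint the image field receives the live value, on restart the
 * live value receives the image field.
 */
#define SKETCH_COPY(op, image, live)		\
	do {					\
		if ((op) == SKETCH_CPT)		\
			(image) = (live);	\
		else				\
			(live) = (image);	\
	} while (0)

/* Assumed instruction-address mask (31-bit value chosen for illustration). */
#define SKETCH_ADDR_MASK 0x7fffffffULL

struct sketch_regs  { uint64_t psw_mask, psw_addr, gpr2; };
struct sketch_image { uint64_t psw_mask, psw_addr, gpr2; };

static void sketch_copy_regs(enum sketch_op op, struct sketch_image *h,
			     struct sketch_regs *r, int self)
{
	assert(op == SKETCH_CPT || op == SKETCH_RST);

	if (op == SKETCH_CPT) {
		/* Save the whole PSW so the image can be inspected later. */
		h->psw_mask = r->psw_mask;
		h->psw_addr = r->psw_addr;
	} else {
		/* Restore only the address bits; the live mask is kept. */
		r->psw_addr &= ~SKETCH_ADDR_MASK;
		r->psw_addr |= h->psw_addr & SKETCH_ADDR_MASK;
	}

	SKETCH_COPY(op, h->gpr2, r->gpr2);

	/* A task checkpointing itself records the syscall's return value 0. */
	if (op == SKETCH_CPT && self)
		h->gpr2 = 0;
}
```

Calling sketch_copy_regs() once with SKETCH_CPT and once with SKETCH_RST
on a fresh register set carries the address and gpr2 across, while a
tampered psw_mask in the image never reaches the restored registers.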

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 arch/s390/include/asm/Kbuild           |    1 +
 arch/s390/include/asm/checkpoint_hdr.h |   89 +++++++++++++++
 arch/s390/include/asm/unistd.h         |    4 +-
 arch/s390/kernel/compat_wrapper.S      |   14 +++
 arch/s390/kernel/syscalls.S            |    2 +
 arch/s390/mm/Makefile                  |    1 +
 arch/s390/mm/checkpoint.c              |  194 ++++++++++++++++++++++++++++++++
 arch/s390/mm/checkpoint_s390.h         |   23 ++++
 include/linux/checkpoint_hdr.h         |    2 +
 9 files changed, 329 insertions(+), 1 deletions(-)
 create mode 100644 arch/s390/include/asm/checkpoint_hdr.h
 create mode 100644 arch/s390/mm/checkpoint.c
 create mode 100644 arch/s390/mm/checkpoint_s390.h

diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 63a2341..3282a6e 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -8,6 +8,7 @@ header-y += ucontext.h
 header-y += vtoc.h
 header-y += zcrypt.h
 header-y += chsc.h
+header-y += checkpoint_hdr.h
 
 unifdef-y += cmb.h
 unifdef-y += debug.h
diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..ad9449e
--- /dev/null
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -0,0 +1,89 @@
+#ifndef __ASM_S390_CKPT_HDR_H
+#define __ASM_S390_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers s/390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#error asm/checkpoint_hdr.h included directly
+#endif
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+#ifdef CONFIG_64BIT
+#define CKPT_ARCH_ID	CKPT_ARCH_S390X
+/* else - if we ever support 32bit - CKPT_ARCH_S390 */
+#endif
+
+/*
+ * Notes
+ * NUM_GPRS defined in <asm/ptrace.h> to be 16
+ * NUM_FPRS defined in <asm/ptrace.h> to be 16
+ * NUM_ACRS defined in <asm/ptrace.h> to be 16
+ * NUM_CR_WORDS defined in <asm/ptrace.h> to be 3
+ */
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	__u64 args[1];
+	__u64 gprs[NUM_GPRS];
+	__u64 orig_gpr2;
+	__u16 svcnr;
+	__u16 ilc;
+	__u32 acrs[NUM_ACRS];
+	__u64 ieee_instruction_pointer;
+
+	/* psw_t */
+	__u64 psw_t_mask;
+	__u64 psw_t_addr;
+
+	/* s390_fp_regs_t */
+	__u32 fpc;
+	union {
+		float f;
+		double d;
+		__u64 ui;
+		struct {
+			__u32 fp_hi;
+			__u32 fp_lo;
+		} fp;
+	} fprs[NUM_FPRS];
+
+	/* per_struct */
+	__u64 per_control_regs[NUM_CR_WORDS];
+	__u64 starting_addr;
+	__u64 ending_addr;
+	__u64 address;
+	__u16 perc_atmid;
+	__u8 access_id;
+	__u8 single_step;
+	__u8 instruction_fetch;
+};
+
+struct ckpt_hdr_mm_context {
+	struct ckpt_hdr h;
+	unsigned long vdso_base;
+	int noexec;
+	int has_pgste;
+	int alloc_pgste;
+	unsigned long asce_bits;
+	unsigned long asce_limit;
+};
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+};
+
+#endif /* __ASM_S390_CKPT_HDR_H */
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index c80602d..5d1678a 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,9 @@
 #define	__NR_pwritev		329
 #define __NR_rt_tgsigqueueinfo	330
 #define __NR_perf_counter_open	331
-#define NR_syscalls 332
+#define __NR_checkpoint		332
+#define __NR_restart		333
+#define NR_syscalls 334
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S
index 88a8336..e882f99 100644
--- a/arch/s390/kernel/compat_wrapper.S
+++ b/arch/s390/kernel/compat_wrapper.S
@@ -1840,3 +1840,17 @@ sys_perf_counter_open_wrapper:
 	lgfr	%r5,%r5			# int
 	llgfr	%r6,%r6			# unsigned long
 	jg	sys_perf_counter_open	# branch to system call
+
+	.globl sys_checkpoint_wrapper
+sys_checkpoint_wrapper:
+	lgfr	%r2,%r2			# pid_t
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
+	jg	compat_sys_checkpoint
+
+	.globl sys_restore_wrapper
+sys_restore_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
+	jg	compat_sys_restore
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index ad1acd2..67518e2 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -340,3 +340,5 @@ SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper)
 SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper)
 SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */
 SYSCALL(sys_perf_counter_open,sys_perf_counter_open,sys_perf_counter_open_wrapper)
+SYSCALL(sys_checkpoint,sys_checkpoint,sys_checkpoint_wrapper)
+SYSCALL(sys_restart,sys_restart,sys_restore_wrapper)
diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
index db05661..e3d356d 100644
--- a/arch/s390/mm/Makefile
+++ b/arch/s390/mm/Makefile
@@ -6,3 +6,4 @@ obj-y	 := init.o fault.o extmem.o mmap.o vmem.o pgtable.o maccess.o
 obj-$(CONFIG_CMM) += cmm.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_PAGE_STATES) += page-states.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o
diff --git a/arch/s390/mm/checkpoint.c b/arch/s390/mm/checkpoint.c
new file mode 100644
index 0000000..40dd417
--- /dev/null
+++ b/arch/s390/mm/checkpoint.c
@@ -0,0 +1,194 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/elf.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static void s390_copy_regs(int op, struct ckpt_hdr_cpu *h,
+			   struct task_struct *t)
+{
+	struct pt_regs *regs = task_pt_regs(t);
+	struct thread_struct *thr = &t->thread;
+
+	/* Save the whole PSW to facilitate forensic debugging, but only
+	 * restore the address portion to avoid letting userspace do
+	 * bad things by manipulating its value.
+	 */
+	if (op == CKPT_CPT) {
+		CKPT_COPY(op, h->psw_t_addr, regs->psw.addr);
+	} else {
+		regs->psw.addr &= ~PSW_ADDR_INSN;
+		regs->psw.addr |= h->psw_t_addr;
+	}
+
+	CKPT_COPY(op, h->args[0], regs->args[0]);
+	CKPT_COPY(op, h->orig_gpr2, regs->orig_gpr2);
+	CKPT_COPY(op, h->svcnr, regs->svcnr);
+	CKPT_COPY(op, h->ilc, regs->ilc);
+	CKPT_COPY(op, h->ieee_instruction_pointer,
+		thr->ieee_instruction_pointer);
+	CKPT_COPY(op, h->psw_t_mask, regs->psw.mask);
+	CKPT_COPY(op, h->fpc, thr->fp_regs.fpc);
+	CKPT_COPY(op, h->starting_addr, thr->per_info.starting_addr);
+	CKPT_COPY(op, h->ending_addr, thr->per_info.ending_addr);
+	CKPT_COPY(op, h->address, thr->per_info.lowcore.words.address);
+	CKPT_COPY(op, h->perc_atmid, thr->per_info.lowcore.words.perc_atmid);
+	CKPT_COPY(op, h->access_id, thr->per_info.lowcore.words.access_id);
+	CKPT_COPY(op, h->single_step, thr->per_info.single_step);
+	CKPT_COPY(op, h->instruction_fetch, thr->per_info.instruction_fetch);
+
+	CKPT_COPY_ARRAY(op, h->gprs, regs->gprs, NUM_GPRS);
+	/*
+	 * For checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into gprs[2], so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (op == CKPT_CPT && t == current) {
+		BUG_ON(h->gprs[2] < 0);
+		h->gprs[2] = 0;
+	}
+	CKPT_COPY_ARRAY(op, h->fprs, thr->fp_regs.fprs, NUM_FPRS);
+	CKPT_COPY_ARRAY(op, h->acrs, thr->acrs, NUM_ACRS);
+	CKPT_COPY_ARRAY(op, h->per_control_regs,
+		      thr->per_info.control_regs.words.cr, NUM_CR_WORDS);
+}
+
+static void s390_mm(int op, struct ckpt_hdr_mm_context *h,
+		    struct mm_struct *mm)
+{
+	CKPT_COPY(op, h->noexec, mm->context.noexec);
+	CKPT_COPY(op, h->has_pgste, mm->context.has_pgste);
+	CKPT_COPY(op, h->alloc_pgste, mm->context.alloc_pgste);
+	CKPT_COPY(op, h->asce_bits, mm->context.asce_bits);
+	CKPT_COPY(op, h->asce_limit, mm->context.asce_limit);
+}
+
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (!h)
+		return -ENOMEM;
+
+	s390_copy_regs(CKPT_CPT, h, t);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/* Write an empty header since it is assumed to be there */
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (!h)
+		return -ENOMEM;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (!h)
+		return -ENOMEM;
+
+	s390_mm(CKPT_CPT, h, mm);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	s390_copy_regs(CKPT_RST, h, current);
+
+	/* s390 does not restore the access registers after a syscall,
+	 * but does on a task switch.  Since we're switching tasks (in
+	 * a way), we need to replicate that behavior here.
+	 */
+	restore_access_regs(h->acrs);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	s390_mm(CKPT_RST, h, mm);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
diff --git a/arch/s390/mm/checkpoint_s390.h b/arch/s390/mm/checkpoint_s390.h
new file mode 100644
index 0000000..c3bf24d
--- /dev/null
+++ b/arch/s390/mm/checkpoint_s390.h
@@ -0,0 +1,23 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _S390_CHECKPOINT_H
+#define _S390_CHECKPOINT_H
+
+#include <linux/checkpoint_hdr.h>
+#include <linux/sched.h>
+#include <linux/mm_types.h>
+
+extern void checkpoint_s390_regs(int op, struct ckpt_hdr_cpu *h,
+				 struct task_struct *t);
+extern void checkpoint_s390_mm(int op, struct ckpt_hdr_mm_context *h,
+			       struct mm_struct *mm);
+
+#endif /* _S390_CHECKPOINT_H */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cb7dfc8..ef260a0 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -88,7 +88,9 @@ enum {
 
 /* architecture */
 enum {
+	/* do not change order (will break ABI) */
 	CKPT_ARCH_X86_32 = 1,
+	CKPT_ARCH_S390X,
 };
 
 /* shared objrects (objref) */
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 58/80] c/r: clone_with_pids: define the s390 syscall
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (56 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 57/80] c/r: define s390-specific checkpoint-restart code Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 59/80] c/r: capabilities: define checkpoint and restore fns Oren Laadan
                     ` (23 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

Hook up the clone_with_pids system call for s390x.  clone_with_pids()
takes an additional argument over clone(), which we pass in through
register 7.  Stub code for using the syscall looks like:

struct target_pid_set {
        int num_pids;
        pid_t *target_pids;
        unsigned long flags;
};

#define do_clone_with_pids(stack, flags, setp) ({ \
    register unsigned long int __r2 asm ("2") = (unsigned long int)(stack); \
    register unsigned long int __r3 asm ("3") = (unsigned long int)(flags); \
    register unsigned long int __r4 asm ("4") = (unsigned long int)(NULL); \
    register unsigned long int __r5 asm ("5") = (unsigned long int)(NULL); \
    register unsigned long int __r6 asm ("6") = (unsigned long int)(NULL); \
    register unsigned long int __r7 asm ("7") = (unsigned long int)(setp); \
    register unsigned long int __result asm ("2"); \
    __asm__ __volatile__( \
            " lghi %%r1,334\n"	/* __NR_clone_with_pids */ \
            " svc 0\n" \
            : "=d" (__result) \
            : "0" (__r2), "d" (__r3), \
              "d" (__r4), "d" (__r5), "d" (__r6), "d" (__r7) \
            : "1", "cc", "memory" \
    ); \
    __result; \
    })

    struct target_pid_set pid_set;
    pid_t pids[1] = { 19799 };
    pid_set.num_pids = 1;
    pid_set.target_pids = &pids[0];
    pid_set.flags = 0;

    rc = do_clone_with_pids(topstack, clone_flags, &pid_set);
    if (rc == 0)
	printf("Child\n");
    else if (rc > 0)
	printf("Parent: child pid %d\n", rc);
    else
	printf("Error %d\n", rc);
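
The diff in this patch also adds a 31-bit compat entry point,
sys32_clone_with_pids().  As a rough illustration of the narrowing it
applies to the register values, here is a plain C sketch.  The names
sketch_args and sketch_compat_args are hypothetical, and the 31-bit
pointer mask is an assumption standing in for what compat_ptr() does
on s390:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the values the compat entry point extracts. */
struct sketch_args {
	uint64_t clone_flags;
	uint64_t newsp;
	uint64_t upid_setp;
};

/*
 * Interpret 64-bit register contents as a 31-bit caller's arguments:
 * flags keep their low 32 bits, addresses keep their low 31 bits.
 * A zero stack pointer falls back to the caller's current stack (r15).
 */
static struct sketch_args sketch_compat_args(const uint64_t gprs[16],
					     uint64_t orig_gpr2)
{
	struct sketch_args a;

	a.clone_flags = gprs[3] & 0xffffffffULL;
	a.newsp = orig_gpr2 & 0x7fffffffULL;
	a.upid_setp = gprs[7] & 0x7fffffffULL;	/* assumed compat_ptr() mask */
	if (!a.newsp)
		a.newsp = gprs[15];
	return a;
}
```

The point of the sketch is only that the upper register halves, which a
31-bit caller does not control, never reach the fork path.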

Changelog[v18]:
  - Set return value for self-checkpoint

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 arch/s390/include/asm/unistd.h  |    3 ++-
 arch/s390/kernel/compat_linux.c |   19 +++++++++++++++++++
 arch/s390/kernel/process.c      |   19 +++++++++++++++++++
 arch/s390/kernel/syscalls.S     |    1 +
 4 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index 5d1678a..2a84f9c 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -271,7 +271,8 @@
 #define __NR_perf_counter_open	331
 #define __NR_checkpoint		332
 #define __NR_restart		333
-#define NR_syscalls 334
+#define __NR_clone_with_pids	334
+#define NR_syscalls 335
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 9ab188d..c6dc681 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -818,6 +818,25 @@ asmlinkage long sys32_clone(void)
 		       parent_tidptr, child_tidptr);
 }
 
+asmlinkage long sys32_clone_with_pids(void)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr, *child_tidptr;
+	void __user *upid_setp;
+
+	clone_flags = regs->gprs[3] & 0xffffffffUL;
+	newsp = regs->orig_gpr2 & 0x7fffffffUL;
+	parent_tidptr = compat_ptr(regs->gprs[4]);
+	child_tidptr = compat_ptr(regs->gprs[5]);
+	upid_setp = compat_ptr(regs->gprs[7]);
+	if (!newsp)
+		newsp = regs->gprs[15];
+	return do_fork_with_pids(clone_flags, newsp, regs, 0,
+		       parent_tidptr, child_tidptr, upid_setp);
+}
+
 /*
  * 31 bit emulation wrapper functions for sys_fadvise64/fadvise64_64.
  * These need to rewrite the advise values for POSIX_FADV_{DONTNEED,NOREUSE}
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 5a43f27..263d3ab 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -247,6 +247,25 @@ SYSCALL_DEFINE0(clone)
 		       parent_tidptr, child_tidptr);
 }
 
+SYSCALL_DEFINE0(clone_with_pids)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr, *child_tidptr;
+	void __user *upid_setp;
+
+	clone_flags = regs->gprs[3];
+	newsp = regs->orig_gpr2;
+	parent_tidptr = (int __user *) regs->gprs[4];
+	child_tidptr = (int __user *) regs->gprs[5];
+	upid_setp = (void __user *) regs->gprs[7];
+	if (!newsp)
+		newsp = regs->gprs[15];
+	return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
+			child_tidptr, upid_setp);
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 67518e2..db850e7 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -342,3 +342,4 @@ SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo
 SYSCALL(sys_perf_counter_open,sys_perf_counter_open,sys_perf_counter_open_wrapper)
 SYSCALL(sys_checkpoint,sys_checkpoint,sys_checkpoint_wrapper)
 SYSCALL(sys_restart,sys_restart,sys_restore_wrapper)
+SYSCALL(sys_clone_with_pids,sys_clone_with_pids,sys_clone_with_pids_wrapper)
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 58/80] c/r: clone_with_pids: define the s390 syscall
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

Hook up the clone_with_pids system call for s390x.  clone_with_pids()
takes an additional argument over clone(), which we pass in through
register 7.  Stub code for using the syscall looks like:

struct target_pid_set {
        int num_pids;
        pid_t *target_pids;
        unsigned long flags;
};

#define do_clone_with_pids(stack, flags, setp) ({ \
    register unsigned long int __r2 asm ("2") = (unsigned long int)(stack); \
    register unsigned long int __r3 asm ("3") = (unsigned long int)(flags); \
    register unsigned long int __r4 asm ("4") = (unsigned long int)(NULL); \
    register unsigned long int __r5 asm ("5") = (unsigned long int)(NULL); \
    register unsigned long int __r6 asm ("6") = (unsigned long int)(NULL); \
    register unsigned long int __r7 asm ("7") = (unsigned long int)(setp); \
    register unsigned long int __result asm ("2"); \
    __asm__ __volatile__( \
            " lghi %%r1,334\n"	/* __NR_clone_with_pids */ \
            " svc 0\n" \
            : "=d" (__result) \
            : "0" (__r2), "d" (__r3), \
              "d" (__r4), "d" (__r5), "d" (__r6), "d" (__r7) \
            : "1", "cc", "memory" \
    ); \
    __result; \
    })

    struct target_pid_set pid_set;
    pid_t pids[1] = { 19799 };
    pid_set.num_pids = 1;
    pid_set.target_pids = &pids[0];
    pid_set.flags = 0;

    rc = do_clone_with_pids(topstack, clone_flags, &pid_set);
    if (rc == 0)
	printf("Child\n");
    else if (rc > 0)
	printf("Parent: child pid %d\n", rc);
    else
	printf("Error %d\n", rc);

Changelog[v18]:
  - Set return value for self-checkpoint

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 arch/s390/include/asm/unistd.h  |    3 ++-
 arch/s390/kernel/compat_linux.c |   19 +++++++++++++++++++
 arch/s390/kernel/process.c      |   19 +++++++++++++++++++
 arch/s390/kernel/syscalls.S     |    1 +
 4 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index 5d1678a..2a84f9c 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -271,7 +271,8 @@
 #define __NR_perf_counter_open	331
 #define __NR_checkpoint		332
 #define __NR_restart		333
-#define NR_syscalls 334
+#define __NR_clone_with_pids	334
+#define NR_syscalls 335
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 9ab188d..c6dc681 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -818,6 +818,25 @@ asmlinkage long sys32_clone(void)
 		       parent_tidptr, child_tidptr);
 }
 
+asmlinkage long sys32_clone_with_pids(void)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr, *child_tidptr;
+	void __user *upid_setp;
+
+	clone_flags = regs->gprs[3] & 0xffffffffUL;
+	newsp = regs->orig_gpr2 & 0x7fffffffUL;
+	parent_tidptr = compat_ptr(regs->gprs[4]);
+	child_tidptr = compat_ptr(regs->gprs[5]);
+	upid_setp = compat_ptr(regs->gprs[7]);
+	if (!newsp)
+		newsp = regs->gprs[15];
+	return do_fork_with_pids(clone_flags, newsp, regs, 0,
+		       parent_tidptr, child_tidptr, upid_setp);
+}
+
 /*
  * 31 bit emulation wrapper functions for sys_fadvise64/fadvise64_64.
  * These need to rewrite the advise values for POSIX_FADV_{DONTNEED,NOREUSE}
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 5a43f27..263d3ab 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -247,6 +247,25 @@ SYSCALL_DEFINE0(clone)
 		       parent_tidptr, child_tidptr);
 }
 
+SYSCALL_DEFINE0(clone_with_pids)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr, *child_tidptr;
+	void __user *upid_setp;
+
+	clone_flags = regs->gprs[3];
+	newsp = regs->orig_gpr2;
+	parent_tidptr = (int __user *) regs->gprs[4];
+	child_tidptr = (int __user *) regs->gprs[5];
+	upid_setp = (void __user *) regs->gprs[7];
+	if (!newsp)
+		newsp = regs->gprs[15];
+	return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
+			child_tidptr, upid_setp);
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 67518e2..db850e7 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -342,3 +342,4 @@ SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo
 SYSCALL(sys_perf_counter_open,sys_perf_counter_open,sys_perf_counter_open_wrapper)
 SYSCALL(sys_checkpoint,sys_checkpoint,sys_checkpoint_wrapper)
 SYSCALL(sys_restart,sys_restart,sys_restore_wrapper)
+SYSCALL(sys_clone_with_pids,sys_clone_with_pids,sys_clone_with_pids_wrapper)
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 59/80] c/r: capabilities: define checkpoint and restore fns
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (57 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 58/80] c/r: clone_with_pids: define the s390 syscall Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 60/80] c/r: checkpoint and restore task credentials Oren Laadan
                     ` (22 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

[ Andrew: I am punting on dealing with the subsystem cooperation
issues in this version, in favor of trying to get LSM issues
straightened out ]

An application checkpoint image will store each capability set
(and the bounding set) as a pair of __u32s.  Define checkpoint and
restore functions to translate between those and kernel_cap_t's.
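
As a rough illustration of that translation, the sketch below round-trips
one capability set through two 32-bit image words, the layout used by
struct ckpt_capabilities in this patch.  struct demo_cap, struct
demo_image and the demo_*() helpers are hypothetical userspace stand-ins,
not kernel interfaces:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's kernel_cap_t. */
struct demo_cap {
	uint32_t cap[2];
};

/* Hypothetical mirror of one set's word pair in struct ckpt_capabilities. */
struct demo_image {
	uint32_t cap_e_0, cap_e_1;
};

/* Checkpoint direction: copy both 32-bit words into the image. */
static void demo_checkpoint(struct demo_image *h, const struct demo_cap *c)
{
	h->cap_e_0 = c->cap[0];
	h->cap_e_1 = c->cap[1];
}

/* Restore direction: rebuild the in-memory set from the image words. */
static void demo_restore(struct demo_cap *c, const struct demo_image *h)
{
	c->cap[0] = h->cap_e_0;
	c->cap[1] = h->cap_e_1;
}

/* Capability n lives in word n / 32 at bit n % 32. */
static int demo_cap_raised(const struct demo_cap *c, int n)
{
	assert(n >= 0 && n < 64);
	return (c->cap[n / 32] >> (n % 32)) & 1u;
}
```

Keeping the two words separate in the image sidesteps any question of how
a 64-bit value would be split, which is presumably part of why the __u32
layout "looks cleaner".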

Define a common function do_capset_tocred() which applies capability
set changes to a passed-in struct cred.

The restore function uses do_capset_tocred() to apply the restored
capabilities to the struct cred being crafted, subject to the
current task's (task executing sys_restart()) permissions.
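
For the bounding set, "subject to the current task's permissions" comes
down to a drop-only rule: restore_cap_bset() can only lower bits, and
lowering any bit requires CAP_SETPCAP.  A minimal model of that rule is
sketched below.  It treats each set as a plain 64-bit mask, and
demo_restore_bset() and its may_drop flag (standing in for
capable(CAP_SETPCAP)) are hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * image:        bounding set recorded in the checkpoint image
 * current_bset: bounding set of the task calling sys_restart()
 * may_drop:     nonzero if the caller may lower bounding-set bits
 * result:       receives the bounding set for the new credentials
 *
 * Returns 0 on success or -EPERM if a drop is needed but not allowed.
 */
static int demo_restore_bset(uint64_t image, uint64_t current_bset,
			     int may_drop, uint64_t *result)
{
	/* Bits the caller holds but the image does not must be dropped. */
	uint64_t to_drop = current_bset & ~image;

	assert(result != NULL);

	if (to_drop && !may_drop)
		return -EPERM;

	/* Bits present only in the image are never gained. */
	*result = current_bset & ~to_drop;
	return 0;
}
```

In this model the restored bounding set is always the intersection of the
image's set and the caller's set, so a restart can narrow privileges but
cannot widen them.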

Changelog:
	Jun 09: Can't choose securebits or drop bounding set if
		file capabilities aren't compiled into the kernel.
		Also just store caps in __u32s (looks cleaner).
	Jun 01: Made the checkpoint and restore functions and the
		ckpt_hdr_capabilities struct more opaque to the
		rest of the c/r code, as suggested by Andrew Morgan,
		and using naming suggested by Oren.
	Jun 01: Add commented BUILD_BUG_ON() to point out that the
		current implementation depends on 64-bit capabilities.
		(Andrew Morgan and Alexey Dobriyan).
	May 28: add helpers to c/r securebits

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 include/linux/capability.h     |    6 ++
 include/linux/checkpoint_hdr.h |   11 +++
 kernel/capability.c            |  164 +++++++++++++++++++++++++++++++++++++---
 security/commoncap.c           |   19 +----
 4 files changed, 172 insertions(+), 28 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index c302110..3a74655 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -568,6 +568,12 @@ extern int capable(int cap);
 struct dentry;
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
+struct cred;
+int apply_securebits(unsigned securebits, struct cred *new);
+struct ckpt_capabilities;
+int restore_capabilities(struct ckpt_capabilities *h, struct cred *new);
+void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred * cred);
+
 #endif /* __KERNEL__ */
 
 #endif /* !_LINUX_CAPABILITY_H */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ef260a0..cb036e8 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -61,6 +61,7 @@ enum {
 	CKPT_HDR_NS,
 	CKPT_HDR_UTS_NS,
 	CKPT_HDR_IPC_NS,
+	CKPT_HDR_CAPABILITIES,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -192,6 +193,16 @@ struct ckpt_hdr_task {
 	__u64 robust_futex_list; /* a __user ptr */
 } __attribute__((aligned(8)));
 
+/* Posix capabilities */
+struct ckpt_capabilities {
+	__u32 cap_i_0, cap_i_1; /* inheritable set */
+	__u32 cap_p_0, cap_p_1; /* permitted set */
+	__u32 cap_e_0, cap_e_1; /* effective set */
+	__u32 cap_b_0, cap_b_1; /* bounding set */
+	__u32 securebits;
+	__u32 padding;
+} __attribute__((aligned(8)));
+
 /* namespaces */
 struct ckpt_hdr_task_ns {
 	struct ckpt_hdr h;
diff --git a/kernel/capability.c b/kernel/capability.c
index 4e17041..4f58454 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -14,6 +14,8 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/pid_namespace.h>
+#include <linux/securebits.h>
+#include <linux/checkpoint.h>
 #include <asm/uaccess.h>
 #include "cred-internals.h"
 
@@ -217,6 +219,45 @@ SYSCALL_DEFINE2(capget, cap_user_header_t, header, cap_user_data_t, dataptr)
 	return ret;
 }
 
+static int do_capset_tocred(kernel_cap_t *effective, kernel_cap_t *inheritable,
+			kernel_cap_t *permitted, struct cred *new)
+{
+	int ret;
+
+	ret = security_capset(new, current_cred(),
+			      effective, inheritable, permitted);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * for checkpoint-restart, do we want to wait until end of restart?
+	 * not sure we care */
+	audit_log_capset(current->pid, new, current_cred());
+
+	return 0;
+}
+
+static int do_capset(kernel_cap_t *effective, kernel_cap_t *inheritable,
+			kernel_cap_t *permitted)
+{
+	struct cred *new;
+	int ret;
+
+	new = prepare_creds();
+	if (!new)
+		return -ENOMEM;
+
+	ret = do_capset_tocred(effective, inheritable, permitted, new);
+	if (ret < 0)
+		goto error;
+
+	return commit_creds(new);
+
+error:
+	abort_creds(new);
+	return ret;
+}
+
 /**
  * sys_capset - set capabilities for a process or (*) a group of processes
  * @header: pointer to struct that contains capability version and
@@ -240,7 +281,6 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const cap_user_data_t, data)
 	struct __user_cap_data_struct kdata[_KERNEL_CAPABILITY_U32S];
 	unsigned i, tocopy;
 	kernel_cap_t inheritable, permitted, effective;
-	struct cred *new;
 	int ret;
 	pid_t pid;
 
@@ -271,23 +311,125 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const cap_user_data_t, data)
 		i++;
 	}
 
-	new = prepare_creds();
-	if (!new)
-		return -ENOMEM;
+	return do_capset(&effective, &inheritable, &permitted);
 
-	ret = security_capset(new, current_cred(),
-			      &effective, &inheritable, &permitted);
+}
+
+#ifdef CONFIG_SECURITY_FILE_CAPABILITIES
+int apply_securebits(unsigned securebits, struct cred *new)
+{
+	if ((((new->securebits & SECURE_ALL_LOCKS) >> 1)
+	     & (new->securebits ^ securebits))				/*[1]*/
+	    || ((new->securebits & SECURE_ALL_LOCKS & ~securebits))	/*[2]*/
+	    || (securebits & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS))	/*[3]*/
+	    || (cap_capable(current, current_cred(), CAP_SETPCAP,
+			    SECURITY_CAP_AUDIT) != 0)			/*[4]*/
+		/*
+		 * [1] no changing of bits that are locked
+		 * [2] no unlocking of locks
+		 * [3] no setting of unsupported bits
+		 * [4] doing anything requires privilege (go read about
+		 *     the "sendmail capabilities bug")
+		 */
+	    )
+		/* cannot change a locked bit */
+		return -EPERM;
+	new->securebits = securebits;
+	return 0;
+}
+
+static void do_capbset_drop(struct cred *cred, int cap)
+{
+	cap_lower(cred->cap_bset, cap);
+}
+
+static inline int restore_cap_bset(kernel_cap_t bset, struct cred *cred)
+{
+	int i, may_dropbcap = capable(CAP_SETPCAP);
+
+	for (i = 0; i < CAP_LAST_CAP; i++) {
+		if (cap_raised(bset, i))
+			continue;
+		if (!cap_raised(current_cred()->cap_bset, i))
+			continue;
+		if (!may_dropbcap)
+			return -EPERM;
+		do_capbset_drop(cred, i);
+	}
+
+	return 0;
+}
+
+#else /* CONFIG_SECURITY_FILE_CAPABILITIES */
+
+int apply_securebits(unsigned securebits, struct cred *new)
+{
+	/* settable securebits not supported */
+	return 0;
+}
+
+static inline int restore_cap_bset(kernel_cap_t bset, struct cred *cred)
+{
+	/* bounding sets not supported */
+	return 0;
+}
+#endif /* CONFIG_SECURITY_FILE_CAPABILITIES */
+
+#ifdef CONFIG_CHECKPOINT
+static int do_restore_caps(struct ckpt_capabilities *h, struct cred *cred)
+{
+	kernel_cap_t effective, inheritable, permitted, bset;
+	int ret;
+
+	effective.cap[0] = h->cap_e_0;
+	effective.cap[1] = h->cap_e_1;
+	inheritable.cap[0] = h->cap_i_0;
+	inheritable.cap[1] = h->cap_i_1;
+	permitted.cap[0] = h->cap_p_0;
+	permitted.cap[1] = h->cap_p_1;
+	bset.cap[0] = h->cap_b_0;
+	bset.cap[1] = h->cap_b_1;
+
+	ret = do_capset_tocred(&effective, &inheritable, &permitted, cred);
 	if (ret < 0)
-		goto error;
+		return ret;
+
+	ret = restore_cap_bset(bset, cred);
+	return ret;
+}
 
-	audit_log_capset(pid, new, current_cred());
+void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred * cred)
+{
+	BUILD_BUG_ON(CAP_LAST_CAP >= 64);
+	h->securebits = cred->securebits;
+	h->cap_i_0 = cred->cap_inheritable.cap[0];
+	h->cap_i_1 = cred->cap_inheritable.cap[1];
+	h->cap_p_0 = cred->cap_permitted.cap[0];
+	h->cap_p_1 = cred->cap_permitted.cap[1];
+	h->cap_e_0 = cred->cap_effective.cap[0];
+	h->cap_e_1 = cred->cap_effective.cap[1];
+	h->cap_b_0 = cred->cap_bset.cap[0];
+	h->cap_b_1 = cred->cap_bset.cap[1];
+}
 
-	return commit_creds(new);
+/*
+ * restore_capabilities: called by restore_creds() to set the
+ * restored capabilities (if permitted) in a new struct cred which
+ * will be attached at the end of the sys_restart().
+ * struct cred *new is prepared by caller (using prepare_creds())
+ * (and aborted by caller on error)
+ * return 0 on success, < 0 on error
+ */
+int restore_capabilities(struct ckpt_capabilities *h, struct cred *new)
+{
+	int ret = do_restore_caps(h, new);
+
+	if (!ret)
+		ret = apply_securebits(h->securebits, new);
 
-error:
-	abort_creds(new);
 	return ret;
 }
+#endif /* CONFIG_CHECKPOINT */
 
 /**
  * capable - Determine if the current task has a superior capability in effect
diff --git a/security/commoncap.c b/security/commoncap.c
index e3097c0..892815b 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -893,24 +893,9 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 	 * capability-based-privilege environment.
 	 */
 	case PR_SET_SECUREBITS:
-		error = -EPERM;
-		if ((((new->securebits & SECURE_ALL_LOCKS) >> 1)
-		     & (new->securebits ^ arg2))			/*[1]*/
-		    || ((new->securebits & SECURE_ALL_LOCKS & ~arg2))	/*[2]*/
-		    || (arg2 & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS))	/*[3]*/
-		    || (cap_capable(current, current_cred(), CAP_SETPCAP,
-				    SECURITY_CAP_AUDIT) != 0)		/*[4]*/
-			/*
-			 * [1] no changing of bits that are locked
-			 * [2] no unlocking of locks
-			 * [3] no setting of unsupported bits
-			 * [4] doing anything requires privilege (go read about
-			 *     the "sendmail capabilities bug")
-			 */
-		    )
-			/* cannot change a locked bit */
+		error = apply_securebits(arg2, new);
+		if (error)
 			goto error;
-		new->securebits = arg2;
 		goto changed;
 
 	case PR_GET_SECUREBITS:
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 59/80] c/r: capabilities: define checkpoint and restore fns
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

[ Andrew: I am punting on dealing with the subsystem cooperation
issues in this version, in favor of trying to get LSM issues
straightened out ]

An application checkpoint image will store each capability set
(and the bounding set) as a pair of __u32s.  Define checkpoint and
restore functions to translate between those and kernel_cap_t's.

Define a common function do_capset_tocred() which applies capability
set changes to a passed-in struct cred.

The restore function uses do_capset_tocred() to apply the restored
capabilities to the struct cred being crafted, subject to the
current task's (task executing sys_restart()) permissions.

Changelog:
	Jun 09: Can't choose securebits or drop bounding set if
		file capabilities aren't compiled into the kernel.
		Also just store caps in __u32s (looks cleaner).
	Jun 01: Made the checkpoint and restore functions and the
		ckpt_hdr_capabilities struct more opaque to the
		rest of the c/r code, as suggested by Andrew Morgan,
		and using naming suggested by Oren.
	Jun 01: Add commented BUILD_BUG_ON() to point out that the
		current implementation depends on 64-bit capabilities.
		(Andrew Morgan and Alexey Dobriyan).
	May 28: add helpers to c/r securebits

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 include/linux/capability.h     |    6 ++
 include/linux/checkpoint_hdr.h |   11 +++
 kernel/capability.c            |  164 +++++++++++++++++++++++++++++++++++++---
 security/commoncap.c           |   19 +----
 4 files changed, 172 insertions(+), 28 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index c302110..3a74655 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -568,6 +568,12 @@ extern int capable(int cap);
 struct dentry;
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
+struct cred;
+int apply_securebits(unsigned securebits, struct cred *new);
+struct ckpt_capabilities;
+int restore_capabilities(struct ckpt_capabilities *h, struct cred *new);
+void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred * cred);
+
 #endif /* __KERNEL__ */
 
 #endif /* !_LINUX_CAPABILITY_H */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ef260a0..cb036e8 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -61,6 +61,7 @@ enum {
 	CKPT_HDR_NS,
 	CKPT_HDR_UTS_NS,
 	CKPT_HDR_IPC_NS,
+	CKPT_HDR_CAPABILITIES,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -192,6 +193,16 @@ struct ckpt_hdr_task {
 	__u64 robust_futex_list; /* a __user ptr */
 } __attribute__((aligned(8)));
 
+/* Posix capabilities */
+struct ckpt_capabilities {
+	__u32 cap_i_0, cap_i_1; /* inheritable set */
+	__u32 cap_p_0, cap_p_1; /* permitted set */
+	__u32 cap_e_0, cap_e_1; /* effective set */
+	__u32 cap_b_0, cap_b_1; /* bounding set */
+	__u32 securebits;
+	__u32 padding;
+} __attribute__((aligned(8)));
+
 /* namespaces */
 struct ckpt_hdr_task_ns {
 	struct ckpt_hdr h;
diff --git a/kernel/capability.c b/kernel/capability.c
index 4e17041..4f58454 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -14,6 +14,8 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/pid_namespace.h>
+#include <linux/securebits.h>
+#include <linux/checkpoint.h>
 #include <asm/uaccess.h>
 #include "cred-internals.h"
 
@@ -217,6 +219,45 @@ SYSCALL_DEFINE2(capget, cap_user_header_t, header, cap_user_data_t, dataptr)
 	return ret;
 }
 
+static int do_capset_tocred(kernel_cap_t *effective, kernel_cap_t *inheritable,
+			kernel_cap_t *permitted, struct cred *new)
+{
+	int ret;
+
+	ret = security_capset(new, current_cred(),
+			      effective, inheritable, permitted);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * for checkpoint-restart, do we want to wait until end of restart?
+	 * not sure we care */
+	audit_log_capset(current->pid, new, current_cred());
+
+	return 0;
+}
+
+static int do_capset(kernel_cap_t *effective, kernel_cap_t *inheritable,
+			kernel_cap_t *permitted)
+{
+	struct cred *new;
+	int ret;
+
+	new = prepare_creds();
+	if (!new)
+		return -ENOMEM;
+
+	ret = do_capset_tocred(effective, inheritable, permitted, new);
+	if (ret < 0)
+		goto error;
+
+	return commit_creds(new);
+
+error:
+	abort_creds(new);
+	return ret;
+}
+
 /**
  * sys_capset - set capabilities for a process or (*) a group of processes
  * @header: pointer to struct that contains capability version and
@@ -240,7 +281,6 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const cap_user_data_t, data)
 	struct __user_cap_data_struct kdata[_KERNEL_CAPABILITY_U32S];
 	unsigned i, tocopy;
 	kernel_cap_t inheritable, permitted, effective;
-	struct cred *new;
 	int ret;
 	pid_t pid;
 
@@ -271,23 +311,125 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const cap_user_data_t, data)
 		i++;
 	}
 
-	new = prepare_creds();
-	if (!new)
-		return -ENOMEM;
+	return do_capset(&effective, &inheritable, &permitted);
 
-	ret = security_capset(new, current_cred(),
-			      &effective, &inheritable, &permitted);
+}
+
+#ifdef CONFIG_SECURITY_FILE_CAPABILITIES
+int apply_securebits(unsigned securebits, struct cred *new)
+{
+	if ((((new->securebits & SECURE_ALL_LOCKS) >> 1)
+	     & (new->securebits ^ securebits))				/*[1]*/
+	    || ((new->securebits & SECURE_ALL_LOCKS & ~securebits))	/*[2]*/
+	    || (securebits & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS))	/*[3]*/
+	    || (cap_capable(current, current_cred(), CAP_SETPCAP,
+			    SECURITY_CAP_AUDIT) != 0)			/*[4]*/
+		/*
+		 * [1] no changing of bits that are locked
+		 * [2] no unlocking of locks
+		 * [3] no setting of unsupported bits
+		 * [4] doing anything requires privilege (go read about
+		 *     the "sendmail capabilities bug")
+		 */
+	    )
+		/* cannot change a locked bit */
+		return -EPERM;
+	new->securebits = securebits;
+	return 0;
+}
+
+static void do_capbset_drop(struct cred *cred, int cap)
+{
+	cap_lower(cred->cap_bset, cap);
+}
+
+static inline int restore_cap_bset(kernel_cap_t bset, struct cred *cred)
+{
+	int i, may_dropbcap = capable(CAP_SETPCAP);
+
+	for (i = 0; i < CAP_LAST_CAP; i++) {
+		if (cap_raised(bset, i))
+			continue;
+		if (!cap_raised(current_cred()->cap_bset, i))
+			continue;
+		if (!may_dropbcap)
+			return -EPERM;
+		do_capbset_drop(cred, i);
+	}
+
+	return 0;
+}
+
+#else /* CONFIG_SECURITY_FILE_CAPABILITIES */
+
+int apply_securebits(unsigned securebits, struct cred *new)
+{
+	/* settable securebits not supported */
+	return 0;
+}
+
+static inline int restore_cap_bset(kernel_cap_t bset, struct cred *cred)
+{
+	/* bounding sets not supported */
+	return 0;
+}
+#endif /* CONFIG_SECURITY_FILE_CAPABILITIES */
+
+#ifdef CONFIG_CHECKPOINT
+static int do_restore_caps(struct ckpt_capabilities *h, struct cred *cred)
+{
+	kernel_cap_t effective, inheritable, permitted, bset;
+	int ret;
+
+	effective.cap[0] = h->cap_e_0;
+	effective.cap[1] = h->cap_e_1;
+	inheritable.cap[0] = h->cap_i_0;
+	inheritable.cap[1] = h->cap_i_1;
+	permitted.cap[0] = h->cap_p_0;
+	permitted.cap[1] = h->cap_p_1;
+	bset.cap[0] = h->cap_b_0;
+	bset.cap[1] = h->cap_b_1;
+
+	ret = do_capset_tocred(&effective, &inheritable, &permitted, cred);
 	if (ret < 0)
-		goto error;
+		return ret;
+
+	ret = restore_cap_bset(bset, cred);
+	return ret;
+}
 
-	audit_log_capset(pid, new, current_cred());
+void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred * cred)
+{
+	BUILD_BUG_ON(CAP_LAST_CAP >= 64);
+	h->securebits = cred->securebits;
+	h->cap_i_0 = cred->cap_inheritable.cap[0];
+	h->cap_i_1 = cred->cap_inheritable.cap[1];
+	h->cap_p_0 = cred->cap_permitted.cap[0];
+	h->cap_p_1 = cred->cap_permitted.cap[1];
+	h->cap_e_0 = cred->cap_effective.cap[0];
+	h->cap_e_1 = cred->cap_effective.cap[1];
+	h->cap_b_0 = cred->cap_bset.cap[0];
+	h->cap_b_1 = cred->cap_bset.cap[1];
+}
 
-	return commit_creds(new);
+/*
+ * restore_capabilities: called by restore_creds() to set the
+ * restored capabilities (if permitted) in a new struct cred which
+ * will be attached at the end of the sys_restart().
+ * struct cred *new is prepared by caller (using prepare_creds())
+ * (and aborted by caller on error)
+ * return 0 on success, < 0 on error
+ */
+int restore_capabilities(struct ckpt_capabilities *h, struct cred *new)
+{
+	int ret = do_restore_caps(h, new);
+
+	if (!ret)
+		ret = apply_securebits(h->securebits, new);
 
-error:
-	abort_creds(new);
 	return ret;
 }
+#endif /* CONFIG_CHECKPOINT */
 
 /**
  * capable - Determine if the current task has a superior capability in effect
diff --git a/security/commoncap.c b/security/commoncap.c
index e3097c0..892815b 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -893,24 +893,9 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 	 * capability-based-privilege environment.
 	 */
 	case PR_SET_SECUREBITS:
-		error = -EPERM;
-		if ((((new->securebits & SECURE_ALL_LOCKS) >> 1)
-		     & (new->securebits ^ arg2))			/*[1]*/
-		    || ((new->securebits & SECURE_ALL_LOCKS & ~arg2))	/*[2]*/
-		    || (arg2 & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS))	/*[3]*/
-		    || (cap_capable(current, current_cred(), CAP_SETPCAP,
-				    SECURITY_CAP_AUDIT) != 0)		/*[4]*/
-			/*
-			 * [1] no changing of bits that are locked
-			 * [2] no unlocking of locks
-			 * [3] no setting of unsupported bits
-			 * [4] doing anything requires privilege (go read about
-			 *     the "sendmail capabilities bug")
-			 */
-		    )
-			/* cannot change a locked bit */
+		error = apply_securebits(arg2, new);
+		if (error)
 			goto error;
-		new->securebits = arg2;
 		goto changed;
 
 	case PR_GET_SECUREBITS:
-- 
1.6.0.4
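
For readers following the securebits logic that the patch above factors
out into apply_securebits(), here is a minimal userspace sketch of the
four checks [1]-[4]. It is a model, not the kernel code: the function
name check_securebits() and the 'privileged' flag (standing in for the
CAP_SETPCAP test) are assumptions made for illustration, and the mask
values assume the v2.6.31 layout in which each lock bit sits one
position above the bit it locks.

```c
#include <assert.h>
#include <errno.h>

/* Assumed v2.6.31 layout: NOROOT, NO_SETUID_FIXUP, KEEP_CAPS at bits 0, 2, 4 */
#define SECURE_ALL_BITS  0x15u
#define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1)

static_assert(SECURE_ALL_LOCKS == 0x2au, "each lock sits one bit above its flag");

/*
 * Hypothetical stand-in for apply_securebits(): returns 0 if 'cur' may
 * be changed to 'want', -EPERM otherwise.  'privileged' models the
 * CAP_SETPCAP check.
 */
static int check_securebits(unsigned cur, unsigned want, int privileged)
{
	if ((((cur & SECURE_ALL_LOCKS) >> 1) & (cur ^ want))	/* [1] locked bit changed */
	    || (cur & SECURE_ALL_LOCKS & ~want)			/* [2] lock removed */
	    || (want & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS))	/* [3] unsupported bit */
	    || !privileged)					/* [4] needs privilege */
		return -EPERM;
	return 0;
}
```

In this model a privileged caller may set a flag and its lock together,
but once the lock is set neither the flag nor the lock can change.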


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 60/80] c/r: checkpoint and restore task credentials
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (58 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 59/80] c/r: capabilities: define checkpoint and restore fns Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 61/80] c/r: restore file->f_cred Oren Laadan
                     ` (21 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

From: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

This patch adds the checkpointing and restart of credentials
(uids, gids, and capabilities) to Oren's c/r patchset (on top
of v14).  It goes to great pains to re-use (and define when
needed) common helpers, in order to make sure that as security
code is modified, the c/r code will be updated.  Some of the
helpers should still be moved (i.e. _creds() functions should
be in kernel/cred.c).

When building the credentials for the restarted process, I
1. create a new struct cred as a copy of the running task's
cred (using prepare_creds())
2. always authorize any changes to the new struct cred
based on the permissions of current_cred() (not the current
transient state of the new cred).

While this may mean that certain transient_cred1->transient_cred2
states are allowed which otherwise wouldn't be allowed, the
fact remains that current_cred() is allowed to transition to
transient_cred2.

The reconstructed creds are applied to the task at the very
end of the sys_restart call.  This ensures that any objects which
need to be re-created (file, socket, etc) are re-created using
the creds of the task calling sys_restart - preventing an unpriv
user from creating a privileged object, and ensuring that a
root task can restart a process which had started out privileged,
created some privileged objects, then dropped its privilege.
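
The ordering argument above can be sketched with a small userspace
model. Everything here is hypothetical (struct cred_model,
struct task_model, recreate_object() and restart_model() are not kernel
names), and the model deliberately omits the per-field authorization of
the restored creds; it only shows why committing the restored creds as
the last step matters.

```c
#include <errno.h>
#include <stdbool.h>

struct cred_model { unsigned uid; bool privileged; };
struct task_model { struct cred_model cred; };

/* Re-create one object; permission is judged by the task's CURRENT creds */
static int recreate_object(const struct task_model *t, bool needs_priv,
			   bool *created)
{
	if (needs_priv && !t->cred.privileged)
		return -EPERM;
	*created = true;
	return 0;
}

/*
 * Model of a restart: objects are rebuilt while the task still runs
 * with the caller's creds; the restored creds are applied only at the
 * very end, and not at all if anything failed earlier.
 */
static int restart_model(struct task_model *t,
			 const struct cred_model *restored,
			 bool obj_needs_priv, bool *created)
{
	int ret = recreate_object(t, obj_needs_priv, created);

	if (ret < 0)
		return ret;
	t->cred = *restored;	/* last step: switch to the restored creds */
	return 0;
}
```

With this ordering a root caller can rebuild a privileged object and
still end up running with the image's unprivileged creds, while an
unprivileged caller fails before its creds are touched.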

With these patches, the root user can restart checkpoint images
(created by either hallyn or root) of user hallyn's tasks,
resulting in a program owned by hallyn.

Changelog:
	Sep 08: [NTL] discard const from struct cred * where appropriate
	Jun 15: Fix user_ns handling when !CONFIG_USER_NS
		Set creator_ref=0 for root_ns (discard @flags)
		Don't overwrite global user-ns if CONFIG_USER_NS
	Jun 10: Merge with ckpt-v16-dev (Oren Laadan)
	Jun 01: Don't check ordering of groups in group_info, bc
		set_groups() will sort it for us.
	May 28: 1. Restore securebits
		2. Address Alexey's comments: move prototypes out of
		   sched.h, validate ngroups < NGROUPS_MAX, validate
		   groups are sorted, and get rid of ckpt_hdr_cred->version.
		3. remove bogus unused flag RESTORE_CREATE_USERNS
	May 26: Move group, user, userns, creds c/r functions out
		of checkpoint/process.c and into the appropriate files.
	May 26: Define struct ckpt_hdr_task_creds and move task cred
		objref c/r into {checkpoint_restore}_task_shared().
	May 26: Take cred refs around checkpoint_write_creds()
	May 20: Remove the limit on number of groups in groupinfo
		at checkpoint time
	May 20: Remove the depth limit on empty user namespaces
	May 20: Better document checkpoint_user
	May 18: fix more refcounting: if (userns 5, uid 0) had
		no active tasks or child user_namespaces, then
		it shouldn't exist at restart or it, its namespace,
		and its whole chain of creators will be leaked.
	May 14: fix some refcounting:
		1. a new user_ns needs a ref to remain pinned
		   by its root user
		2. current_user_ns needs an extra ref bc objhash
		   drops two on restart
		3. cred needs a ref for the real credentials bc
		   commit_creds eats one ref.
	May 13: folded in fix to userns refcounting.

Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
[orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org: merge with ckpt-v16-dev]
---
 checkpoint/namespace.c           |   41 ++++++++++
 checkpoint/objhash.c             |   82 ++++++++++++++++++++
 checkpoint/process.c             |  111 ++++++++++++++++++++++++++-
 include/linux/capability.h       |    6 +-
 include/linux/checkpoint.h       |   12 +++
 include/linux/checkpoint_hdr.h   |   59 ++++++++++++++
 include/linux/checkpoint_types.h |    2 +
 kernel/cred.c                    |  123 +++++++++++++++++++++++++++++
 kernel/groups.c                  |   69 +++++++++++++++++
 kernel/user.c                    |  158 ++++++++++++++++++++++++++++++++++++++
 kernel/user_namespace.c          |   89 +++++++++++++++++++++
 11 files changed, 746 insertions(+), 6 deletions(-)

diff --git a/checkpoint/namespace.c b/checkpoint/namespace.c
index 49b8f0a..89af2c0 100644
--- a/checkpoint/namespace.c
+++ b/checkpoint/namespace.c
@@ -98,3 +98,44 @@ void *restore_uts_ns(struct ckpt_ctx *ctx)
 {
 	return (void *) do_restore_uts_ns(ctx);
 }
+
+/*
+ * user_ns  -  trivial checkpoint/restore for !CONFIG_USER_NS case
+ */
+#ifndef CONFIG_USER_NS
+int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct ckpt_hdr_user_ns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (!h)
+		return -ENOMEM;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+void *restore_userns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_user_ns *h;
+	struct user_namespace *ns;
+
+	/* complain if image contains multiple namespaces */
+	if (ctx->stats.user_ns)
+		return ERR_PTR(-EEXIST);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	if (h->creator_ref)
+		ns = ERR_PTR(-EINVAL);
+	else
+		ns = get_user_ns(current_user_ns());
+
+	ctx->stats.user_ns++;
+	ckpt_hdr_put(ctx, h);
+	return ns;
+}
+#endif
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 15a5caf..f8a3210 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -17,6 +17,7 @@
 #include <linux/fdtable.h>
 #include <linux/sched.h>
 #include <linux/ipc_namespace.h>
+#include <linux/user_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -172,6 +173,51 @@ static int obj_ipc_ns_users(void *ptr)
 	return atomic_read(&((struct ipc_namespace *) ptr)->count);
 }
 
+static int obj_cred_grab(void *ptr)
+{
+	get_cred((struct cred *) ptr);
+	return 0;
+}
+
+static void obj_cred_drop(void *ptr, int lastref)
+{
+	put_cred((struct cred *) ptr);
+}
+
+static int obj_user_grab(void *ptr)
+{
+	struct user_struct *u = ptr;
+	(void) get_uid(u);
+	return 0;
+}
+
+static void obj_user_drop(void *ptr, int lastref)
+{
+	free_uid((struct user_struct *) ptr);
+}
+
+static int obj_userns_grab(void *ptr)
+{
+	get_user_ns((struct user_namespace *) ptr);
+	return 0;
+}
+
+static void obj_userns_drop(void *ptr, int lastref)
+{
+	put_user_ns((struct user_namespace *) ptr);
+}
+
+static int obj_groupinfo_grab(void *ptr)
+{
+	get_group_info((struct group_info *) ptr);
+	return 0;
+}
+
+static void obj_groupinfo_drop(void *ptr, int lastref)
+{
+	put_group_info((struct group_info *) ptr);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -247,6 +293,42 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_ipc_ns,
 		.restore = restore_ipc_ns,
 	},
+	/* user_ns object */
+	{
+		.obj_name = "USER_NS",
+		.obj_type = CKPT_OBJ_USER_NS,
+		.ref_drop = obj_userns_drop,
+		.ref_grab = obj_userns_grab,
+		.checkpoint = checkpoint_userns,
+		.restore = restore_userns,
+	},
+	/* struct cred */
+	{
+		.obj_name = "CRED",
+		.obj_type = CKPT_OBJ_CRED,
+		.ref_drop = obj_cred_drop,
+		.ref_grab = obj_cred_grab,
+		.checkpoint = checkpoint_cred,
+		.restore = restore_cred,
+	},
+	/* user object */
+	{
+		.obj_name = "USER",
+		.obj_type = CKPT_OBJ_USER,
+		.ref_drop = obj_user_drop,
+		.ref_grab = obj_user_grab,
+		.checkpoint = checkpoint_user,
+		.restore = restore_user,
+	},
+	/* struct groupinfo */
+	{
+		.obj_name = "GROUPINFO",
+		.obj_type = CKPT_OBJ_GROUPINFO,
+		.ref_drop = obj_groupinfo_drop,
+		.ref_grab = obj_groupinfo_grab,
+		.checkpoint = checkpoint_groupinfo,
+		.restore = restore_groupinfo,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index b34ee3d..1e79f73 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -18,6 +18,7 @@
 #include <linux/compat.h>
 #include <linux/poll.h>
 #include <linux/utsname.h>
+#include <linux/user_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/syscalls.h>
@@ -136,6 +137,45 @@ static int checkpoint_task_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int checkpoint_task_creds(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	int realcred_ref, ecred_ref;
+	struct cred *rcred, *ecred;
+	struct ckpt_hdr_task_creds *h;
+	int ret;
+
+	rcred = (struct cred *) get_cred(t->real_cred);
+	ecred = (struct cred *) get_cred(t->cred);
+
+	realcred_ref = checkpoint_obj(ctx, rcred, CKPT_OBJ_CRED);
+	if (realcred_ref < 0) {
+		ret = realcred_ref;
+		goto error;
+	}
+
+	ecred_ref = checkpoint_obj(ctx, ecred, CKPT_OBJ_CRED);
+	if (ecred_ref < 0) {
+		ret = ecred_ref;
+		goto error;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_CREDS);
+	if (!h) {
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	h->cred_ref = realcred_ref;
+	h->ecred_ref = ecred_ref;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+error:
+	put_cred(rcred);
+	put_cred(ecred);
+	return ret;
+}
+
 static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_task_objs *h;
@@ -151,8 +191,12 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	 * restored when it gets to restore, e.g. its memory.
 	 */
 
-	ret = checkpoint_task_ns(ctx, t);
-	ckpt_debug("ns: objref %d\n", ret);
+	ret = checkpoint_task_creds(ctx, t);
+	ckpt_debug("cred: objref %d\n", ret);
+	if (!ret) {
+		ret = checkpoint_task_ns(ctx, t);
+		ckpt_debug("ns: objref %d\n", ret);
+	}
 	if (ret < 0)
 		return ret;
 
@@ -435,6 +479,34 @@ static int restore_task_ns(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_creds(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task_creds *h;
+	struct cred *realcred, *ecred;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_CREDS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	realcred = ckpt_obj_fetch(ctx, h->cred_ref, CKPT_OBJ_CRED);
+	if (IS_ERR(realcred)) {
+		ret = PTR_ERR(realcred);
+		goto out;
+	}
+	ecred = ckpt_obj_fetch(ctx, h->ecred_ref, CKPT_OBJ_CRED);
+	if (IS_ERR(ecred)) {
+		ret = PTR_ERR(ecred);
+		goto out;
+	}
+	ctx->realcred = realcred;
+	ctx->ecred = ecred;
+
+out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 static int restore_task_objs(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_task_objs *h;
@@ -445,7 +517,9 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	 * and because shared objects are restored before they are
 	 * referenced. See comment in checkpoint_task_objs.
 	 */
-	ret = restore_task_ns(ctx);
+	ret = restore_task_creds(ctx);
+	if (!ret)
+		ret = restore_task_ns(ctx);
 	if (ret < 0)
 		return ret;
 
@@ -463,6 +537,33 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_creds(struct ckpt_ctx *ctx)
+{
+	int ret;
+	const struct cred *old;
+	struct cred *rcred, *ecred;
+
+	rcred = ctx->realcred;
+	ecred = ctx->ecred;
+
+	/* commit_creds will take one ref for the eff creds, but
+	 * expects us to hold a ref for the obj creds, so take a
+	 * ref here */
+	get_cred(rcred);
+	ret = commit_creds(rcred);
+	if (ret)
+		return ret;
+
+	if (ecred == rcred)
+		return 0;
+
+	old = override_creds(ecred); /* override_creds otoh takes new ref */
+	put_cred(old);
+
+	ctx->realcred = ctx->ecred = NULL;
+	return 0;
+}
+
 int restore_restart_block(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_restart_block *h;
@@ -596,6 +697,10 @@ int restore_task(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_cpu(ctx);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = restore_creds(ctx);
+	ckpt_debug("creds: ret %d\n", ret);
  out:
 	return ret;
 }
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 3a74655..2f726f7 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -569,10 +569,10 @@ struct dentry;
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
 struct cred;
-int apply_securebits(unsigned securebits, struct cred *new);
+extern int apply_securebits(unsigned securebits, struct cred *new);
 struct ckpt_capabilities;
-int restore_capabilities(struct ckpt_capabilities *h, struct cred *new);
-void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred * cred);
+extern int restore_capabilities(struct ckpt_capabilities *h, struct cred *new);
+extern void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred *cred);
 
 #endif /* __KERNEL__ */
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 561232d..3dbf188 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -25,6 +25,7 @@
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/ipc_namespace.h>
+#include <linux/user_namespace.h>
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
@@ -192,6 +193,17 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 			       struct ckpt_hdr_file *h);
 
+/* credentials */
+extern int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr);
+extern int checkpoint_user(struct ckpt_ctx *ctx, void *ptr);
+extern int checkpoint_cred(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_groupinfo(struct ckpt_ctx *ctx);
+extern void *restore_user(struct ckpt_ctx *ctx);
+extern void *restore_cred(struct ckpt_ctx *ctx);
+
+extern int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_userns(struct ckpt_ctx *ctx);
+
 /* memory */
 extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cb036e8..3f00bce 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -62,6 +62,11 @@ enum {
 	CKPT_HDR_UTS_NS,
 	CKPT_HDR_IPC_NS,
 	CKPT_HDR_CAPABILITIES,
+	CKPT_HDR_USER_NS,
+	CKPT_HDR_CRED,
+	CKPT_HDR_USER,
+	CKPT_HDR_GROUPINFO,
+	CKPT_HDR_TASK_CREDS,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -111,6 +116,10 @@ enum obj_type {
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_IPC_NS,
+	CKPT_OBJ_USER_NS,
+	CKPT_OBJ_CRED,
+	CKPT_OBJ_USER,
+	CKPT_OBJ_GROUPINFO,
 	CKPT_OBJ_MAX
 };
 
@@ -184,6 +193,11 @@ struct ckpt_hdr_task {
 	__u32 exit_signal;
 	__u32 pdeath_signal;
 
+#ifdef CONFIG_AUDITSYSCALL
+	/* would audit want to track the checkpointed ids,
+	   or (more likely) who actually restarted? */
+#endif
+
 	__u64 set_child_tid;
 	__u64 clear_child_tid;
 
@@ -191,6 +205,7 @@ struct ckpt_hdr_task {
 	__u32 compat_robust_futex_list; /* a compat __user ptr */
 	__u32 robust_futex_head_len;
 	__u64 robust_futex_list; /* a __user ptr */
+
 } __attribute__((aligned(8)));
 
 /* Posix capabilities */
@@ -203,6 +218,50 @@ struct ckpt_capabilities {
 	__u32 padding;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_task_creds {
+	struct ckpt_hdr h;
+	__s32 cred_ref;
+	__s32 ecred_ref;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_cred {
+	struct ckpt_hdr h;
+	__u32 uid, suid, euid, fsuid;
+	__u32 gid, sgid, egid, fsgid;
+	__s32 user_ref;
+	__s32 groupinfo_ref;
+	struct ckpt_capabilities cap_s;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_groupinfo {
+	struct ckpt_hdr h;
+	__u32 ngroups;
+	/*
+	 * This is followed by ngroups __u32s
+	 */
+	__u32 groups[0];
+} __attribute__((aligned(8)));
+
+/*
+ * todo - keyrings and LSM
+ * These may be better done with userspace help though
+ */
+struct ckpt_hdr_user_struct {
+	struct ckpt_hdr h;
+	__u32 uid;
+	__s32 userns_ref;
+} __attribute__((aligned(8)));
+
+/*
+ * The user-struct mostly tracks system resource usage.
+ * Most of its contents therefore will simply be set
+ * correctly as restart opens resources
+ */
+struct ckpt_hdr_user_ns {
+	struct ckpt_hdr h;
+	__s32 creator_ref;
+} __attribute__((aligned(8)));
+
 /* namespaces */
 struct ckpt_hdr_task_ns {
 	struct ckpt_hdr h;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 9632aa4..be45666 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -25,6 +25,7 @@
 struct ckpt_stats {
 	int uts_ns;
 	int ipc_ns;
+	int user_ns;
 };
 
 struct ckpt_ctx {
@@ -70,6 +71,7 @@ struct ckpt_ctx {
 	int active_pid;			/* (next) position in pids array */
 	struct completion complete;	/* container root and other tasks on */
 	wait_queue_head_t waitq;	/* start, end, and restart ordering */
+	struct cred *realcred, *ecred;	/* tmp storage for cred at restart */
 
 	struct ckpt_stats stats;	/* statistics */
 };
diff --git a/kernel/cred.c b/kernel/cred.c
index 5c8db56..9710cae 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -16,6 +16,7 @@
 #include <linux/init_task.h>
 #include <linux/security.h>
 #include <linux/cn_proc.h>
+#include <linux/checkpoint.h>
 #include "cred-internals.h"
 
 static struct kmem_cache *cred_jar;
@@ -703,3 +704,125 @@ int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid)
 	}
 	return -EPERM;
 }
+
+#ifdef CONFIG_CHECKPOINT
+static int do_checkpoint_cred(struct ckpt_ctx *ctx, struct cred *cred)
+{
+	int ret;
+	int groupinfo_ref, user_ref;
+	struct ckpt_hdr_cred *h;
+
+	groupinfo_ref = checkpoint_obj(ctx, cred->group_info,
+					CKPT_OBJ_GROUPINFO);
+	if (groupinfo_ref < 0)
+		return groupinfo_ref;
+	user_ref = checkpoint_obj(ctx, cred->user, CKPT_OBJ_USER);
+	if (user_ref < 0)
+		return user_ref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CRED);
+	if (!h)
+		return -ENOMEM;
+
+	h->uid = cred->uid;
+	h->suid = cred->suid;
+	h->euid = cred->euid;
+	h->fsuid = cred->fsuid;
+
+	h->gid = cred->gid;
+	h->sgid = cred->sgid;
+	h->egid = cred->egid;
+	h->fsgid = cred->fsgid;
+
+	checkpoint_capabilities(&h->cap_s, cred);
+
+	h->user_ref = user_ref;
+	h->groupinfo_ref = groupinfo_ref;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_cred(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_cred(ctx, (struct cred *) ptr);
+}
+
+static struct cred *do_restore_cred(struct ckpt_ctx *ctx)
+{
+	struct cred *cred;
+	struct ckpt_hdr_cred *h;
+	struct user_struct *user;
+	struct group_info *groupinfo;
+	int ret = -EINVAL;
+	uid_t olduid;
+	gid_t oldgid;
+	int i;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CRED);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	cred = prepare_creds();
+	if (!cred)
+		goto error;
+
+
+	/* Do we care if the target user and target group were compatible?
+	 * Probably.  But then, we can't do any setuid without CAP_SETUID,
+	 * so we must have been privileged to abuse it... */
+	groupinfo = ckpt_obj_fetch(ctx, h->groupinfo_ref, CKPT_OBJ_GROUPINFO);
+	if (IS_ERR(groupinfo))
+		goto err_putcred;
+	user = ckpt_obj_fetch(ctx, h->user_ref, CKPT_OBJ_USER);
+	if (IS_ERR(user))
+		goto err_putcred;
+
+	/*
+	 * TODO: this check should go into the common helper in
+	 * kernel/sys.c, and should account for user namespaces
+	 */
+	if (!capable(CAP_SETGID))
+		for (i = 0; i < groupinfo->ngroups; i++) {
+			if (!in_egroup_p(GROUP_AT(groupinfo, i)))
+				goto err_putcred;
+		}
+	ret = set_groups(cred, groupinfo);
+	if (ret < 0)
+		goto err_putcred;
+	free_uid(cred->user);
+	cred->user = get_uid(user);
+	ret = cred_setresuid(cred, h->uid, h->euid, h->suid);
+	if (ret < 0)
+		goto err_putcred;
+	ret = cred_setfsuid(cred, h->fsuid, &olduid);
+	if (olduid != h->fsuid && ret < 0)
+		goto err_putcred;
+	ret = cred_setresgid(cred, h->gid, h->egid, h->sgid);
+	if (ret < 0)
+		goto err_putcred;
+	ret = cred_setfsgid(cred, h->fsgid, &oldgid);
+	if (oldgid != h->fsgid && ret < 0)
+		goto err_putcred;
+	ret = restore_capabilities(&h->cap_s, cred);
+	if (ret)
+		goto err_putcred;
+
+	ckpt_hdr_put(ctx, h);
+	return cred;
+
+err_putcred:
+	abort_creds(cred);
+error:
+	ckpt_hdr_put(ctx, h);
+	return ERR_PTR(ret);
+}
+
+void *restore_cred(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_cred(ctx);
+}
+
+#endif
diff --git a/kernel/groups.c b/kernel/groups.c
index 2b45b2e..3612c3e 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -6,6 +6,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/checkpoint.h>
 #include <asm/uaccess.h>
 
 /* init to 2 - one for init_task, one to ensure it is never freed */
@@ -286,3 +287,71 @@ int in_egroup_p(gid_t grp)
 }
 
 EXPORT_SYMBOL(in_egroup_p);
+
+#ifdef CONFIG_CHECKPOINT
+static int do_checkpoint_groupinfo(struct ckpt_ctx *ctx, struct group_info *g)
+{
+	int ret, i, size;
+	struct ckpt_hdr_groupinfo *h;
+
+	size = sizeof(*h) + g->ngroups * sizeof(__u32);
+	h = ckpt_hdr_get_type(ctx, size, CKPT_HDR_GROUPINFO);
+	if (!h)
+		return -ENOMEM;
+
+	h->ngroups = g->ngroups;
+	for (i = 0; i < g->ngroups; i++)
+		h->groups[i] = GROUP_AT(g, i);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_groupinfo(ctx, (struct group_info *)ptr);
+}
+
+/*
+ * TODO - switch to reading in smaller blocks?
+ */
+#define MAX_GROUPINFO_SIZE (sizeof(*h)+NGROUPS_MAX*sizeof(gid_t))
+static struct group_info *do_restore_groupinfo(struct ckpt_ctx *ctx)
+{
+	struct group_info *g;
+	struct ckpt_hdr_groupinfo *h;
+	int i;
+
+	h = ckpt_read_buf_type(ctx, MAX_GROUPINFO_SIZE, CKPT_HDR_GROUPINFO);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	g = ERR_PTR(-EINVAL);
+	if (h->ngroups > NGROUPS_MAX)
+		goto out;
+
+	for (i = 1; i < h->ngroups; i++)
+		if (h->groups[i-1] >= h->groups[i])
+			goto out;
+
+	g = groups_alloc(h->ngroups);
+	if (!g) {
+		g = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+	for (i = 0; i < h->ngroups; i++)
+		GROUP_AT(g, i) = h->groups[i];
+
+out:
+	ckpt_hdr_put(ctx, h);
+	return g;
+}
+
+void *restore_groupinfo(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_groupinfo(ctx);
+}
+
+#endif
diff --git a/kernel/user.c b/kernel/user.c
index 2c000e7..a535ed6 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -16,6 +16,7 @@
 #include <linux/interrupt.h>
 #include <linux/module.h>
 #include <linux/user_namespace.h>
+#include <linux/checkpoint.h>
 #include "cred-internals.h"
 
 struct user_namespace init_user_ns = {
@@ -508,3 +509,160 @@ static int __init uid_cache_init(void)
 }
 
 module_init(uid_cache_init);
+
+#ifdef CONFIG_CHECKPOINT
+/*
+ * write the user struct
+ * TODO keyring will need to be dumped
+ *
+ * Here is what we're doing.  Remember a task can do clone(CLONE_NEWUSER)
+ * resulting in a cloned task in a new user namespace, with uid 0 in that
+ * new user_ns.  In that case, the parent's user (uid+user_ns) is the
+ * 'creator' of the new user_ns.
+ * Here, we call the user_ns of the ctx->root_task the 'root_ns'.  When we
+ * checkpoint a user-struct, we must store the chain of creators.  We
+ * must not do so recursively, this being the kernel.  In
+ * do_checkpoint_user() we walk and record in memory the list of creators up
+ * to either the latest user_struct which has already been saved, or the
+ * root_ns.  Then we walk that chain backward, writing out the user_ns and
+ * user_struct to the checkpoint image.
+ */
+#define UNSAVED_STRIDE 50
+static int do_checkpoint_user(struct ckpt_ctx *ctx, struct user_struct *u)
+{
+	struct user_namespace *ns, *root_ns;
+	struct ckpt_hdr_user_struct *h;
+	int ns_objref;
+	int ret, i, unsaved_ns_nr = 0;
+	struct user_struct *save_u;
+	struct user_struct **unsaved_creators;
+	int step = 1, size;
+
+	/* if we've already saved the userns, then life is good */
+	ns_objref = ckpt_obj_lookup(ctx, u->user_ns, CKPT_OBJ_USER_NS);
+	if (ns_objref)
+		goto write_user;
+
+	root_ns = task_cred_xxx(ctx->root_task, user)->user_ns;
+
+	if (u->user_ns == root_ns)
+		goto save_last_ns;
+
+	size = UNSAVED_STRIDE*sizeof(struct user_struct *);
+	unsaved_creators = kmalloc(size, GFP_KERNEL);
+	if (!unsaved_creators)
+		return -ENOMEM;
+	save_u = u;
+	do {
+		ns = save_u->user_ns;
+		save_u = ns->creator;
+		if (ckpt_obj_lookup(ctx, save_u, CKPT_OBJ_USER))
+			goto found;
+		unsaved_creators[unsaved_ns_nr++] = save_u;
+		if (unsaved_ns_nr == step * UNSAVED_STRIDE) {
+			step++;
+			size = step*UNSAVED_STRIDE*sizeof(struct user_struct *);
+			unsaved_creators = krealloc(unsaved_creators, size,
+							GFP_KERNEL);
+			if (!unsaved_creators)
+				return -ENOMEM;
+		}
+	} while (ns != root_ns);
+
+found:
+	for (i = unsaved_ns_nr-1; i >= 0; i--) {
+		ret = checkpoint_obj(ctx, unsaved_creators[i], CKPT_OBJ_USER);
+		if (ret < 0) {
+			kfree(unsaved_creators);
+			return ret;
+		}
+	}
+	kfree(unsaved_creators);
+
+save_last_ns:
+	ns_objref = checkpoint_obj(ctx, u->user_ns, CKPT_OBJ_USER_NS);
+	if (ns_objref < 0)
+		return ns_objref;
+
+write_user:
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER);
+	if (!h)
+		return -ENOMEM;
+
+	h->uid = u->uid;
+	h->userns_ref = ns_objref;
+
+	/* write out the user_struct */
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_user(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_user(ctx, (struct user_struct *) ptr);
+}
+
+static int may_setuid(struct user_namespace *ns, uid_t uid)
+{
+	/*
+	 * this next check will one day become
+	 * if capable(CAP_SETUID, ns) return 1;
+	 * followed by uid_equiv(current_userns, current_uid, ns, uid)
+	 * instead of just uids.
+	 */
+	if (capable(CAP_SETUID))
+		return 1;
+
+	/*
+	 * this may be overly strict, but since we might end up
+	 * restarting a privileged program here, we do not want
+	 * someone with only CAP_SYS_ADMIN but no CAP_SETUID to
+	 * be able to create random userids even in a userns he
+	 * created.
+	 */
+	if (current_user()->user_ns != ns)
+		return 0;
+	if (current_uid() == uid ||
+		current_euid() == uid ||
+		current_suid() == uid)
+		return 1;
+	return 0;
+}
+
+static struct user_struct *do_restore_user(struct ckpt_ctx *ctx)
+{
+	struct user_struct *u;
+	struct user_namespace *ns;
+	struct ckpt_hdr_user_struct *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	ns = ckpt_obj_fetch(ctx, h->userns_ref, CKPT_OBJ_USER_NS);
+	if (IS_ERR(ns)) {
+		u = ERR_PTR(PTR_ERR(ns));
+		goto out;
+	}
+
+	if (!may_setuid(ns, h->uid)) {
+		u = ERR_PTR(-EPERM);
+		goto out;
+	}
+	u = alloc_uid(ns, h->uid);
+	if (!u)
+		u = ERR_PTR(-EINVAL);
+
+out:
+	ckpt_hdr_put(ctx, h);
+	return u;
+}
+
+void *restore_user(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_user(ctx);
+}
+
+#endif
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index e624b0f..3a35b50 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -9,6 +9,7 @@
 #include <linux/nsproxy.h>
 #include <linux/slab.h>
 #include <linux/user_namespace.h>
+#include <linux/checkpoint.h>
 #include <linux/cred.h>
 
 static struct user_namespace *_new_user_ns(struct user_struct *creator,
@@ -103,3 +104,91 @@ void free_user_ns(struct kref *kref)
 	schedule_work(&ns->destroyer);
 }
 EXPORT_SYMBOL(free_user_ns);
+
+#ifdef CONFIG_CHECKPOINT
+/*
+ * do_checkpoint_userns() is only called from do_checkpoint_user().
+ * When called, we always know that either:
+ *   1. This is the root_ns (user_ns of the ctx->root_task),
+ *	in which case we set h->creator_ref = 0.
+ * or
+ *   2. The creator has already been written out to the
+ *	checkpoint image (and saved in the objhash)
+ */
+static int do_checkpoint_userns(struct ckpt_ctx *ctx, struct user_namespace *ns)
+{
+	struct ckpt_hdr_user_ns *h;
+	struct user_namespace *root_ns;
+	int creator_ref = 0;
+	int ret;
+
+	root_ns = task_cred_xxx(ctx->root_task, user)->user_ns;
+	if (ns != root_ns) {
+		creator_ref = ckpt_obj_lookup(ctx, ns->creator, CKPT_OBJ_USER);
+		if (!creator_ref)
+			return -EINVAL;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (!h)
+		return -ENOMEM;
+	h->creator_ref = creator_ref;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_userns(ctx, (struct user_namespace *) ptr);
+}
+
+static struct user_namespace *do_restore_userns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_user_ns *h;
+	struct user_namespace *ns;
+	struct user_struct *new_root, *creator;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	if (!h->creator_ref) {
+		ns = get_user_ns(current_user_ns());
+		goto out;
+	}
+
+	creator = ckpt_obj_fetch(ctx, h->creator_ref, CKPT_OBJ_USER);
+	if (IS_ERR(creator)) {
+		ns = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
+	ns = new_user_ns(creator, &new_root);
+	if (IS_ERR(ns))
+		goto out;
+
+	/* ns only referenced from new_root, which we discard below */
+	get_user_ns(ns);
+
+	/* new_user_ns() doesn't bump creator's refcount */
+	get_uid(creator);
+
+	/*
+	 * Free the new root user.  If we actually needed it,
+	 * then it will show up later in the checkpoint image
+	 * The objhash will keep the userns pinned until then.
+	 */
+	free_uid(new_root);
+ out:
+	ctx->stats.user_ns++;
+	ckpt_hdr_put(ctx, h);
+	return ns;
+}
+
+void *restore_userns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_userns(ctx);
+}
+#endif
-- 
1.6.0.4
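
The restore path in the kernel/user_namespace.c hunk above branches on
creator_ref: zero means "reuse the restarting task's namespace", and a
non-zero value must name a user object that was already restored. A
minimal user-space sketch of that decision follows. The enum, the
function name and the user_restored[] table are hypothetical stand-ins
for the objhash lookup, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcomes of reading one user-namespace record. */
enum ns_action { NS_REUSE_CALLER, NS_CREATE_NEW, NS_INVALID };

/*
 * user_restored[ref] != 0 stands in for "the objhash already holds a
 * restored user object under this ref" (an assumption of this model).
 */
enum ns_action model_restore_userns(int creator_ref,
				    const unsigned char *user_restored,
				    int nr_refs)
{
	assert(user_restored != NULL);

	/* creator_ref == 0 marks the root namespace of the image. */
	if (creator_ref == 0)
		return NS_REUSE_CALLER;

	/* Any other ref must point at a creator restored earlier. */
	if (creator_ref < 0 || creator_ref >= nr_refs ||
	    !user_restored[creator_ref])
		return NS_INVALID;

	return NS_CREATE_NEW;
}
```

The model only captures the branching; the reference counting done by
the real code (get_user_ns(), get_uid(), free_uid()) is left out.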


* [PATCH v18 60/80] c/r: checkpoint and restore task credentials
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

This patch adds the checkpointing and restart of credentials
(uids, gids, and capabilities) to Oren's c/r patchset (on top
of v14).  It goes to great pains to re-use (and define when
needed) common helpers, in order to make sure that as security
code is modified, the cr code will be updated.  Some of the
helpers should still be moved (i.e. _creds() functions should
be in kernel/cred.c).

When building the credentials for the restarted process, I
1. create a new struct cred as a copy of the running task's
cred (using prepare_creds())
2. always authorize any changes to the new struct cred
based on the permissions of current_cred() (not the current
transient state of the new cred).

While this may mean that certain transient_cred1->transient_cred2
states are allowed which otherwise wouldn't be allowed, the
fact remains that current_cred() is allowed to transition to
transient_cred2.
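
To illustrate the point about authorizing against current_cred()
rather than the transient cred, here is a small user-space model. It
is only a sketch: struct cred_model, uid_in_cred() and
model_setresuid() are hypothetical names, and the permission rule is
a simplification of what the cred_setresuid() helper checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; not the kernel's struct cred. */
struct cred_model {
	unsigned int uid, euid, suid;
	bool cap_setuid;
};

/* True if @id matches one of the three uids held in @c. */
static bool uid_in_cred(const struct cred_model *c, unsigned int id)
{
	return id == c->uid || id == c->euid || id == c->suid;
}

/*
 * Set the uids of @target, checking permission against @authorizer.
 * Returns 0 on success and -1 if the change is not permitted.
 */
int model_setresuid(struct cred_model *target,
		    const struct cred_model *authorizer,
		    unsigned int ruid, unsigned int euid, unsigned int suid)
{
	assert(target != NULL && authorizer != NULL);

	if (!authorizer->cap_setuid &&
	    !(uid_in_cred(authorizer, ruid) &&
	      uid_in_cred(authorizer, euid) &&
	      uid_in_cred(authorizer, suid)))
		return -1;

	target->uid = ruid;
	target->euid = euid;
	target->suid = suid;
	return 0;
}
```

In this model a transient cred that has already lost CAP_SETUID can
still be moved to another uid, as long as the authorizer passed in is
the caller's original cred.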

The reconstructed creds are applied to the task at the very
end of the sys_restart call.  This ensures that any objects which
need to be re-created (file, socket, etc) are re-created using
the creds of the task calling sys_restart - preventing an unpriv
user from creating a privileged object, and ensuring that a
root task can restart a process which had started out privileged,
created some privileged objects, then dropped its privilege.
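
The ordering described above can be sketched in user space as
follows. This is a toy model under stated assumptions: the structs
and function names are hypothetical, an "object" is reduced to the
uid that owns it, and the final assignment stands in for the
commit_creds() step in restore_creds().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 8

/* Hypothetical model of the task that calls sys_restart. */
struct task_model {
	unsigned int uid;	/* uid the task currently runs as */
};

/* Hypothetical model of one restart in progress. */
struct restart_model {
	unsigned int image_uid;			/* uid stored in the image */
	unsigned int obj_owner[MAX_OBJS];	/* owner of each new object */
	size_t nr_objs;
};

/* Re-create one object; it is owned by the task's uid at this moment. */
void model_restore_obj(struct restart_model *r, const struct task_model *t)
{
	assert(r->nr_objs < MAX_OBJS);
	r->obj_owner[r->nr_objs++] = t->uid;
}

/* Restore @nr objects first; switch credentials only as the last step. */
void model_restart(struct restart_model *r, struct task_model *t, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		model_restore_obj(r, t);
	t->uid = r->image_uid;
}
```

Every object ends up owned by the uid of the restarting task, and
only afterwards does the task take on the uid from the image.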

With these patches, the root user can restart checkpoint images
(created by either hallyn or root) of user hallyn's tasks,
resulting in a program owned by hallyn.

Changelog:
	Sep 08: [NTL] discard const from struct cred * where appropriate
	Jun 15: Fix user_ns handling when !CONFIG_USER_NS
		Set creator_ref=0 for root_ns (discard @flags)
		Don't overwrite global user-ns if CONFIG_USER_NS
	Jun 10: Merge with ckpt-v16-dev (Oren Laadan)
	Jun 01: Don't check ordering of groups in group_info, bc
		set_groups() will sort it for us.
	May 28: 1. Restore securebits
		2. Address Alexey's comments: move prototypes out of
		   sched.h, validate ngroups < NGROUPS_MAX, validate
		   groups are sorted, and get rid of ckpt_hdr_cred->version.
		3. remove bogus unused flag RESTORE_CREATE_USERNS
	May 26: Move group, user, userns, creds c/r functions out
		of checkpoint/process.c and into the appropriate files.
	May 26: Define struct ckpt_hdr_task_creds and move task cred
		objref c/r into {checkpoint_restore}_task_shared().
	May 26: Take cred refs around checkpoint_write_creds()
	May 20: Remove the limit on number of groups in groupinfo
		at checkpoint time
	May 20: Remove the depth limit on empty user namespaces
	May 20: Better document checkpoint_user
	May 18: fix more refcounting: if (userns 5, uid 0) had
		no active tasks or child user_namespaces, then
		it shouldn't exist at restart or it, its namespace,
		and its whole chain of creators will be leaked.
	May 14: fix some refcounting:
		1. a new user_ns needs a ref to remain pinned
		   by its root user
		2. current_user_ns needs an extra ref bc objhash
		   drops two on restart
		3. cred needs a ref for the real credentials bc
		   commit_creds eats one ref.
	May 13: folded in fix to userns refcounting.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
[orenl@cs.columbia.edu: merge with ckpt-v16-dev]
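
The May 28 changelog entry mentions validating the group list read
from the image. The Jun 01 entry says the ordering check was dropped,
although the do_restore_groupinfo() hunk in this posting still
contains the loop. A stand-alone sketch of both checks is given here.
The function name and MODEL_NGROUPS_MAX are hypothetical, and the
limit value is an assumption standing in for NGROUPS_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NGROUPS_MAX 65536	/* assumed stand-in for NGROUPS_MAX */

/*
 * Return 0 if the list is acceptable, -1 otherwise.  The list must not
 * exceed the limit and must be strictly ascending (no duplicates).
 */
int model_validate_groups(const uint32_t *groups, size_t ngroups)
{
	size_t i;

	assert(groups != NULL || ngroups == 0);

	/* Reject oversized lists before looking at any entry. */
	if (ngroups > MODEL_NGROUPS_MAX)
		return -1;

	for (i = 1; i < ngroups; i++)
		if (groups[i - 1] >= groups[i])
			return -1;

	return 0;
}
```

Checking the count first matters because the count itself comes from
the untrusted image.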
---
 checkpoint/namespace.c           |   41 ++++++++++
 checkpoint/objhash.c             |   82 ++++++++++++++++++++
 checkpoint/process.c             |  111 ++++++++++++++++++++++++++-
 include/linux/capability.h       |    6 +-
 include/linux/checkpoint.h       |   12 +++
 include/linux/checkpoint_hdr.h   |   59 ++++++++++++++
 include/linux/checkpoint_types.h |    2 +
 kernel/cred.c                    |  123 +++++++++++++++++++++++++++++
 kernel/groups.c                  |   69 +++++++++++++++++
 kernel/user.c                    |  158 ++++++++++++++++++++++++++++++++++++++
 kernel/user_namespace.c          |   89 +++++++++++++++++++++
 11 files changed, 746 insertions(+), 6 deletions(-)

diff --git a/checkpoint/namespace.c b/checkpoint/namespace.c
index 49b8f0a..89af2c0 100644
--- a/checkpoint/namespace.c
+++ b/checkpoint/namespace.c
@@ -98,3 +98,44 @@ void *restore_uts_ns(struct ckpt_ctx *ctx)
 {
 	return (void *) do_restore_uts_ns(ctx);
 }
+
+/*
+ * user_ns  -  trivial checkpoint/restore for !CONFIG_USER_NS case
+ */
+#ifndef CONFIG_USER_NS
+int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct ckpt_hdr_user_ns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (!h)
+		return -ENOMEM;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+void *restore_userns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_user_ns *h;
+	struct user_namespace *ns;
+
+	/* complain if image contains multiple namespaces */
+	if (ctx->stats.user_ns)
+		return ERR_PTR(-EEXIST);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	if (h->creator_ref)
+		ns = ERR_PTR(-EINVAL);
+	else
+		ns = get_user_ns(current_user_ns());
+
+	ctx->stats.user_ns++;
+	ckpt_hdr_put(ctx, h);
+	return ns;
+}
+#endif
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 15a5caf..f8a3210 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -17,6 +17,7 @@
 #include <linux/fdtable.h>
 #include <linux/sched.h>
 #include <linux/ipc_namespace.h>
+#include <linux/user_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -172,6 +173,51 @@ static int obj_ipc_ns_users(void *ptr)
 	return atomic_read(&((struct ipc_namespace *) ptr)->count);
 }
 
+static int obj_cred_grab(void *ptr)
+{
+	get_cred((struct cred *) ptr);
+	return 0;
+}
+
+static void obj_cred_drop(void *ptr, int lastref)
+{
+	put_cred((struct cred *) ptr);
+}
+
+static int obj_user_grab(void *ptr)
+{
+	struct user_struct *u = ptr;
+	(void) get_uid(u);
+	return 0;
+}
+
+static void obj_user_drop(void *ptr, int lastref)
+{
+	free_uid((struct user_struct *) ptr);
+}
+
+static int obj_userns_grab(void *ptr)
+{
+	get_user_ns((struct user_namespace *) ptr);
+	return 0;
+}
+
+static void obj_userns_drop(void *ptr, int lastref)
+{
+	put_user_ns((struct user_namespace *) ptr);
+}
+
+static int obj_groupinfo_grab(void *ptr)
+{
+	get_group_info((struct group_info *) ptr);
+	return 0;
+}
+
+static void obj_groupinfo_drop(void *ptr, int lastref)
+{
+	put_group_info((struct group_info *) ptr);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -247,6 +293,42 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_ipc_ns,
 		.restore = restore_ipc_ns,
 	},
+	/* user_ns object */
+	{
+		.obj_name = "USER_NS",
+		.obj_type = CKPT_OBJ_USER_NS,
+		.ref_drop = obj_userns_drop,
+		.ref_grab = obj_userns_grab,
+		.checkpoint = checkpoint_userns,
+		.restore = restore_userns,
+	},
+	/* struct cred */
+	{
+		.obj_name = "CRED",
+		.obj_type = CKPT_OBJ_CRED,
+		.ref_drop = obj_cred_drop,
+		.ref_grab = obj_cred_grab,
+		.checkpoint = checkpoint_cred,
+		.restore = restore_cred,
+	},
+	/* user object */
+	{
+		.obj_name = "USER",
+		.obj_type = CKPT_OBJ_USER,
+		.ref_drop = obj_user_drop,
+		.ref_grab = obj_user_grab,
+		.checkpoint = checkpoint_user,
+		.restore = restore_user,
+	},
+	/* struct groupinfo */
+	{
+		.obj_name = "GROUPINFO",
+		.obj_type = CKPT_OBJ_GROUPINFO,
+		.ref_drop = obj_groupinfo_drop,
+		.ref_grab = obj_groupinfo_grab,
+		.checkpoint = checkpoint_groupinfo,
+		.restore = restore_groupinfo,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index b34ee3d..1e79f73 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -18,6 +18,7 @@
 #include <linux/compat.h>
 #include <linux/poll.h>
 #include <linux/utsname.h>
+#include <linux/user_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/syscalls.h>
@@ -136,6 +137,45 @@ static int checkpoint_task_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int checkpoint_task_creds(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	int realcred_ref, ecred_ref;
+	struct cred *rcred, *ecred;
+	struct ckpt_hdr_task_creds *h;
+	int ret;
+
+	rcred = (struct cred *) get_cred(t->real_cred);
+	ecred = (struct cred *) get_cred(t->cred);
+
+	realcred_ref = checkpoint_obj(ctx, rcred, CKPT_OBJ_CRED);
+	if (realcred_ref < 0) {
+		ret = realcred_ref;
+		goto error;
+	}
+
+	ecred_ref = checkpoint_obj(ctx, ecred, CKPT_OBJ_CRED);
+	if (ecred_ref < 0) {
+		ret = ecred_ref;
+		goto error;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_CREDS);
+	if (!h) {
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	h->cred_ref = realcred_ref;
+	h->ecred_ref = ecred_ref;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+error:
+	put_cred(rcred);
+	put_cred(ecred);
+	return ret;
+}
+
 static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_task_objs *h;
@@ -151,8 +191,12 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	 * restored when it gets to restore, e.g. its memory.
 	 */
 
-	ret = checkpoint_task_ns(ctx, t);
-	ckpt_debug("ns: objref %d\n", ret);
+	ret = checkpoint_task_creds(ctx, t);
+	ckpt_debug("cred: objref %d\n", ret);
+	if (!ret) {
+		ret = checkpoint_task_ns(ctx, t);
+		ckpt_debug("ns: objref %d\n", ret);
+	}
 	if (ret < 0)
 		return ret;
 
@@ -435,6 +479,34 @@ static int restore_task_ns(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_creds(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task_creds *h;
+	struct cred *realcred, *ecred;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_CREDS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	realcred = ckpt_obj_fetch(ctx, h->cred_ref, CKPT_OBJ_CRED);
+	if (IS_ERR(realcred)) {
+		ret = PTR_ERR(realcred);
+		goto out;
+	}
+	ecred = ckpt_obj_fetch(ctx, h->ecred_ref, CKPT_OBJ_CRED);
+	if (IS_ERR(ecred)) {
+		ret = PTR_ERR(ecred);
+		goto out;
+	}
+	ctx->realcred = realcred;
+	ctx->ecred = ecred;
+
+out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 static int restore_task_objs(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_task_objs *h;
@@ -445,7 +517,9 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	 * and because shared objects are restored before they are
 	 * referenced. See comment in checkpoint_task_objs.
 	 */
-	ret = restore_task_ns(ctx);
+	ret = restore_task_creds(ctx);
+	if (!ret)
+		ret = restore_task_ns(ctx);
 	if (ret < 0)
 		return ret;
 
@@ -463,6 +537,33 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_creds(struct ckpt_ctx *ctx)
+{
+	int ret;
+	const struct cred *old;
+	struct cred *rcred, *ecred;
+
+	rcred = ctx->realcred;
+	ecred = ctx->ecred;
+
+	/* commit_creds will take one ref for the eff creds, but
+	 * expects us to hold a ref for the obj creds, so take a
+	 * ref here */
+	get_cred(rcred);
+	ret = commit_creds(rcred);
+	if (ret)
+		return ret;
+
+	if (ecred == rcred)
+		return 0;
+
+	old =  override_creds(ecred); /* override_creds otoh takes new ref */
+	put_cred(old);
+
+	ctx->realcred = ctx->ecred = NULL;
+	return 0;
+}
+
 int restore_restart_block(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_restart_block *h;
@@ -596,6 +697,10 @@ int restore_task(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_cpu(ctx);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = restore_creds(ctx);
+	ckpt_debug("creds: ret %d\n", ret);
  out:
 	return ret;
 }
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 3a74655..2f726f7 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -569,10 +569,10 @@ struct dentry;
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
 struct cred;
-int apply_securebits(unsigned securebits, struct cred *new);
+extern int apply_securebits(unsigned securebits, struct cred *new);
 struct ckpt_capabilities;
-int restore_capabilities(struct ckpt_capabilities *h, struct cred *new);
-void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred * cred);
+extern int restore_capabilities(struct ckpt_capabilities *h, struct cred *new);
+extern void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred *cred);
 
 #endif /* __KERNEL__ */
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 561232d..3dbf188 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -25,6 +25,7 @@
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/ipc_namespace.h>
+#include <linux/user_namespace.h>
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
@@ -192,6 +193,17 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 			       struct ckpt_hdr_file *h);
 
+/* credentials */
+extern int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr);
+extern int checkpoint_user(struct ckpt_ctx *ctx, void *ptr);
+extern int checkpoint_cred(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_groupinfo(struct ckpt_ctx *ctx);
+extern void *restore_user(struct ckpt_ctx *ctx);
+extern void *restore_cred(struct ckpt_ctx *ctx);
+
+extern int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_userns(struct ckpt_ctx *ctx);
+
 /* memory */
 extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cb036e8..3f00bce 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -62,6 +62,11 @@ enum {
 	CKPT_HDR_UTS_NS,
 	CKPT_HDR_IPC_NS,
 	CKPT_HDR_CAPABILITIES,
+	CKPT_HDR_USER_NS,
+	CKPT_HDR_CRED,
+	CKPT_HDR_USER,
+	CKPT_HDR_GROUPINFO,
+	CKPT_HDR_TASK_CREDS,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -111,6 +116,10 @@ enum obj_type {
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_IPC_NS,
+	CKPT_OBJ_USER_NS,
+	CKPT_OBJ_CRED,
+	CKPT_OBJ_USER,
+	CKPT_OBJ_GROUPINFO,
 	CKPT_OBJ_MAX
 };
 
@@ -184,6 +193,11 @@ struct ckpt_hdr_task {
 	__u32 exit_signal;
 	__u32 pdeath_signal;
 
+#ifdef CONFIG_AUDITSYSCALL
+	/* would audit want to track the checkpointed ids,
+	   or (more likely) who actually restarted? */
+#endif
+
 	__u64 set_child_tid;
 	__u64 clear_child_tid;
 
@@ -191,6 +205,7 @@ struct ckpt_hdr_task {
 	__u32 compat_robust_futex_list; /* a compat __user ptr */
 	__u32 robust_futex_head_len;
 	__u64 robust_futex_list; /* a __user ptr */
+
 } __attribute__((aligned(8)));
 
 /* Posix capabilities */
@@ -203,6 +218,50 @@ struct ckpt_capabilities {
 	__u32 padding;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_task_creds {
+	struct ckpt_hdr h;
+	__s32 cred_ref;
+	__s32 ecred_ref;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_cred {
+	struct ckpt_hdr h;
+	__u32 uid, suid, euid, fsuid;
+	__u32 gid, sgid, egid, fsgid;
+	__s32 user_ref;
+	__s32 groupinfo_ref;
+	struct ckpt_capabilities cap_s;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_groupinfo {
+	struct ckpt_hdr h;
+	__u32 ngroups;
+	/*
+	 * This is followed by ngroups __u32s
+	 */
+	__u32 groups[0];
+} __attribute__((aligned(8)));
+
+/*
+ * todo - keyrings and LSM
+ * These may be better done with userspace help though
+ */
+struct ckpt_hdr_user_struct {
+	struct ckpt_hdr h;
+	__u32 uid;
+	__s32 userns_ref;
+} __attribute__((aligned(8)));
+
+/*
+ * The user-struct mostly tracks system resource usage.
+ * Most of its contents therefore will simply be set
+ * correctly as restart opens resources
+ */
+struct ckpt_hdr_user_ns {
+	struct ckpt_hdr h;
+	__s32 creator_ref;
+} __attribute__((aligned(8)));
+
 /* namespaces */
 struct ckpt_hdr_task_ns {
 	struct ckpt_hdr h;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 9632aa4..be45666 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -25,6 +25,7 @@
 struct ckpt_stats {
 	int uts_ns;
 	int ipc_ns;
+	int user_ns;
 };
 
 struct ckpt_ctx {
@@ -70,6 +71,7 @@ struct ckpt_ctx {
 	int active_pid;			/* (next) position in pids array */
 	struct completion complete;	/* container root and other tasks on */
 	wait_queue_head_t waitq;	/* start, end, and restart ordering */
+	struct cred *realcred, *ecred;	/* tmp storage for cred at restart */
 
 	struct ckpt_stats stats;	/* statistics */
 };
diff --git a/kernel/cred.c b/kernel/cred.c
index 5c8db56..9710cae 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -16,6 +16,7 @@
 #include <linux/init_task.h>
 #include <linux/security.h>
 #include <linux/cn_proc.h>
+#include <linux/checkpoint.h>
 #include "cred-internals.h"
 
 static struct kmem_cache *cred_jar;
@@ -703,3 +704,125 @@ int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid)
 	}
 	return -EPERM;
 }
+
+#ifdef CONFIG_CHECKPOINT
+static int do_checkpoint_cred(struct ckpt_ctx *ctx, struct cred *cred)
+{
+	int ret;
+	int groupinfo_ref, user_ref;
+	struct ckpt_hdr_cred *h;
+
+	groupinfo_ref = checkpoint_obj(ctx, cred->group_info,
+					CKPT_OBJ_GROUPINFO);
+	if (groupinfo_ref < 0)
+		return groupinfo_ref;
+	user_ref = checkpoint_obj(ctx, cred->user, CKPT_OBJ_USER);
+	if (user_ref < 0)
+		return user_ref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CRED);
+	if (!h)
+		return -ENOMEM;
+
+	h->uid = cred->uid;
+	h->suid = cred->suid;
+	h->euid = cred->euid;
+	h->fsuid = cred->fsuid;
+
+	h->gid = cred->gid;
+	h->sgid = cred->sgid;
+	h->egid = cred->egid;
+	h->fsgid = cred->fsgid;
+
+	checkpoint_capabilities(&h->cap_s, cred);
+
+	h->user_ref = user_ref;
+	h->groupinfo_ref = groupinfo_ref;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_cred(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_cred(ctx, (struct cred *) ptr);
+}
+
+static struct cred *do_restore_cred(struct ckpt_ctx *ctx)
+{
+	struct cred *cred;
+	struct ckpt_hdr_cred *h;
+	struct user_struct *user;
+	struct group_info *groupinfo;
+	int ret = -EINVAL;
+	uid_t olduid;
+	gid_t oldgid;
+	int i;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CRED);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	cred = prepare_creds();
+	if (!cred)
+		goto error;
+
+
+	/* Do we care if the target user and target group were compatible?
+	 * Probably.  But then, we can't do any setuid without CAP_SETUID,
+	 * so we must have been privileged to abuse it... */
+	groupinfo = ckpt_obj_fetch(ctx, h->groupinfo_ref, CKPT_OBJ_GROUPINFO);
+	if (IS_ERR(groupinfo))
+		goto err_putcred;
+	user = ckpt_obj_fetch(ctx, h->user_ref, CKPT_OBJ_USER);
+	if (IS_ERR(user))
+		goto err_putcred;
+
+	/*
+	 * TODO: this check should  go into the common helper in
+	 * kernel/sys.c, and should account for user namespaces
+	 */
+	if (!capable(CAP_SETGID))
+		for (i = 0; i < groupinfo->ngroups; i++) {
+			if (!in_egroup_p(GROUP_AT(groupinfo, i)))
+				goto err_putcred;
+		}
+	ret = set_groups(cred, groupinfo);
+	if (ret < 0)
+		goto err_putcred;
+	free_uid(cred->user);
+	cred->user = get_uid(user);
+	ret = cred_setresuid(cred, h->uid, h->euid, h->suid);
+	if (ret < 0)
+		goto err_putcred;
+	ret = cred_setfsuid(cred, h->fsuid, &olduid);
+	if (olduid != h->fsuid && ret < 0)
+		goto err_putcred;
+	ret = cred_setresgid(cred, h->gid, h->egid, h->sgid);
+	if (ret < 0)
+		goto err_putcred;
+	ret = cred_setfsgid(cred, h->fsgid, &oldgid);
+	if (oldgid != h->fsgid && ret < 0)
+		goto err_putcred;
+	ret = restore_capabilities(&h->cap_s, cred);
+	if (ret)
+		goto err_putcred;
+
+	ckpt_hdr_put(ctx, h);
+	return cred;
+
+err_putcred:
+	abort_creds(cred);
+error:
+	ckpt_hdr_put(ctx, h);
+	return ERR_PTR(ret);
+}
+
+void *restore_cred(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_cred(ctx);
+}
+
+#endif
diff --git a/kernel/groups.c b/kernel/groups.c
index 2b45b2e..3612c3e 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -6,6 +6,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/checkpoint.h>
 #include <asm/uaccess.h>
 
 /* init to 2 - one for init_task, one to ensure it is never freed */
@@ -286,3 +287,71 @@ int in_egroup_p(gid_t grp)
 }
 
 EXPORT_SYMBOL(in_egroup_p);
+
+#ifdef CONFIG_CHECKPOINT
+static int do_checkpoint_groupinfo(struct ckpt_ctx *ctx, struct group_info *g)
+{
+	int ret, i, size;
+	struct ckpt_hdr_groupinfo *h;
+
+	size = sizeof(*h) + g->ngroups * sizeof(__u32);
+	h = ckpt_hdr_get_type(ctx, size, CKPT_HDR_GROUPINFO);
+	if (!h)
+		return -ENOMEM;
+
+	h->ngroups = g->ngroups;
+	for (i = 0; i < g->ngroups; i++)
+		h->groups[i] = GROUP_AT(g, i);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_groupinfo(ctx, (struct group_info *)ptr);
+}
+
+/*
+ * TODO - switch to reading in smaller blocks?
+ */
+#define MAX_GROUPINFO_SIZE (sizeof(*h)+NGROUPS_MAX*sizeof(gid_t))
+static struct group_info *do_restore_groupinfo(struct ckpt_ctx *ctx)
+{
+	struct group_info *g;
+	struct ckpt_hdr_groupinfo *h;
+	int i;
+
+	h = ckpt_read_buf_type(ctx, MAX_GROUPINFO_SIZE, CKPT_HDR_GROUPINFO);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	g = ERR_PTR(-EINVAL);
+	if (h->ngroups > NGROUPS_MAX)
+		goto out;
+
+	for (i = 1; i < h->ngroups; i++)
+		if (h->groups[i-1] >= h->groups[i])
+			goto out;
+
+	g = groups_alloc(h->ngroups);
+	if (!g) {
+		g = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+	for (i = 0; i < h->ngroups; i++)
+		GROUP_AT(g, i) = h->groups[i];
+
+out:
+	ckpt_hdr_put(ctx, h);
+	return g;
+}
+
+void *restore_groupinfo(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_groupinfo(ctx);
+}
+
+#endif
diff --git a/kernel/user.c b/kernel/user.c
index 2c000e7..a535ed6 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -16,6 +16,7 @@
 #include <linux/interrupt.h>
 #include <linux/module.h>
 #include <linux/user_namespace.h>
+#include <linux/checkpoint.h>
 #include "cred-internals.h"
 
 struct user_namespace init_user_ns = {
@@ -508,3 +509,160 @@ static int __init uid_cache_init(void)
 }
 
 module_init(uid_cache_init);
+
+#ifdef CONFIG_CHECKPOINT
+/*
+ * write the user struct
+ * TODO keyring will need to be dumped
+ *
+ * Here is what we're doing.  Remember a task can do clone(CLONE_NEWUSER)
+ * resulting in a cloned task in a new user namespace, with uid 0 in that
+ * new user_ns.  In that case, the parent's user (uid+user_ns) is the
+ * 'creator' of the new user_ns.
+ * Here, we call the user_ns of the ctx->root_task the 'root_ns'.  When we
+ * checkpoint a user-struct, we must store the chain of creators.  We
+ * must not do so recursively, this being the kernel.  In
+ * checkpoint_write_user() we walk and record in memory the list of creators up
+ * to either the latest user_struct which has already been saved, or the
+ * root_ns.  Then we walk that chain backward, writing out the user_ns and
+ * user_struct to the checkpoint image.
+ */
+#define UNSAVED_STRIDE 50
+static int do_checkpoint_user(struct ckpt_ctx *ctx, struct user_struct *u)
+{
+	struct user_namespace *ns, *root_ns;
+	struct ckpt_hdr_user_struct *h;
+	int ns_objref;
+	int ret, i, unsaved_ns_nr = 0;
+	struct user_struct *save_u;
+	struct user_struct **unsaved_creators;
+	int step = 1, size;
+
+	/* if we've already saved the userns, then life is good */
+	ns_objref = ckpt_obj_lookup(ctx, u->user_ns, CKPT_OBJ_USER_NS);
+	if (ns_objref)
+		goto write_user;
+
+	root_ns = task_cred_xxx(ctx->root_task, user)->user_ns;
+
+	if (u->user_ns == root_ns)
+		goto save_last_ns;
+
+	size = UNSAVED_STRIDE*sizeof(struct user_struct *);
+	unsaved_creators = kmalloc(size, GFP_KERNEL);
+	if (!unsaved_creators)
+		return -ENOMEM;
+	save_u = u;
+	do {
+		ns = save_u->user_ns;
+		save_u = ns->creator;
+		if (ckpt_obj_lookup(ctx, save_u, CKPT_OBJ_USER))
+			goto found;
+		unsaved_creators[unsaved_ns_nr++] = save_u;
+		if (unsaved_ns_nr == step * UNSAVED_STRIDE) {
+			step++;
+			size = step*UNSAVED_STRIDE*sizeof(struct user_struct *);
+			unsaved_creators = krealloc(unsaved_creators, size,
+							GFP_KERNEL);
+			if (!unsaved_creators)
+				return -ENOMEM;
+		}
+	} while (ns != root_ns);
+
+found:
+	for (i = unsaved_ns_nr-1; i >= 0; i--) {
+		ret = checkpoint_obj(ctx, unsaved_creators[i], CKPT_OBJ_USER);
+		if (ret < 0) {
+			kfree(unsaved_creators);
+			return ret;
+		}
+	}
+	kfree(unsaved_creators);
+
+save_last_ns:
+	ns_objref = checkpoint_obj(ctx, u->user_ns, CKPT_OBJ_USER_NS);
+	if (ns_objref < 0)
+		return ns_objref;
+
+write_user:
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER);
+	if (!h)
+		return -ENOMEM;
+
+	h->uid = u->uid;
+	h->userns_ref = ns_objref;
+
+	/* write out the user_struct */
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_user(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_user(ctx, (struct user_struct *) ptr);
+}
+
+static int may_setuid(struct user_namespace *ns, uid_t uid)
+{
+	/*
+	 * this next check will one day become
+	 * if capable(CAP_SETUID, ns) return 1;
+	 * followed by uid_equiv(current_userns, current_uid, ns, uid)
+	 * instead of just uids.
+	 */
+	if (capable(CAP_SETUID))
+		return 1;
+
+	/*
+	 * this may be overly strict, but since we might end up
+	 * restarting a privileged program here, we do not want
+	 * someone with only CAP_SYS_ADMIN but no CAP_SETUID to
+	 * be able to create random userids even in a userns he
+	 * created.
+	 */
+	if (current_user()->user_ns != ns)
+		return 0;
+	if (current_uid() == uid ||
+		current_euid() == uid ||
+		current_suid() == uid)
+		return 1;
+	return 0;
+}
+
+static struct user_struct *do_restore_user(struct ckpt_ctx *ctx)
+{
+	struct user_struct *u;
+	struct user_namespace *ns;
+	struct ckpt_hdr_user_struct *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	ns = ckpt_obj_fetch(ctx, h->userns_ref, CKPT_OBJ_USER_NS);
+	if (IS_ERR(ns)) {
+		u = ERR_PTR(PTR_ERR(ns));
+		goto out;
+	}
+
+	if (!may_setuid(ns, h->uid)) {
+		u = ERR_PTR(-EPERM);
+		goto out;
+	}
+	u = alloc_uid(ns, h->uid);
+	if (!u)
+		u = ERR_PTR(-EINVAL);
+
+out:
+	ckpt_hdr_put(ctx, h);
+	return u;
+}
+
+void *restore_user(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_user(ctx);
+}
+
+#endif
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index e624b0f..3a35b50 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -9,6 +9,7 @@
 #include <linux/nsproxy.h>
 #include <linux/slab.h>
 #include <linux/user_namespace.h>
+#include <linux/checkpoint.h>
 #include <linux/cred.h>
 
 static struct user_namespace *_new_user_ns(struct user_struct *creator,
@@ -103,3 +104,91 @@ void free_user_ns(struct kref *kref)
 	schedule_work(&ns->destroyer);
 }
 EXPORT_SYMBOL(free_user_ns);
+
+#ifdef CONFIG_CHECKPOINT
+/*
+ * do_checkpoint_userns() is only called from do_checkpoint_user().
+ * When called, we always know that either:
+ *   1. This is the root_ns (user_ns of the ctx->root_task),
+ *	in which case we set h->creator_ref = 0.
+ * or
+ *   2. The creator has already been written out to the
+ *	checkpoint image (and saved in the objhash)
+ */
+static int do_checkpoint_userns(struct ckpt_ctx *ctx, struct user_namespace *ns)
+{
+	struct ckpt_hdr_user_ns *h;
+	struct user_namespace *root_ns;
+	int creator_ref = 0;
+	int ret;
+
+	root_ns = task_cred_xxx(ctx->root_task, user)->user_ns;
+	if (ns != root_ns) {
+		creator_ref = ckpt_obj_lookup(ctx, ns->creator, CKPT_OBJ_USER);
+		if (!creator_ref)
+			return -EINVAL;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (!h)
+		return -ENOMEM;
+	h->creator_ref = creator_ref;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_userns(ctx, (struct user_namespace *) ptr);
+}
+
+static struct user_namespace *do_restore_userns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_user_ns *h;
+	struct user_namespace *ns;
+	struct user_struct *new_root, *creator;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	if (!h->creator_ref) {
+		ns = get_user_ns(current_user_ns());
+		goto out;
+	}
+
+	creator = ckpt_obj_fetch(ctx, h->creator_ref, CKPT_OBJ_USER);
+	if (IS_ERR(creator)) {
+		ns = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
+	ns = new_user_ns(creator, &new_root);
+	if (IS_ERR(ns))
+		goto out;
+
+	/* ns only referenced from new_root, which we discard below */
+	get_user_ns(ns);
+
+	/* new_user_ns() doesn't bump creator's refcount */
+	get_uid(creator);
+
+	/*
+	 * Free the new root user.  If we actually needed it,
+	 * then it will show up later in the checkpoint image
+	 * The objhash will keep the userns pinned until then.
+	 */
+	free_uid(new_root);
+ out:
+	ctx->stats.user_ns++;
+	ckpt_hdr_put(ctx, h);
+	return ns;
+}
+
+void *restore_userns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_userns(ctx);
+}
+#endif
-- 
1.6.0.4
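
The comment above do_checkpoint_user() in the kernel/user.c hunk
describes walking the chain of creators without recursion and then
writing it out backward. A minimal user-space sketch of that walk
follows. All names here are hypothetical, the "saved" flag replaces
the objhash lookup, and a fixed MAX_DEPTH array replaces the
krealloc()-grown buffer used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 16

struct ns_model;

/* Hypothetical model of a user_struct. */
struct user_model {
	int id;			/* label used to observe write order */
	struct ns_model *ns;	/* namespace this user lives in */
	bool saved;		/* already written to the image? */
};

/* Hypothetical model of a user namespace. */
struct ns_model {
	struct user_model *creator;	/* NULL for the root namespace */
};

/*
 * Store in out[] the ids of the unsaved creators above @u, outermost
 * first, and mark them saved.  Returns the number of ids stored, or -1
 * if the chain is deeper than MAX_DEPTH.
 */
int model_save_creators(struct user_model *u, int *out)
{
	struct user_model *pending[MAX_DEPTH];
	struct user_model *c;
	int n = 0, written = 0;

	assert(u != NULL && u->ns != NULL && out != NULL);

	/* Walk upward iteratively, remembering each unsaved creator. */
	for (c = u->ns->creator; c && !c->saved; c = c->ns->creator) {
		if (n == MAX_DEPTH)
			return -1;
		pending[n++] = c;
	}

	/* Emit the chain backward so a creator precedes its dependents. */
	while (n > 0) {
		c = pending[--n];
		c->saved = true;
		out[written++] = c->id;
	}
	return written;
}
```

A second call for another user in the same namespace stores nothing,
because the walk stops at the first creator that is already saved.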



* [PATCH v18 60/80] c/r: checkpoint and restore task credentials
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

This patch adds the checkpointing and restart of credentials
(uids, gids, and capabilities) to Oren's c/r patchset (on top
of v14).  It goes to great pains to re-use (and define when
needed) common helpers, in order to make sure that as security
code is modified, the cr code will be updated.  Some of the
helpers should still be moved (i.e. _creds() functions should
be in kernel/cred.c).

When building the credentials for the restarted process, I
1. create a new struct cred as a copy of the running task's
cred (using prepare_creds())
2. always authorize any changes to the new struct cred
based on the permissions of current_cred() (not the current
transient state of the new cred).

While this may mean that certain transient_cred1->transient_cred2
states are allowed which otherwise wouldn't be allowed, the
fact remains that current_cred() is allowed to transition to
transient_cred2.

The reconstructed creds are applied to the task at the very
end of the sys_restart call.  This ensures that any objects which
need to be re-created (file, socket, etc) are re-created using
the creds of the task calling sys_restart - preventing an unpriv
user from creating a privileged object, and ensuring that a
root task can restart a process which had started out privileged,
created some privileged objects, then dropped its privilege.

With these patches, the root user can restart checkpoint images
(created by either hallyn or root) of user hallyn's tasks,
resulting in a program owned by hallyn.

Changelog:
	Sep 08: [NTL] discard const from struct cred * where appropriate
	Jun 15: Fix user_ns handling when !CONFIG_USER_NS
		Set creator_ref=0 for root_ns (discard @flags)
		Don't overwrite global user-ns if CONFIG_USER_NS
	Jun 10: Merge with ckpt-v16-dev (Oren Laadan)
	Jun 01: Don't check ordering of groups in group_info, bc
		set_groups() will sort it for us.
	May 28: 1. Restore securebits
		2. Address Alexey's comments: move prototypes out of
		   sched.h, validate ngroups < NGROUPS_MAX, validate
		   groups are sorted, and get rid of ckpt_hdr_cred->version.
		3. remove bogus unused flag RESTORE_CREATE_USERNS
	May 26: Move group, user, userns, creds c/r functions out
		of checkpoint/process.c and into the appropriate files.
	May 26: Define struct ckpt_hdr_task_creds and move task cred
		objref c/r into {checkpoint_restore}_task_shared().
	May 26: Take cred refs around checkpoint_write_creds()
	May 20: Remove the limit on number of groups in groupinfo
		at checkpoint time
	May 20: Remove the depth limit on empty user namespaces
	May 20: Better document checkpoint_user
	May 18: fix more refcounting: if (userns 5, uid 0) had
		no active tasks or child user_namespaces, then
		it shouldn't exist at restart or it, its namespace,
		and its whole chain of creators will be leaked.
	May 14: fix some refcounting:
		1. a new user_ns needs a ref to remain pinned
		   by its root user
		2. current_user_ns needs an extra ref bc objhash
		   drops two on restart
		3. cred needs a ref for the real credentials bc
		   commit_creds eats one ref.
	May 13: folded in fix to userns refcounting.
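
For reference, the sanity checks that do_restore_groupinfo() in this
patch applies to a group list read from an image (ngroups bounded, gids
strictly ascending) can be modelled in userspace roughly as follows.
This is a minimal sketch, not the kernel code: toy_groups_valid(),
toy_groups_example() and TOY_NGROUPS_MAX are hypothetical names, and
the bound merely stands in for NGROUPS_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NGROUPS_MAX 65536	/* assumption: stands in for NGROUPS_MAX */

/*
 * Return 1 if the list is acceptable: no more than TOY_NGROUPS_MAX
 * entries, and every gid strictly greater than the one before it
 * (so the list is sorted and free of duplicates). Return 0 otherwise.
 */
static int toy_groups_valid(const uint32_t *groups, uint32_t ngroups)
{
	uint32_t i;

	if (ngroups > TOY_NGROUPS_MAX)
		return 0;
	for (i = 1; i < ngroups; i++)
		if (groups[i - 1] >= groups[i])
			return 0;
	return 1;
}

int toy_groups_example(void)
{
	const uint32_t sorted[] = { 4, 20, 1000 };
	const uint32_t dup[] = { 4, 4, 1000 };
	const uint32_t unsorted[] = { 20, 4, 1000 };

	assert(toy_groups_valid(sorted, 3));
	assert(!toy_groups_valid(dup, 3));
	assert(!toy_groups_valid(unsorted, 3));
	assert(toy_groups_valid(NULL, 0));	/* empty list: nothing to check */
	assert(!toy_groups_valid(sorted, TOY_NGROUPS_MAX + 1));	/* rejected by the bound alone */
	return 0;
}
```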

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
[orenl@cs.columbia.edu: merge with ckpt-v16-dev]
---
 checkpoint/namespace.c           |   41 ++++++++++
 checkpoint/objhash.c             |   82 ++++++++++++++++++++
 checkpoint/process.c             |  111 ++++++++++++++++++++++++++-
 include/linux/capability.h       |    6 +-
 include/linux/checkpoint.h       |   12 +++
 include/linux/checkpoint_hdr.h   |   59 ++++++++++++++
 include/linux/checkpoint_types.h |    2 +
 kernel/cred.c                    |  123 +++++++++++++++++++++++++++++
 kernel/groups.c                  |   69 +++++++++++++++++
 kernel/user.c                    |  158 ++++++++++++++++++++++++++++++++++++++
 kernel/user_namespace.c          |   89 +++++++++++++++++++++
 11 files changed, 746 insertions(+), 6 deletions(-)

diff --git a/checkpoint/namespace.c b/checkpoint/namespace.c
index 49b8f0a..89af2c0 100644
--- a/checkpoint/namespace.c
+++ b/checkpoint/namespace.c
@@ -98,3 +98,44 @@ void *restore_uts_ns(struct ckpt_ctx *ctx)
 {
 	return (void *) do_restore_uts_ns(ctx);
 }
+
+/*
+ * user_ns  -  trivial checkpoint/restore for !CONFIG_USER_NS case
+ */
+#ifndef CONFIG_USER_NS
+int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct ckpt_hdr_user_ns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (!h)
+		return -ENOMEM;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+void *restore_userns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_user_ns *h;
+	struct user_namespace *ns;
+
+	/* complain if image contains multiple namespaces */
+	if (ctx->stats.user_ns)
+		return ERR_PTR(-EEXIST);
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	if (h->creator_ref)
+		ns = ERR_PTR(-EINVAL);
+	else
+		ns = get_user_ns(current_user_ns());
+
+	ctx->stats.user_ns++;
+	ckpt_hdr_put(ctx, h);
+	return ns;
+}
+#endif
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 15a5caf..f8a3210 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -17,6 +17,7 @@
 #include <linux/fdtable.h>
 #include <linux/sched.h>
 #include <linux/ipc_namespace.h>
+#include <linux/user_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -172,6 +173,51 @@ static int obj_ipc_ns_users(void *ptr)
 	return atomic_read(&((struct ipc_namespace *) ptr)->count);
 }
 
+static int obj_cred_grab(void *ptr)
+{
+	get_cred((struct cred *) ptr);
+	return 0;
+}
+
+static void obj_cred_drop(void *ptr, int lastref)
+{
+	put_cred((struct cred *) ptr);
+}
+
+static int obj_user_grab(void *ptr)
+{
+	struct user_struct *u = ptr;
+	(void) get_uid(u);
+	return 0;
+}
+
+static void obj_user_drop(void *ptr, int lastref)
+{
+	free_uid((struct user_struct *) ptr);
+}
+
+static int obj_userns_grab(void *ptr)
+{
+	get_user_ns((struct user_namespace *) ptr);
+	return 0;
+}
+
+static void obj_userns_drop(void *ptr, int lastref)
+{
+	put_user_ns((struct user_namespace *) ptr);
+}
+
+static int obj_groupinfo_grab(void *ptr)
+{
+	get_group_info((struct group_info *) ptr);
+	return 0;
+}
+
+static void obj_groupinfo_drop(void *ptr, int lastref)
+{
+	put_group_info((struct group_info *) ptr);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -247,6 +293,42 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_ipc_ns,
 		.restore = restore_ipc_ns,
 	},
+	/* user_ns object */
+	{
+		.obj_name = "USER_NS",
+		.obj_type = CKPT_OBJ_USER_NS,
+		.ref_drop = obj_userns_drop,
+		.ref_grab = obj_userns_grab,
+		.checkpoint = checkpoint_userns,
+		.restore = restore_userns,
+	},
+	/* struct cred */
+	{
+		.obj_name = "CRED",
+		.obj_type = CKPT_OBJ_CRED,
+		.ref_drop = obj_cred_drop,
+		.ref_grab = obj_cred_grab,
+		.checkpoint = checkpoint_cred,
+		.restore = restore_cred,
+	},
+	/* user object */
+	{
+		.obj_name = "USER",
+		.obj_type = CKPT_OBJ_USER,
+		.ref_drop = obj_user_drop,
+		.ref_grab = obj_user_grab,
+		.checkpoint = checkpoint_user,
+		.restore = restore_user,
+	},
+	/* struct groupinfo */
+	{
+		.obj_name = "GROUPINFO",
+		.obj_type = CKPT_OBJ_GROUPINFO,
+		.ref_drop = obj_groupinfo_drop,
+		.ref_grab = obj_groupinfo_grab,
+		.checkpoint = checkpoint_groupinfo,
+		.restore = restore_groupinfo,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index b34ee3d..1e79f73 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -18,6 +18,7 @@
 #include <linux/compat.h>
 #include <linux/poll.h>
 #include <linux/utsname.h>
+#include <linux/user_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/syscalls.h>
@@ -136,6 +137,45 @@ static int checkpoint_task_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int checkpoint_task_creds(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	int realcred_ref, ecred_ref;
+	struct cred *rcred, *ecred;
+	struct ckpt_hdr_task_creds *h;
+	int ret;
+
+	rcred = (struct cred *) get_cred(t->real_cred);
+	ecred = (struct cred *) get_cred(t->cred);
+
+	realcred_ref = checkpoint_obj(ctx, rcred, CKPT_OBJ_CRED);
+	if (realcred_ref < 0) {
+		ret = realcred_ref;
+		goto error;
+	}
+
+	ecred_ref = checkpoint_obj(ctx, ecred, CKPT_OBJ_CRED);
+	if (ecred_ref < 0) {
+		ret = ecred_ref;
+		goto error;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_CREDS);
+	if (!h) {
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	h->cred_ref = realcred_ref;
+	h->ecred_ref = ecred_ref;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+error:
+	put_cred(rcred);
+	put_cred(ecred);
+	return ret;
+}
+
 static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_task_objs *h;
@@ -151,8 +191,12 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	 * restored when it gets to restore, e.g. its memory.
 	 */
 
-	ret = checkpoint_task_ns(ctx, t);
-	ckpt_debug("ns: objref %d\n", ret);
+	ret = checkpoint_task_creds(ctx, t);
+	ckpt_debug("cred: objref %d\n", ret);
+	if (!ret) {
+		ret = checkpoint_task_ns(ctx, t);
+		ckpt_debug("ns: objref %d\n", ret);
+	}
 	if (ret < 0)
 		return ret;
 
@@ -435,6 +479,34 @@ static int restore_task_ns(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_creds(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task_creds *h;
+	struct cred *realcred, *ecred;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_CREDS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	realcred = ckpt_obj_fetch(ctx, h->cred_ref, CKPT_OBJ_CRED);
+	if (IS_ERR(realcred)) {
+		ret = PTR_ERR(realcred);
+		goto out;
+	}
+	ecred = ckpt_obj_fetch(ctx, h->ecred_ref, CKPT_OBJ_CRED);
+	if (IS_ERR(ecred)) {
+		ret = PTR_ERR(ecred);
+		goto out;
+	}
+	ctx->realcred = realcred;
+	ctx->ecred = ecred;
+
+out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 static int restore_task_objs(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_task_objs *h;
@@ -445,7 +517,9 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	 * and because shared objects are restored before they are
 	 * referenced. See comment in checkpoint_task_objs.
 	 */
-	ret = restore_task_ns(ctx);
+	ret = restore_task_creds(ctx);
+	if (!ret)
+		ret = restore_task_ns(ctx);
 	if (ret < 0)
 		return ret;
 
@@ -463,6 +537,33 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_creds(struct ckpt_ctx *ctx)
+{
+	int ret;
+	const struct cred *old;
+	struct cred *rcred, *ecred;
+
+	rcred = ctx->realcred;
+	ecred = ctx->ecred;
+
+	/* commit_creds will take one ref for the eff creds, but
+	 * expects us to hold a ref for the obj creds, so take a
+	 * ref here */
+	get_cred(rcred);
+	ret = commit_creds(rcred);
+	if (ret)
+		return ret;
+
+	if (ecred == rcred)
+		return 0;
+
+	old = override_creds(ecred); /* override_creds otoh takes new ref */
+	put_cred(old);
+
+	ctx->realcred = ctx->ecred = NULL;
+	return 0;
+}
+
 int restore_restart_block(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_restart_block *h;
@@ -596,6 +697,10 @@ int restore_task(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_cpu(ctx);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = restore_creds(ctx);
+	ckpt_debug("creds: ret %d\n", ret);
  out:
 	return ret;
 }
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 3a74655..2f726f7 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -569,10 +569,10 @@ struct dentry;
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
 struct cred;
-int apply_securebits(unsigned securebits, struct cred *new);
+extern int apply_securebits(unsigned securebits, struct cred *new);
 struct ckpt_capabilities;
-int restore_capabilities(struct ckpt_capabilities *h, struct cred *new);
-void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred * cred);
+extern int restore_capabilities(struct ckpt_capabilities *h, struct cred *new);
+extern void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred *cred);
 
 #endif /* __KERNEL__ */
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 561232d..3dbf188 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -25,6 +25,7 @@
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/ipc_namespace.h>
+#include <linux/user_namespace.h>
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
@@ -192,6 +193,17 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 			       struct ckpt_hdr_file *h);
 
+/* credentials */
+extern int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr);
+extern int checkpoint_user(struct ckpt_ctx *ctx, void *ptr);
+extern int checkpoint_cred(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_groupinfo(struct ckpt_ctx *ctx);
+extern void *restore_user(struct ckpt_ctx *ctx);
+extern void *restore_cred(struct ckpt_ctx *ctx);
+
+extern int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_userns(struct ckpt_ctx *ctx);
+
 /* memory */
 extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cb036e8..3f00bce 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -62,6 +62,11 @@ enum {
 	CKPT_HDR_UTS_NS,
 	CKPT_HDR_IPC_NS,
 	CKPT_HDR_CAPABILITIES,
+	CKPT_HDR_USER_NS,
+	CKPT_HDR_CRED,
+	CKPT_HDR_USER,
+	CKPT_HDR_GROUPINFO,
+	CKPT_HDR_TASK_CREDS,
 
 	/* 201-299: reserved for arch-dependent */
 
@@ -111,6 +116,10 @@ enum obj_type {
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_IPC_NS,
+	CKPT_OBJ_USER_NS,
+	CKPT_OBJ_CRED,
+	CKPT_OBJ_USER,
+	CKPT_OBJ_GROUPINFO,
 	CKPT_OBJ_MAX
 };
 
@@ -184,6 +193,11 @@ struct ckpt_hdr_task {
 	__u32 exit_signal;
 	__u32 pdeath_signal;
 
+#ifdef CONFIG_AUDITSYSCALL
+	/* would audit want to track the checkpointed ids,
+	   or (more likely) who actually restarted? */
+#endif
+
 	__u64 set_child_tid;
 	__u64 clear_child_tid;
 
@@ -191,6 +205,7 @@ struct ckpt_hdr_task {
 	__u32 compat_robust_futex_list; /* a compat __user ptr */
 	__u32 robust_futex_head_len;
 	__u64 robust_futex_list; /* a __user ptr */
+
 } __attribute__((aligned(8)));
 
 /* Posix capabilities */
@@ -203,6 +218,50 @@ struct ckpt_capabilities {
 	__u32 padding;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_task_creds {
+	struct ckpt_hdr h;
+	__s32 cred_ref;
+	__s32 ecred_ref;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_cred {
+	struct ckpt_hdr h;
+	__u32 uid, suid, euid, fsuid;
+	__u32 gid, sgid, egid, fsgid;
+	__s32 user_ref;
+	__s32 groupinfo_ref;
+	struct ckpt_capabilities cap_s;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_groupinfo {
+	struct ckpt_hdr h;
+	__u32 ngroups;
+	/*
+	 * This is followed by ngroups __u32s
+	 */
+	__u32 groups[0];
+} __attribute__((aligned(8)));
+
+/*
+ * todo - keyrings and LSM
+ * These may be better done with userspace help though
+ */
+struct ckpt_hdr_user_struct {
+	struct ckpt_hdr h;
+	__u32 uid;
+	__s32 userns_ref;
+} __attribute__((aligned(8)));
+
+/*
+ * The user-struct mostly tracks system resource usage.
+ * Most of its contents therefore will simply be set
+ * correctly as restart opens resources
+ */
+struct ckpt_hdr_user_ns {
+	struct ckpt_hdr h;
+	__s32 creator_ref;
+} __attribute__((aligned(8)));
+
 /* namespaces */
 struct ckpt_hdr_task_ns {
 	struct ckpt_hdr h;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 9632aa4..be45666 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -25,6 +25,7 @@
 struct ckpt_stats {
 	int uts_ns;
 	int ipc_ns;
+	int user_ns;
 };
 
 struct ckpt_ctx {
@@ -70,6 +71,7 @@ struct ckpt_ctx {
 	int active_pid;			/* (next) position in pids array */
 	struct completion complete;	/* container root and other tasks on */
 	wait_queue_head_t waitq;	/* start, end, and restart ordering */
+	struct cred *realcred, *ecred;	/* tmp storage for cred at restart */
 
 	struct ckpt_stats stats;	/* statistics */
 };
diff --git a/kernel/cred.c b/kernel/cred.c
index 5c8db56..9710cae 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -16,6 +16,7 @@
 #include <linux/init_task.h>
 #include <linux/security.h>
 #include <linux/cn_proc.h>
+#include <linux/checkpoint.h>
 #include "cred-internals.h"
 
 static struct kmem_cache *cred_jar;
@@ -703,3 +704,125 @@ int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid)
 	}
 	return -EPERM;
 }
+
+#ifdef CONFIG_CHECKPOINT
+static int do_checkpoint_cred(struct ckpt_ctx *ctx, struct cred *cred)
+{
+	int ret;
+	int groupinfo_ref, user_ref;
+	struct ckpt_hdr_cred *h;
+
+	groupinfo_ref = checkpoint_obj(ctx, cred->group_info,
+					CKPT_OBJ_GROUPINFO);
+	if (groupinfo_ref < 0)
+		return groupinfo_ref;
+	user_ref = checkpoint_obj(ctx, cred->user, CKPT_OBJ_USER);
+	if (user_ref < 0)
+		return user_ref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CRED);
+	if (!h)
+		return -ENOMEM;
+
+	h->uid = cred->uid;
+	h->suid = cred->suid;
+	h->euid = cred->euid;
+	h->fsuid = cred->fsuid;
+
+	h->gid = cred->gid;
+	h->sgid = cred->sgid;
+	h->egid = cred->egid;
+	h->fsgid = cred->fsgid;
+
+	checkpoint_capabilities(&h->cap_s, cred);
+
+	h->user_ref = user_ref;
+	h->groupinfo_ref = groupinfo_ref;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_cred(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_cred(ctx, (struct cred *) ptr);
+}
+
+static struct cred *do_restore_cred(struct ckpt_ctx *ctx)
+{
+	struct cred *cred;
+	struct ckpt_hdr_cred *h;
+	struct user_struct *user;
+	struct group_info *groupinfo;
+	int ret = -EINVAL;
+	uid_t olduid;
+	gid_t oldgid;
+	int i;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CRED);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	cred = prepare_creds();
+	if (!cred)
+		goto error;
+
+
+	/* Do we care if the target user and target group were compatible?
+	 * Probably.  But then, we can't do any setuid without CAP_SETUID,
+	 * so we must have been privileged to abuse it... */
+	groupinfo = ckpt_obj_fetch(ctx, h->groupinfo_ref, CKPT_OBJ_GROUPINFO);
+	if (IS_ERR(groupinfo))
+		goto err_putcred;
+	user = ckpt_obj_fetch(ctx, h->user_ref, CKPT_OBJ_USER);
+	if (IS_ERR(user))
+		goto err_putcred;
+
+	/*
+	 * TODO: this check should go into the common helper in
+	 * kernel/sys.c, and should account for user namespaces
+	 */
+	if (!capable(CAP_SETGID))
+		for (i = 0; i < groupinfo->ngroups; i++) {
+			if (!in_egroup_p(GROUP_AT(groupinfo, i)))
+				goto err_putcred;
+		}
+	ret = set_groups(cred, groupinfo);
+	if (ret < 0)
+		goto err_putcred;
+	free_uid(cred->user);
+	cred->user = get_uid(user);
+	ret = cred_setresuid(cred, h->uid, h->euid, h->suid);
+	if (ret < 0)
+		goto err_putcred;
+	ret = cred_setfsuid(cred, h->fsuid, &olduid);
+	if (olduid != h->fsuid && ret < 0)
+		goto err_putcred;
+	ret = cred_setresgid(cred, h->gid, h->egid, h->sgid);
+	if (ret < 0)
+		goto err_putcred;
+	ret = cred_setfsgid(cred, h->fsgid, &oldgid);
+	if (oldgid != h->fsgid && ret < 0)
+		goto err_putcred;
+	ret = restore_capabilities(&h->cap_s, cred);
+	if (ret)
+		goto err_putcred;
+
+	ckpt_hdr_put(ctx, h);
+	return cred;
+
+err_putcred:
+	abort_creds(cred);
+error:
+	ckpt_hdr_put(ctx, h);
+	return ERR_PTR(ret);
+}
+
+void *restore_cred(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_cred(ctx);
+}
+
+#endif
diff --git a/kernel/groups.c b/kernel/groups.c
index 2b45b2e..3612c3e 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -6,6 +6,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/checkpoint.h>
 #include <asm/uaccess.h>
 
 /* init to 2 - one for init_task, one to ensure it is never freed */
@@ -286,3 +287,71 @@ int in_egroup_p(gid_t grp)
 }
 
 EXPORT_SYMBOL(in_egroup_p);
+
+#ifdef CONFIG_CHECKPOINT
+static int do_checkpoint_groupinfo(struct ckpt_ctx *ctx, struct group_info *g)
+{
+	int ret, i, size;
+	struct ckpt_hdr_groupinfo *h;
+
+	size = sizeof(*h) + g->ngroups * sizeof(__u32);
+	h = ckpt_hdr_get_type(ctx, size, CKPT_HDR_GROUPINFO);
+	if (!h)
+		return -ENOMEM;
+
+	h->ngroups = g->ngroups;
+	for (i = 0; i < g->ngroups; i++)
+		h->groups[i] = GROUP_AT(g, i);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_groupinfo(ctx, (struct group_info *)ptr);
+}
+
+/*
+ * TODO - switch to reading in smaller blocks?
+ */
+#define MAX_GROUPINFO_SIZE (sizeof(*h)+NGROUPS_MAX*sizeof(gid_t))
+static struct group_info *do_restore_groupinfo(struct ckpt_ctx *ctx)
+{
+	struct group_info *g;
+	struct ckpt_hdr_groupinfo *h;
+	int i;
+
+	h = ckpt_read_buf_type(ctx, MAX_GROUPINFO_SIZE, CKPT_HDR_GROUPINFO);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	g = ERR_PTR(-EINVAL);
+	if (h->ngroups > NGROUPS_MAX)
+		goto out;
+
+	for (i = 1; i < h->ngroups; i++)
+		if (h->groups[i-1] >= h->groups[i])
+			goto out;
+
+	g = groups_alloc(h->ngroups);
+	if (!g) {
+		g = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+	for (i = 0; i < h->ngroups; i++)
+		GROUP_AT(g, i) = h->groups[i];
+
+out:
+	ckpt_hdr_put(ctx, h);
+	return g;
+}
+
+void *restore_groupinfo(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_groupinfo(ctx);
+}
+
+#endif
diff --git a/kernel/user.c b/kernel/user.c
index 2c000e7..a535ed6 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -16,6 +16,7 @@
 #include <linux/interrupt.h>
 #include <linux/module.h>
 #include <linux/user_namespace.h>
+#include <linux/checkpoint.h>
 #include "cred-internals.h"
 
 struct user_namespace init_user_ns = {
@@ -508,3 +509,160 @@ static int __init uid_cache_init(void)
 }
 
 module_init(uid_cache_init);
+
+#ifdef CONFIG_CHECKPOINT
+/*
+ * write the user struct
+ * TODO keyring will need to be dumped
+ *
+ * Here is what we're doing.  Remember a task can do clone(CLONE_NEWUSER)
+ * resulting in a cloned task in a new user namespace, with uid 0 in that
+ * new user_ns.  In that case, the parent's user (uid+user_ns) is the
+ * 'creator' of the new user_ns.
+ * Here, we call the user_ns of the ctx->root_task the 'root_ns'.  When we
+ * checkpoint a user-struct, we must store the chain of creators.  We
+ * must not do so recursively, this being the kernel.  In
+ * do_checkpoint_user() we walk and record in memory the list of creators up
+ * to either the latest user_struct which has already been saved, or the
+ * root_ns.  Then we walk that chain backward, writing out the user_ns and
+ * user_struct to the checkpoint image.
+ */
+#define UNSAVED_STRIDE 50
+static int do_checkpoint_user(struct ckpt_ctx *ctx, struct user_struct *u)
+{
+	struct user_namespace *ns, *root_ns;
+	struct ckpt_hdr_user_struct *h;
+	int ns_objref;
+	int ret, i, unsaved_ns_nr = 0;
+	struct user_struct *save_u;
+	struct user_struct **unsaved_creators;
+	int step = 1, size;
+
+	/* if we've already saved the userns, then life is good */
+	ns_objref = ckpt_obj_lookup(ctx, u->user_ns, CKPT_OBJ_USER_NS);
+	if (ns_objref)
+		goto write_user;
+
+	root_ns = task_cred_xxx(ctx->root_task, user)->user_ns;
+
+	if (u->user_ns == root_ns)
+		goto save_last_ns;
+
+	size = UNSAVED_STRIDE*sizeof(struct user_struct *);
+	unsaved_creators = kmalloc(size, GFP_KERNEL);
+	if (!unsaved_creators)
+		return -ENOMEM;
+	save_u = u;
+	do {
+		ns = save_u->user_ns;
+		save_u = ns->creator;
+		if (ckpt_obj_lookup(ctx, save_u, CKPT_OBJ_USER))
+			goto found;
+		unsaved_creators[unsaved_ns_nr++] = save_u;
+		if (unsaved_ns_nr == step * UNSAVED_STRIDE) {
+			step++;
+			size = step*UNSAVED_STRIDE*sizeof(struct user_struct *);
+			unsaved_creators = krealloc(unsaved_creators, size,
+							GFP_KERNEL);
+			if (!unsaved_creators)
+				return -ENOMEM;
+		}
+	} while (ns != root_ns);
+
+found:
+	for (i = unsaved_ns_nr-1; i >= 0; i--) {
+		ret = checkpoint_obj(ctx, unsaved_creators[i], CKPT_OBJ_USER);
+		if (ret < 0) {
+			kfree(unsaved_creators);
+			return ret;
+		}
+	}
+	kfree(unsaved_creators);
+
+save_last_ns:
+	ns_objref = checkpoint_obj(ctx, u->user_ns, CKPT_OBJ_USER_NS);
+	if (ns_objref < 0)
+		return ns_objref;
+
+write_user:
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER);
+	if (!h)
+		return -ENOMEM;
+
+	h->uid = u->uid;
+	h->userns_ref = ns_objref;
+
+	/* write out the user_struct */
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_user(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_user(ctx, (struct user_struct *) ptr);
+}
+
+static int may_setuid(struct user_namespace *ns, uid_t uid)
+{
+	/*
+	 * this next check will one day become
+	 * if capable(CAP_SETUID, ns) return 1;
+	 * followed by uid_equiv(current_userns, current_uid, ns, uid)
+	 * instead of just uids.
+	 */
+	if (capable(CAP_SETUID))
+		return 1;
+
+	/*
+	 * this may be overly strict, but since we might end up
+	 * restarting a privileged program here, we do not want
+	 * someone with only CAP_SYS_ADMIN but no CAP_SETUID to
+	 * be able to create random userids even in a userns he
+	 * created.
+	 */
+	if (current_user()->user_ns != ns)
+		return 0;
+	if (current_uid() == uid ||
+		current_euid() == uid ||
+		current_suid() == uid)
+		return 1;
+	return 0;
+}
+
+static struct user_struct *do_restore_user(struct ckpt_ctx *ctx)
+{
+	struct user_struct *u;
+	struct user_namespace *ns;
+	struct ckpt_hdr_user_struct *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	ns = ckpt_obj_fetch(ctx, h->userns_ref, CKPT_OBJ_USER_NS);
+	if (IS_ERR(ns)) {
+		u = ERR_PTR(PTR_ERR(ns));
+		goto out;
+	}
+
+	if (!may_setuid(ns, h->uid)) {
+		u = ERR_PTR(-EPERM);
+		goto out;
+	}
+	u = alloc_uid(ns, h->uid);
+	if (!u)
+		u = ERR_PTR(-EINVAL);
+
+out:
+	ckpt_hdr_put(ctx, h);
+	return u;
+}
+
+void *restore_user(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_user(ctx);
+}
+
+#endif
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index e624b0f..3a35b50 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -9,6 +9,7 @@
 #include <linux/nsproxy.h>
 #include <linux/slab.h>
 #include <linux/user_namespace.h>
+#include <linux/checkpoint.h>
 #include <linux/cred.h>
 
 static struct user_namespace *_new_user_ns(struct user_struct *creator,
@@ -103,3 +104,91 @@ void free_user_ns(struct kref *kref)
 	schedule_work(&ns->destroyer);
 }
 EXPORT_SYMBOL(free_user_ns);
+
+#ifdef CONFIG_CHECKPOINT
+/*
+ * do_checkpoint_userns() is only called from do_checkpoint_user().
+ * When called, we always know that either:
+ *   1. This is the root_ns (user_ns of the ctx->root_task),
+ *	in which case we set h->creator_ref = 0.
+ * or
+ *   2. The creator has already been written out to the
+ *	checkpoint image (and saved in the objhash)
+ */
+static int do_checkpoint_userns(struct ckpt_ctx *ctx, struct user_namespace *ns)
+{
+	struct ckpt_hdr_user_ns *h;
+	struct user_namespace *root_ns;
+	int creator_ref = 0;
+	int ret;
+
+	root_ns = task_cred_xxx(ctx->root_task, user)->user_ns;
+	if (ns != root_ns) {
+		creator_ref = ckpt_obj_lookup(ctx, ns->creator, CKPT_OBJ_USER);
+		if (!creator_ref)
+			return -EINVAL;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (!h)
+		return -ENOMEM;
+	h->creator_ref = creator_ref;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_userns(ctx, (struct user_namespace *) ptr);
+}
+
+static struct user_namespace *do_restore_userns(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_user_ns *h;
+	struct user_namespace *ns;
+	struct user_struct *new_root, *creator;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	if (!h->creator_ref) {
+		ns = get_user_ns(current_user_ns());
+		goto out;
+	}
+
+	creator = ckpt_obj_fetch(ctx, h->creator_ref, CKPT_OBJ_USER);
+	if (IS_ERR(creator)) {
+		ns = ERR_PTR(-EINVAL);
+		goto out;
+	}
+
+	ns = new_user_ns(creator, &new_root);
+	if (IS_ERR(ns))
+		goto out;
+
+	/* ns only referenced from new_root, which we discard below */
+	get_user_ns(ns);
+
+	/* new_user_ns() doesn't bump creator's refcount */
+	get_uid(creator);
+
+	/*
+	 * Free the new root user.  If we actually needed it,
+	 * then it will show up later in the checkpoint image
+	 * The objhash will keep the userns pinned until then.
+	 */
+	free_uid(new_root);
+ out:
+	ctx->stats.user_ns++;
+	ckpt_hdr_put(ctx, h);
+	return ns;
+}
+
+void *restore_userns(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_userns(ctx);
+}
+#endif
-- 
1.6.0.4
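
The creator-chain walk described in the kernel/user.c comment of this
patch (collect unsaved creators going up, then write them out going
back down) can be modelled in userspace roughly as follows. This is a
simplified sketch, not the patch's exact loop: struct toy_user, struct
toy_ns, toy_creator_order() and TOY_MAX_DEPTH are hypothetical, a fixed
array replaces the krealloc()-grown buffer, and "writing" an object is
reduced to recording its id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_ns;

struct toy_user {
	int id;
	struct toy_ns *ns;		/* namespace this user lives in */
	bool saved;			/* already written to the image? */
};

struct toy_ns {
	struct toy_user *creator;	/* NULL for the root namespace */
};

#define TOY_MAX_DEPTH 16		/* assumption: fixed bound for the sketch */

/*
 * Fill out[] with the ids of the creators that must be written before
 * 'u', outermost first, and mark them saved. Returns the number of ids
 * recorded, or -1 if the chain is deeper than TOY_MAX_DEPTH.
 */
static int toy_creator_order(const struct toy_user *u,
			     const struct toy_ns *root_ns, int *out)
{
	struct toy_user *pending[TOY_MAX_DEPTH];
	struct toy_ns *ns = u->ns;
	int n = 0, written = 0, i;

	/* walk up: stop at the root namespace or at a saved creator */
	while (ns != root_ns && !ns->creator->saved) {
		if (n == TOY_MAX_DEPTH)
			return -1;
		pending[n++] = ns->creator;
		ns = ns->creator->ns;
	}
	/* walk back down, no recursion needed: outermost creator first */
	for (i = n - 1; i >= 0; i--) {
		pending[i]->saved = true;
		out[written++] = pending[i]->id;
	}
	return written;
}

int toy_chain_example(void)
{
	struct toy_ns root_ns = { NULL };
	struct toy_user a = { 1, &root_ns, false };
	struct toy_ns ns1 = { &a };		/* a created ns1 */
	struct toy_user b = { 2, &ns1, false };
	struct toy_ns ns2 = { &b };		/* b created ns2 */
	struct toy_user c = { 3, &ns2, false };
	struct toy_user d = { 4, &ns2, false };
	int order[TOY_MAX_DEPTH];

	/* c needs a, then b, written ahead of it */
	assert(toy_creator_order(&c, &root_ns, order) == 2);
	assert(order[0] == 1 && order[1] == 2);
	/* d shares ns2, whose creator is now saved: nothing left to do */
	assert(toy_creator_order(&d, &root_ns, order) == 0);
	return 0;
}
```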


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 61/80] c/r: restore file->f_cred
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (59 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 60/80] c/r: checkpoint and restore task credentials Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 62/80] c/r: checkpoint and restore (shared) task's sighand_struct Oren Laadan
                     ` (20 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

Restore a file's f_cred.  This is set to the cred of the task doing
the open, so often it will be the same as that of the restarted task.
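
The reference handling this implies on restart (the re-opened file
starts out pinning the restarter's cred and must end up pinning the
checkpointed one) can be modelled in userspace roughly as follows. This
is a minimal sketch with hypothetical names (struct toy_cred, struct
toy_file, toy_get(), toy_put(), toy_restore_f_cred()); plain integer
counters stand in for the kernel's cred refcounting. The sketch takes
the new reference before dropping the old one, which is an assumption
of the example rather than the order used in the patch, so that the
case where both creds are the same object stays trivially safe.

```c
#include <assert.h>

/* Hypothetical refcounted cred and a file that pins one. */
struct toy_cred {
	int refs;
};

struct toy_file {
	struct toy_cred *f_cred;
};

static struct toy_cred *toy_get(struct toy_cred *c)
{
	c->refs++;
	return c;
}

static void toy_put(struct toy_cred *c)
{
	assert(c->refs > 0);
	c->refs--;
}

/* Make 'f' pin the checkpointed cred instead of the opener's cred. */
static void toy_restore_f_cred(struct toy_file *f, struct toy_cred *saved)
{
	struct toy_cred *old = f->f_cred;

	f->f_cred = toy_get(saved);	/* file now holds its own ref on 'saved' */
	toy_put(old);			/* release the ref taken at open time */
}

int toy_f_cred_example(void)
{
	struct toy_cred opener = { 2 };	/* held by the task and by the file */
	struct toy_cred saved = { 1 };	/* held by the object hash */
	struct toy_file f = { &opener };

	toy_restore_f_cred(&f, &saved);
	assert(f.f_cred == &saved);
	assert(opener.refs == 1 && saved.refs == 2);

	/* restoring the same cred again leaves the counts unchanged */
	toy_restore_f_cred(&f, &saved);
	assert(saved.refs == 2);
	return 0;
}
```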

Changelog[v1]:
  - [Nathan Lynch] discard const from struct cred * where appropriate

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c             |   18 ++++++++++++++++--
 include/linux/checkpoint_hdr.h |    2 +-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 190c95b..1de89d6 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -147,12 +147,18 @@ static int scan_fds(struct files_struct *files, int **fdtable)
 int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 			   struct ckpt_hdr_file *h)
 {
+	struct cred *f_cred = (struct cred *) file->f_cred;
+
 	h->f_flags = file->f_flags;
 	h->f_mode = file->f_mode;
 	h->f_pos = file->f_pos;
 	h->f_version = file->f_version;
 
-	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+	h->f_credref = checkpoint_obj(ctx, f_cred, CKPT_OBJ_CRED);
+	if (h->f_credref < 0)
+		return h->f_credref;
+
+	/* FIX: need also file->f_owner, etc */
 
 	return 0;
 }
@@ -489,8 +495,16 @@ int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 	fmode_t new_mode = file->f_mode;
 	fmode_t saved_mode = (__force fmode_t) h->f_mode;
 	int ret;
+	struct cred *cred;
+
+	/* FIX: need to restore owner etc */
 
-	/* FIX: need to restore uid, gid, owner etc */
+	/* restore the cred */
+	cred = ckpt_obj_fetch(ctx, h->f_credref, CKPT_OBJ_CRED);
+	if (IS_ERR(cred))
+		return PTR_ERR(cred);
+	put_cred(file->f_cred);
+	file->f_cred = get_cred(cred);
 
 	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
 	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3f00bce..ca24112 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -350,7 +350,7 @@ struct ckpt_hdr_file {
 	__u32 f_type;
 	__u32 f_mode;
 	__u32 f_flags;
-	__u32 _padding;
+	__s32 f_credref;
 	__u64 f_pos;
 	__u64 f_version;
 } __attribute__((aligned(8)));
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 62/80] c/r: checkpoint and restore (shared) task's sighand_struct
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (60 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 61/80] c/r: restore file->f_cred Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 63/80] c/r: [signal 1/4] blocked and template for shared signals Oren Laadan
                     ` (19 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

This patch adds the checkpointing and restart of signal handling
state - 'struct sighand_struct'. Since the contents of this state
only affect userspace, no input validation is required.
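
The save/restore idea can be sketched in user space as follows. This is
only an illustration under stated assumptions, not the kernel code: the
demo_* types and functions are hypothetical, DEMO_NSIG is assumed to be
64, and the "live" struct merely imitates the shape of a per-signal
action. Each entry is flattened into a record of fixed-width fields and
later expanded back.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NSIG 64	/* assumed; plays the role of CKPT_ARCH_NSIG */
#define DEMO_MASK_WORDS (DEMO_NSIG / (8 * sizeof(unsigned long)))

/* Hypothetical stand-in for the in-kernel per-signal action. */
struct demo_live_action {
	void (*handler)(int);
	unsigned long flags;
	void (*restorer)(void);
	unsigned long mask[DEMO_MASK_WORDS];
};

/* Image record: fixed-width fields only, same size on 32- and 64-bit. */
struct demo_image_action {
	uint64_t handler;
	uint64_t flags;
	uint64_t restorer;
	uint8_t mask[DEMO_NSIG / 8];
};

/* Flatten the live table into the image format (checkpoint side). */
static void demo_save(struct demo_image_action *img,
		      const struct demo_live_action *live)
{
	int i;

	for (i = 0; i < DEMO_NSIG; i++) {
		img[i].handler = (uint64_t)(uintptr_t)live[i].handler;
		img[i].flags = live[i].flags;
		img[i].restorer = (uint64_t)(uintptr_t)live[i].restorer;
		memcpy(img[i].mask, live[i].mask, sizeof(img[i].mask));
	}
}

/* Expand the image format back into a live table (restart side). */
static void demo_load(struct demo_live_action *live,
		      const struct demo_image_action *img)
{
	int i;

	for (i = 0; i < DEMO_NSIG; i++) {
		live[i].handler = (void (*)(int))(uintptr_t)img[i].handler;
		live[i].flags = (unsigned long)img[i].flags;
		live[i].restorer = (void (*)(void))(uintptr_t)img[i].restorer;
		memcpy(live[i].mask, img[i].mask, sizeof(live[i].mask));
	}
}
```

The values are copied back without inspection, which matches the
reasoning above: handler addresses and masks are only interpreted on
behalf of the restarted userspace task.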

Add _NSIG to kernel constants saved/tested with image header.

The number of signals (_NSIG) is arch-dependent, but it is defined
under __KERNEL__ and is not visible when compiling userspace code.
Therefore, define a per-arch CKPT_ARCH_NSIG in <asm/checkpoint_hdr.h>.
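
A minimal sketch of the two checks described above, with hypothetical
demo_* names and an assumed value of 64 for both constants: a
build-time guard that keeps the exported constant in step with the
private one, and a restart-time comparison against the value recorded
in the image header.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_ARCH_NSIG   64	/* exported with the image format */
#define DEMO_KERNEL_NSIG 64	/* stand-in for the kernel-private _NSIG */

/* Build-time guard: compilation fails if the two values drift apart. */
#if DEMO_ARCH_NSIG != DEMO_KERNEL_NSIG
#error DEMO_ARCH_NSIG does not match DEMO_KERNEL_NSIG
#endif

/* Hypothetical slice of the constants saved in the image header. */
struct demo_const {
	uint16_t signal_nsig;
};

/* Checkpoint side: record the value this kernel was built with. */
static void demo_fill_const(struct demo_const *h)
{
	h->signal_nsig = DEMO_KERNEL_NSIG;
}

/* Restart side: refuse an image taken with a different value. */
static int demo_check_const(const struct demo_const *h)
{
	if (h->signal_nsig != DEMO_KERNEL_NSIG)
		return -EINVAL;
	return 0;
}
```

The build-time guard catches a mismatch within one kernel tree; the
run-time check catches an image produced by a kernel built with a
different signal count.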

Changelog[v1]:
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 arch/s390/include/asm/checkpoint_hdr.h |    8 ++
 arch/x86/include/asm/checkpoint_hdr.h  |    8 ++
 checkpoint/Makefile                    |    3 +-
 checkpoint/checkpoint.c                |    2 +
 checkpoint/objhash.c                   |   26 +++++
 checkpoint/process.c                   |   19 ++++
 checkpoint/restart.c                   |    3 +
 checkpoint/signal.c                    |  163 ++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h             |    9 ++-
 include/linux/checkpoint_hdr.h         |   22 +++++
 10 files changed, 261 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/signal.c

diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
index ad9449e..1976355 100644
--- a/arch/s390/include/asm/checkpoint_hdr.h
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -82,6 +82,14 @@ struct ckpt_hdr_mm_context {
 	unsigned long asce_limit;
 };
 
+#define CKPT_ARCH_NSIG  64
+#ifdef __KERNEL__
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _SIGCONTEXT_NSIG
+#error CKPT_ARCH_NSIG size is wrong (asm/sigcontext.h and asm/checkpoint_hdr.h)
+#endif
+#endif
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 };
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 0e756b0..1228d1b 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -48,6 +48,14 @@ enum {
 	CKPT_HDR_MM_CONTEXT_LDT,
 };
 
+#define CKPT_ARCH_NSIG  64
+#ifdef __KERNEL__
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _NSIG
+#error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
+#endif
+#endif
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 	/* FIXME: add HAVE_HWFP */
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index bb2c0ca..f8a55df 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -10,4 +10,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	process.o \
 	namespace.o \
 	files.o \
-	memory.o
+	memory.o \
+	signal.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 27eec5c..3460c03 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -288,6 +288,8 @@ static void fill_kernel_const(struct ckpt_const *h)
 	h->task_comm_len = sizeof(tsk->comm);
 	/* mm */
 	h->mm_saved_auxv_len = sizeof(mm->saved_auxv);
+	/* signal */
+	h->signal_nsig = _NSIG;
 	/* uts */
 	h->uts_sysname_len = sizeof(uts->sysname);
 	h->uts_nodename_len = sizeof(uts->nodename);
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index f8a3210..b4034dc 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -125,6 +125,22 @@ static int obj_mm_users(void *ptr)
 	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
 }
 
+static int obj_sighand_grab(void *ptr)
+{
+	atomic_inc(&((struct sighand_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_sighand_drop(void *ptr, int lastref)
+{
+	__cleanup_sighand((struct sighand_struct *) ptr);
+}
+
+static int obj_sighand_users(void *ptr)
+{
+	return atomic_read(&((struct sighand_struct *) ptr)->count);
+}
+
 static int obj_ns_grab(void *ptr)
 {
 	get_nsproxy((struct nsproxy *) ptr);
@@ -263,6 +279,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
+	/* sighand object */
+	{
+		.obj_name = "SIGHAND",
+		.obj_type = CKPT_OBJ_SIGHAND,
+		.ref_drop = obj_sighand_drop,
+		.ref_grab = obj_sighand_grab,
+		.ref_users = obj_sighand_users,
+		.checkpoint = checkpoint_sighand,
+		.restore = restore_sighand,
+	},
 	/* ns object */
 	{
 		.obj_name = "NSPROXY",
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 1e79f73..56f33dd 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -181,6 +181,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	struct ckpt_hdr_task_objs *h;
 	int files_objref;
 	int mm_objref;
+	int sighand_objref;
 	int ret;
 
 	/*
@@ -214,11 +215,19 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return mm_objref;
 	}
 
+	sighand_objref = checkpoint_obj_sighand(ctx, t);
+	ckpt_debug("sighand: objref %d\n", sighand_objref);
+	if (sighand_objref < 0) {
+		ckpt_write_err(ctx, "TE", "sighand_struct", sighand_objref);
+		return sighand_objref;
+	}
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
 	if (!h)
 		return -ENOMEM;
 	h->files_objref = files_objref;
 	h->mm_objref = mm_objref;
+	h->sighand_objref = sighand_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
 
@@ -382,6 +391,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (ret < 0)
 		return ret;
 	ret = ckpt_collect_mm(ctx, t);
+	if (ret < 0)
+		return ret;
+	ret = ckpt_collect_sighand(ctx, t);
 
 	return ret;
 }
@@ -529,10 +541,17 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 
 	ret = restore_obj_file_table(ctx, h->files_objref);
 	ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
+	if (ret < 0)
+		goto out;
 
 	ret = restore_obj_mm(ctx, h->mm_objref);
 	ckpt_debug("mm: ret %d (%p)\n", ret, current->mm);
+	if (ret < 0)
+		goto out;
 
+	ret = restore_obj_sighand(ctx, h->sighand_objref);
+	ckpt_debug("sighand: ret %d (%p)\n", ret, current->sighand);
+ out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 53a759a..73c4e72 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -383,6 +383,9 @@ static int check_kernel_const(struct ckpt_const *h)
 	/* mm */
 	if (h->mm_saved_auxv_len != sizeof(mm->saved_auxv))
 		return -EINVAL;
+	/* signal */
+	if (h->signal_nsig != _NSIG)
+		return -EINVAL;
 	/* uts */
 	if (h->uts_sysname_len != sizeof(uts->sysname))
 		return -EINVAL;
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
new file mode 100644
index 0000000..1aadadd
--- /dev/null
+++ b/checkpoint/signal.c
@@ -0,0 +1,163 @@
+/*
+ *  Checkpoint task signals
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/sched.h>
+#include <linux/signal.h>
+#include <linux/errno.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+static inline void fill_sigset(struct ckpt_sigset *h, sigset_t *sigset)
+{
+	memcpy(&h->sigset, sigset, sizeof(*sigset));
+}
+
+static inline void load_sigset(sigset_t *sigset, struct ckpt_sigset *h)
+{
+	memcpy(sigset, &h->sigset, sizeof(*sigset));
+}
+
+/***********************************************************************
+ * sighand checkpoint/collect/restart
+ */
+
+static int do_checkpoint_sighand(struct ckpt_ctx *ctx,
+				 struct sighand_struct *sighand)
+{
+	struct ckpt_hdr_sighand *h;
+	struct ckpt_sigaction *hh;
+	struct sigaction *sa;
+	int i, ret;
+
+	h = ckpt_hdr_get_type(ctx, _NSIG * sizeof(*hh) + sizeof(*h),
+			      CKPT_HDR_SIGHAND);
+	if (!h)
+		return -ENOMEM;
+
+	hh = h->action;
+	spin_lock_irq(&sighand->siglock);
+	for (i = 0; i < _NSIG; i++) {
+		sa = &sighand->action[i].sa;
+		hh[i]._sa_handler = (unsigned long) sa->sa_handler;
+		hh[i].sa_flags = sa->sa_flags;
+		hh[i].sa_restorer = (unsigned long) sa->sa_restorer;
+		fill_sigset(&hh[i].sa_mask, &sa->sa_mask);
+	}
+	spin_unlock_irq(&sighand->siglock);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_sighand(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_sighand(ctx, (struct sighand_struct *) ptr);
+}
+
+int checkpoint_obj_sighand(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct sighand_struct *sighand;
+	int objref;
+
+	read_lock(&tasklist_lock);
+	sighand = rcu_dereference(t->sighand);
+	atomic_inc(&sighand->count);
+	read_unlock(&tasklist_lock);
+
+	objref = checkpoint_obj(ctx, sighand, CKPT_OBJ_SIGHAND);
+	__cleanup_sighand(sighand);
+
+	return objref;
+}
+
+int ckpt_collect_sighand(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct sighand_struct *sighand;
+	int ret;
+
+	read_lock(&tasklist_lock);
+	sighand = rcu_dereference(t->sighand);
+	atomic_inc(&sighand->count);
+	read_unlock(&tasklist_lock);
+
+	ret = ckpt_obj_collect(ctx, sighand, CKPT_OBJ_SIGHAND);
+	__cleanup_sighand(sighand);
+
+	return ret;
+}
+
+static struct sighand_struct *do_restore_sighand(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_sighand *h;
+	struct ckpt_sigaction *hh;
+	struct sighand_struct *sighand;
+	struct sigaction *sa;
+	int i;
+
+	h = ckpt_read_obj_type(ctx, _NSIG * sizeof(*hh) + sizeof(*h),
+			       CKPT_HDR_SIGHAND);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	sighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
+	if (!sighand) {
+		sighand = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+	atomic_set(&sighand->count, 1);
+
+	hh = h->action;
+	for (i = 0; i < _NSIG; i++) {
+		sa = &sighand->action[i].sa;
+		sa->sa_handler = (void *) (unsigned long) hh[i]._sa_handler;
+		sa->sa_flags = hh[i].sa_flags;
+		sa->sa_restorer = (void *) (unsigned long) hh[i].sa_restorer;
+		load_sigset(&sa->sa_mask, &hh[i].sa_mask);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return sighand;
+}
+
+void *restore_sighand(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_sighand(ctx);
+}
+
+int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref)
+{
+	struct sighand_struct *sighand;
+	struct sighand_struct *old_sighand;
+
+	sighand = ckpt_obj_fetch(ctx, sighand_objref, CKPT_OBJ_SIGHAND);
+	if (IS_ERR(sighand))
+		return PTR_ERR(sighand);
+
+	if (sighand == current->sighand)
+		return 0;
+
+	atomic_inc(&sighand->count);
+
+	/* manipulate tsk->sighand with tasklist lock write-held */
+	write_lock_irq(&tasklist_lock);
+	old_sighand = rcu_dereference(current->sighand);
+	spin_lock(&old_sighand->siglock);
+	rcu_assign_pointer(current->sighand, sighand);
+	spin_unlock(&old_sighand->siglock);
+	write_unlock_irq(&tasklist_lock);
+	__cleanup_sighand(old_sighand);
+
+	return 0;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3dbf188..c0e549e 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -245,6 +245,14 @@ extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
 	 VM_RESERVED | VM_NORESERVE | VM_HUGETLB | VM_NONLINEAR |	\
 	 VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
 
+/* signals */
+extern int checkpoint_obj_sighand(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref);
+
+extern int ckpt_collect_sighand(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_sighand(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_sighand(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -277,7 +285,6 @@ static inline int ckpt_validate_errno(int errno)
 			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
 	} while (0)
 
-
 /* debugging flags */
 #define CKPT_DBASE	0x1		/* anything */
 #define CKPT_DSYS	0x2		/* generic (system) */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ca24112..3d3a105 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -87,6 +87,8 @@ enum {
 	CKPT_HDR_IPC_MSG_MSG,
 	CKPT_HDR_IPC_SEM,
 
+	CKPT_HDR_SIGHAND = 601,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -113,6 +115,7 @@ enum obj_type {
 	CKPT_OBJ_FILE_TABLE,
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
+	CKPT_OBJ_SIGHAND,
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_IPC_NS,
@@ -129,6 +132,8 @@ struct ckpt_const {
 	__u16 task_comm_len;
 	/* mm */
 	__u16 mm_saved_auxv_len;
+	/* signal */
+	__u16 signal_nsig;
 	/* uts */
 	__u16 uts_sysname_len;
 	__u16 uts_nodename_len;
@@ -299,6 +304,7 @@ struct ckpt_hdr_task_objs {
 
 	__s32 files_objref;
 	__s32 mm_objref;
+	__s32 sighand_objref;
 } __attribute__((aligned(8)));
 
 /* restart blocks */
@@ -414,6 +420,22 @@ struct ckpt_hdr_pgarr {
 	__u64 nr_pages;		/* number of pages to saved */
 } __attribute__((aligned(8)));
 
+/* signals */
+struct ckpt_sigset {
+	__u8 sigset[CKPT_ARCH_NSIG / 8];
+} __attribute__((aligned(8)));
+
+struct ckpt_sigaction {
+	__u64 _sa_handler;
+	__u64 sa_flags;
+	__u64 sa_restorer;
+	struct ckpt_sigset sa_mask;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_sighand {
+	struct ckpt_hdr h;
+	struct ckpt_sigaction action[0];
+} __attribute__((aligned(8)));
 
 /* ipc commons */
 struct ckpt_hdr_ipcns {
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 62/80] c/r: checkpoint and restore (shared) task's sighand_struct
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

This patch adds the checkpointing and restart of signal handling
state - 'struct sighand_struct'. Since the contents of this state
only affect userspace, no input validation is required.
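
As an illustration only (not part of this patch), the per-signal
conversion between a native sigaction and a fixed-width image record
can be sketched in userspace C. The names example_sigaction_rec,
example_save_action() and example_load_action() are hypothetical; the
record is loosely modelled on struct ckpt_sigaction and omits
sa_restorer and sa_mask for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdint.h>

/* Hypothetical fixed-width record, loosely modelled on struct ckpt_sigaction. */
struct example_sigaction_rec {
	uint64_t handler;	/* handler address widened to 64 bits */
	uint64_t flags;		/* sa_flags widened to 64 bits */
};

/* Save: widen native fields so the record layout is the same on 32/64 bit. */
static void example_save_action(struct example_sigaction_rec *rec,
				const struct sigaction *sa)
{
	rec->handler = (uint64_t)(uintptr_t)sa->sa_handler;
	rec->flags = (unsigned int)sa->sa_flags;
}

/* Load: narrow the record fields back into the native structure. */
static void example_load_action(struct sigaction *sa,
				const struct example_sigaction_rec *rec)
{
	sa->sa_handler = (void (*)(int))(uintptr_t)rec->handler;
	sa->sa_flags = (int)(unsigned int)rec->flags;
}
```

The values are copied without interpretation in both directions, which
matches the reasoning above: a bogus handler address can only hurt the
restarted process itself.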

Add _NSIG to kernel constants saved/tested with image header.

The number of signals (_NSIG) is arch-dependent, but it is defined
under __KERNEL__ and so is not visible to userspace builds. Therefore,
define a per-arch CKPT_ARCH_NSIG in <asm/checkpoint_hdr.h>.
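
A minimal userspace sketch of the same idea, for illustration only:
a header-visible constant is cross-checked at compile time against
the private one, and then sizes a fixed sigset record. The
EXAMPLE_* macros and example_sigset helpers are hypothetical
stand-ins for CKPT_ARCH_NSIG, _NSIG and struct ckpt_sigset, and the
byte-wise bit layout is an assumption of the sketch, not the kernel's
sigset_t layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins: EXAMPLE_ARCH_NSIG plays the role of
 * CKPT_ARCH_NSIG, EXAMPLE_KERNEL_NSIG that of the kernel-private _NSIG. */
#define EXAMPLE_ARCH_NSIG   64
#define EXAMPLE_KERNEL_NSIG 64

/* Fail the build if the two constants ever drift apart. */
#if EXAMPLE_ARCH_NSIG != EXAMPLE_KERNEL_NSIG
#error EXAMPLE_ARCH_NSIG does not match the kernel signal count
#endif

/* Fixed-size image record: one bit per signal. */
struct example_sigset {
	uint8_t sigset[EXAMPLE_ARCH_NSIG / 8];
};

/* Set the bit for signal number sig (1-based, as with sigaddset()). */
static void example_sigset_add(struct example_sigset *s, int sig)
{
	s->sigset[(sig - 1) / 8] |= (uint8_t)(1u << ((sig - 1) % 8));
}

/* Return 1 if signal number sig is present in the set, 0 otherwise. */
static int example_sigset_has(const struct example_sigset *s, int sig)
{
	return (s->sigset[(sig - 1) / 8] >> ((sig - 1) % 8)) & 1;
}
```

Because the record size depends only on the exported constant, a
userspace tool can parse the image without access to kernel headers.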

Changelog[v1]:
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/s390/include/asm/checkpoint_hdr.h |    8 ++
 arch/x86/include/asm/checkpoint_hdr.h  |    8 ++
 checkpoint/Makefile                    |    3 +-
 checkpoint/checkpoint.c                |    2 +
 checkpoint/objhash.c                   |   26 +++++
 checkpoint/process.c                   |   19 ++++
 checkpoint/restart.c                   |    3 +
 checkpoint/signal.c                    |  163 ++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h             |    9 ++-
 include/linux/checkpoint_hdr.h         |   22 +++++
 10 files changed, 261 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/signal.c

diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
index ad9449e..1976355 100644
--- a/arch/s390/include/asm/checkpoint_hdr.h
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -82,6 +82,14 @@ struct ckpt_hdr_mm_context {
 	unsigned long asce_limit;
 };
 
+#define CKPT_ARCH_NSIG  64
+#ifdef __KERNEL__
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _SIGCONTEXT_NSIG
+#error CKPT_ARCH_NSIG size is wrong (asm/sigcontext.h and asm/checkpoint_hdr.h)
+#endif
+#endif
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 };
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 0e756b0..1228d1b 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -48,6 +48,14 @@ enum {
 	CKPT_HDR_MM_CONTEXT_LDT,
 };
 
+#define CKPT_ARCH_NSIG  64
+#ifdef __KERNEL__
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _NSIG
+#error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
+#endif
+#endif
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 	/* FIXME: add HAVE_HWFP */
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index bb2c0ca..f8a55df 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -10,4 +10,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	process.o \
 	namespace.o \
 	files.o \
-	memory.o
+	memory.o \
+	signal.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 27eec5c..3460c03 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -288,6 +288,8 @@ static void fill_kernel_const(struct ckpt_const *h)
 	h->task_comm_len = sizeof(tsk->comm);
 	/* mm */
 	h->mm_saved_auxv_len = sizeof(mm->saved_auxv);
+	/* signal */
+	h->signal_nsig = _NSIG;
 	/* uts */
 	h->uts_sysname_len = sizeof(uts->sysname);
 	h->uts_nodename_len = sizeof(uts->nodename);
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index f8a3210..b4034dc 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -125,6 +125,22 @@ static int obj_mm_users(void *ptr)
 	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
 }
 
+static int obj_sighand_grab(void *ptr)
+{
+	atomic_inc(&((struct sighand_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_sighand_drop(void *ptr, int lastref)
+{
+	__cleanup_sighand((struct sighand_struct *) ptr);
+}
+
+static int obj_sighand_users(void *ptr)
+{
+	return atomic_read(&((struct sighand_struct *) ptr)->count);
+}
+
 static int obj_ns_grab(void *ptr)
 {
 	get_nsproxy((struct nsproxy *) ptr);
@@ -263,6 +279,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
+	/* sighand object */
+	{
+		.obj_name = "SIGHAND",
+		.obj_type = CKPT_OBJ_SIGHAND,
+		.ref_drop = obj_sighand_drop,
+		.ref_grab = obj_sighand_grab,
+		.ref_users = obj_sighand_users,
+		.checkpoint = checkpoint_sighand,
+		.restore = restore_sighand,
+	},
 	/* ns object */
 	{
 		.obj_name = "NSPROXY",
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 1e79f73..56f33dd 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -181,6 +181,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	struct ckpt_hdr_task_objs *h;
 	int files_objref;
 	int mm_objref;
+	int sighand_objref;
 	int ret;
 
 	/*
@@ -214,11 +215,19 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return mm_objref;
 	}
 
+	sighand_objref = checkpoint_obj_sighand(ctx, t);
+	ckpt_debug("sighand: objref %d\n", sighand_objref);
+	if (sighand_objref < 0) {
+		ckpt_write_err(ctx, "TE", "sighand_struct", sighand_objref);
+		return sighand_objref;
+	}
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
 	if (!h)
 		return -ENOMEM;
 	h->files_objref = files_objref;
 	h->mm_objref = mm_objref;
+	h->sighand_objref = sighand_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
 
@@ -382,6 +391,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	if (ret < 0)
 		return ret;
 	ret = ckpt_collect_mm(ctx, t);
+	if (ret < 0)
+		return ret;
+	ret = ckpt_collect_sighand(ctx, t);
 
 	return ret;
 }
@@ -529,10 +541,17 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 
 	ret = restore_obj_file_table(ctx, h->files_objref);
 	ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
+	if (ret < 0)
+		goto out;
 
 	ret = restore_obj_mm(ctx, h->mm_objref);
 	ckpt_debug("mm: ret %d (%p)\n", ret, current->mm);
+	if (ret < 0)
+		goto out;
 
+	ret = restore_obj_sighand(ctx, h->sighand_objref);
+	ckpt_debug("sighand: ret %d (%p)\n", ret, current->sighand);
+ out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 53a759a..73c4e72 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -383,6 +383,9 @@ static int check_kernel_const(struct ckpt_const *h)
 	/* mm */
 	if (h->mm_saved_auxv_len != sizeof(mm->saved_auxv))
 		return -EINVAL;
+	/* signal */
+	if (h->signal_nsig != _NSIG)
+		return -EINVAL;
 	/* uts */
 	if (h->uts_sysname_len != sizeof(uts->sysname))
 		return -EINVAL;
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
new file mode 100644
index 0000000..1aadadd
--- /dev/null
+++ b/checkpoint/signal.c
@@ -0,0 +1,163 @@
+/*
+ *  Checkpoint task signals
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DSYS
+
+#include <linux/sched.h>
+#include <linux/signal.h>
+#include <linux/errno.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+static inline void fill_sigset(struct ckpt_sigset *h, sigset_t *sigset)
+{
+	memcpy(&h->sigset, sigset, sizeof(*sigset));
+}
+
+static inline void load_sigset(sigset_t *sigset, struct ckpt_sigset *h)
+{
+	memcpy(sigset, &h->sigset, sizeof(*sigset));
+}
+
+/***********************************************************************
+ * sighand checkpoint/collect/restart
+ */
+
+static int do_checkpoint_sighand(struct ckpt_ctx *ctx,
+				 struct sighand_struct *sighand)
+{
+	struct ckpt_hdr_sighand *h;
+	struct ckpt_sigaction *hh;
+	struct sigaction *sa;
+	int i, ret;
+
+	h = ckpt_hdr_get_type(ctx, _NSIG * sizeof(*hh) + sizeof(*h),
+			      CKPT_HDR_SIGHAND);
+	if (!h)
+		return -ENOMEM;
+
+	hh = h->action;
+	spin_lock_irq(&sighand->siglock);
+	for (i = 0; i < _NSIG; i++) {
+		sa = &sighand->action[i].sa;
+		hh[i]._sa_handler = (unsigned long) sa->sa_handler;
+		hh[i].sa_flags = sa->sa_flags;
+		hh[i].sa_restorer = (unsigned long) sa->sa_restorer;
+		fill_sigset(&hh[i].sa_mask, &sa->sa_mask);
+	}
+	spin_unlock_irq(&sighand->siglock);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_sighand(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_sighand(ctx, (struct sighand_struct *) ptr);
+}
+
+int checkpoint_obj_sighand(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct sighand_struct *sighand;
+	int objref;
+
+	read_lock(&tasklist_lock);
+	sighand = rcu_dereference(t->sighand);
+	atomic_inc(&sighand->count);
+	read_unlock(&tasklist_lock);
+
+	objref = checkpoint_obj(ctx, sighand, CKPT_OBJ_SIGHAND);
+	__cleanup_sighand(sighand);
+
+	return objref;
+}
+
+int ckpt_collect_sighand(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct sighand_struct *sighand;
+	int ret;
+
+	read_lock(&tasklist_lock);
+	sighand = rcu_dereference(t->sighand);
+	atomic_inc(&sighand->count);
+	read_unlock(&tasklist_lock);
+
+	ret = ckpt_obj_collect(ctx, sighand, CKPT_OBJ_SIGHAND);
+	__cleanup_sighand(sighand);
+
+	return ret;
+}
+
+static struct sighand_struct *do_restore_sighand(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_sighand *h;
+	struct ckpt_sigaction *hh;
+	struct sighand_struct *sighand;
+	struct sigaction *sa;
+	int i;
+
+	h = ckpt_read_obj_type(ctx, _NSIG * sizeof(*hh) + sizeof(*h),
+			       CKPT_HDR_SIGHAND);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	sighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
+	if (!sighand) {
+		sighand = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+	atomic_set(&sighand->count, 1);
+
+	hh = h->action;
+	for (i = 0; i < _NSIG; i++) {
+		sa = &sighand->action[i].sa;
+		sa->sa_handler = (void *) (unsigned long) hh[i]._sa_handler;
+		sa->sa_flags = hh[i].sa_flags;
+		sa->sa_restorer = (void *) (unsigned long) hh[i].sa_restorer;
+		load_sigset(&sa->sa_mask, &hh[i].sa_mask);
+	}
+ out:
+	ckpt_hdr_put(ctx, h);
+	return sighand;
+}
+
+void *restore_sighand(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_sighand(ctx);
+}
+
+int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref)
+{
+	struct sighand_struct *sighand;
+	struct sighand_struct *old_sighand;
+
+	sighand = ckpt_obj_fetch(ctx, sighand_objref, CKPT_OBJ_SIGHAND);
+	if (IS_ERR(sighand))
+		return PTR_ERR(sighand);
+
+	if (sighand == current->sighand)
+		return 0;
+
+	atomic_inc(&sighand->count);
+
+	/* manipulate tsk->sighand with tasklist lock write-held */
+	write_lock_irq(&tasklist_lock);
+	old_sighand = rcu_dereference(current->sighand);
+	spin_lock(&old_sighand->siglock);
+	rcu_assign_pointer(current->sighand, sighand);
+	spin_unlock(&old_sighand->siglock);
+	write_unlock_irq(&tasklist_lock);
+	__cleanup_sighand(old_sighand);
+
+	return 0;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3dbf188..c0e549e 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -245,6 +245,14 @@ extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
 	 VM_RESERVED | VM_NORESERVE | VM_HUGETLB | VM_NONLINEAR |	\
 	 VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
 
+/* signals */
+extern int checkpoint_obj_sighand(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref);
+
+extern int ckpt_collect_sighand(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_sighand(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_sighand(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -277,7 +285,6 @@ static inline int ckpt_validate_errno(int errno)
 			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
 	} while (0)
 
-
 /* debugging flags */
 #define CKPT_DBASE	0x1		/* anything */
 #define CKPT_DSYS	0x2		/* generic (system) */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ca24112..3d3a105 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -87,6 +87,8 @@ enum {
 	CKPT_HDR_IPC_MSG_MSG,
 	CKPT_HDR_IPC_SEM,
 
+	CKPT_HDR_SIGHAND = 601,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -113,6 +115,7 @@ enum obj_type {
 	CKPT_OBJ_FILE_TABLE,
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
+	CKPT_OBJ_SIGHAND,
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_IPC_NS,
@@ -129,6 +132,8 @@ struct ckpt_const {
 	__u16 task_comm_len;
 	/* mm */
 	__u16 mm_saved_auxv_len;
+	/* signal */
+	__u16 signal_nsig;
 	/* uts */
 	__u16 uts_sysname_len;
 	__u16 uts_nodename_len;
@@ -299,6 +304,7 @@ struct ckpt_hdr_task_objs {
 
 	__s32 files_objref;
 	__s32 mm_objref;
+	__s32 sighand_objref;
 } __attribute__((aligned(8)));
 
 /* restart blocks */
@@ -414,6 +420,22 @@ struct ckpt_hdr_pgarr {
 	__u64 nr_pages;		/* number of pages to saved */
 } __attribute__((aligned(8)));
 
+/* signals */
+struct ckpt_sigset {
+	__u8 sigset[CKPT_ARCH_NSIG / 8];
+} __attribute__((aligned(8)));
+
+struct ckpt_sigaction {
+	__u64 _sa_handler;
+	__u64 sa_flags;
+	__u64 sa_restorer;
+	struct ckpt_sigset sa_mask;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_sighand {
+	struct ckpt_hdr h;
+	struct ckpt_sigaction action[0];
+} __attribute__((aligned(8)));
 
 /* ipc commons */
 struct ckpt_hdr_ipcns {
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 63/80] c/r: [signal 1/4] blocked and template for shared signals
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (61 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 62/80] c/r: checkpoint and restore (shared) task's sighand_struct Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 64/80] c/r: [signal 2/4] checkpoint/restart of rlimit Oren Laadan
                     ` (18 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

This patch adds checkpoint/restart of blocked signals mask
(t->blocked) and a template for shared signals (t->signal).

Because t->signal sharing is tied to threads, we ensure proper sharing
of t->signal (struct signal_struct) for threads only.

Access to t->signal is protected by locking t->sighand->siglock.
Therefore, the usual checkpoint_obj() invoking the callback
checkpoint_signal(ctx, signal) is insufficient because the task
pointer is unavailable.

Instead, handling of t->signal sharing is explicit using helpers
like ckpt_obj_lookup_add(), ckpt_obj_fetch() and ckpt_obj_insert().
The actual state is saved (if needed) _after_ the task_objs data.
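
The "look up, add if missing, and report whether this was the first
sighting" pattern described above can be sketched in plain C as
follows. This is an illustration under stated assumptions, not the
objhash implementation: example_objtab and example_lookup_add() are
hypothetical names, and a linear array stands in for the hash table.

```c
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_MAX_OBJS 16

/* Hypothetical table of already-seen shared objects. */
struct example_objtab {
	const void *ptr[EXAMPLE_MAX_OBJS];
	int count;
};

/*
 * Return a 1-based reference for ptr. *first is set to 1 if ptr was
 * added by this call and to 0 if it was already known. Returns -1
 * (leaving *first untouched) when the table is full.
 */
static int example_lookup_add(struct example_objtab *tab,
			      const void *ptr, int *first)
{
	int i;

	for (i = 0; i < tab->count; i++) {
		if (tab->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (tab->count == EXAMPLE_MAX_OBJS)
		return -1;
	tab->ptr[tab->count++] = ptr;
	*first = 1;
	return tab->count;
}
```

A caller would record the returned reference in the per-task header
first and write the object's state afterwards only when *first is set,
mirroring the ordering described above.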

To prevent tasks from handling restored signals during restart,
set their mask to block all signals and only restore the original
mask at the very end (before the last sync point).

Introduce per-task pointer 'ckpt_data' to temporarily store data
for restore actions that are deferred to the end (like restoring
the signal block mask).

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Louis Rilling <Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org>
---
 checkpoint/objhash.c           |    7 +++
 checkpoint/process.c           |   64 ++++++++++++++++++++++++-
 checkpoint/signal.c            |  104 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h     |    6 ++
 include/linux/checkpoint_hdr.h |   14 +++++-
 5 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index b4034dc..bf2f761 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -289,6 +289,13 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_sighand,
 		.restore = restore_sighand,
 	},
+	/* signal object */
+	{
+		.obj_name = "SIGNAL",
+		.obj_type = CKPT_OBJ_SIGNAL,
+		.ref_drop = obj_no_drop,
+		.ref_grab = obj_no_grab,
+	},
 	/* ns object */
 	{
 		.obj_name = "NSPROXY",
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 56f33dd..e596e2a 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -182,7 +182,8 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	int files_objref;
 	int mm_objref;
 	int sighand_objref;
-	int ret;
+	int signal_objref;
+	int first, ret;
 
 	/*
 	 * Shared objects may have dependencies among them: task->mm
@@ -222,14 +223,36 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return sighand_objref;
 	}
 
+	/*
+	 * Handle t->signal differently because the checkpoint method
+	 * for t->signal needs access to owning task_struct to access
+	 * t->sighand (to lock/unlock). First explicitly determine if
+	 * need to save, and only below invoke checkpoint_obj_signal()
+	 * if needed.
+	 */
+	signal_objref = ckpt_obj_lookup_add(ctx, t->signal,
+					    CKPT_OBJ_SIGNAL, &first);
+	ckpt_debug("signal: objref %d\n", signal_objref);
+	if (signal_objref < 0)
+		return signal_objref;
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
 	if (!h)
 		return -ENOMEM;
 	h->files_objref = files_objref;
 	h->mm_objref = mm_objref;
 	h->sighand_objref = sighand_objref;
+	h->signal_objref = signal_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	/* actually save t->signal, if need to */
+	if (first)
+		ret = checkpoint_obj_signal(ctx, t);
+	if (ret < 0)
+		ckpt_write_err(ctx, "TE", "signal_struct", ret);
 
 	return ret;
 }
@@ -375,6 +398,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task_signal(ctx, t);
+	ckpt_debug("task-signal %d\n", ret);
  out:
 	ctx->tsk = NULL;
 	return ret;
@@ -551,6 +578,11 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 
 	ret = restore_obj_sighand(ctx, h->sighand_objref);
 	ckpt_debug("sighand: ret %d (%p)\n", ret, current->sighand);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_obj_signal(ctx, h->signal_objref);
+	ckpt_debug("signal: ret %d (%p)\n", ret, current->signal);
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
@@ -688,11 +720,37 @@ int restore_restart_block(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* pre_restore_task - prepare the task for restore */
+static int pre_restore_task(struct ckpt_ctx *ctx)
+{
+	sigset_t sigset;
+
+	/*
+	 * Block task's signals to avoid interruptions due to signals,
+	 * say, from restored timers, file descriptors etc. Signals
+	 * will be unblocked when restore completes.
+	 *
+	 * NOTE: tasks with file descriptors set to send a SIGKILL as
+	 * i/o notification may fail the restart if a signal occurs
+	 * before that task completed its restore. FIX ?
+	 */
+	sigfillset(&sigset);
+	sigdelset(&sigset, SIGKILL);
+	sigdelset(&sigset, SIGSTOP);
+	sigprocmask(SIG_SETMASK, &sigset, NULL);
+
+	return 0;
+}
+
 /* read the entire state of the current task */
 int restore_task(struct ckpt_ctx *ctx)
 {
 	int ret;
 
+	ret = pre_restore_task(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = restore_task_struct(ctx);
 	ckpt_debug("task %d\n", ret);
 	if (ret < 0)
@@ -720,6 +778,10 @@ int restore_task(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_creds(ctx);
 	ckpt_debug("creds: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_task_signal(ctx);
  out:
 	return ret;
 }
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 1aadadd..3fac75c 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -161,3 +161,107 @@ int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref)
 
 	return 0;
 }
+
+/***********************************************************************
+ * signal checkpoint/restart
+ */
+
+static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_signal *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
+	if (!h)
+		return -ENOMEM;
+
+	/* fill in later */
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int checkpoint_obj_signal(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	BUG_ON(t->flags & PF_EXITING);
+	return checkpoint_signal(ctx, t);
+}
+
+static int restore_signal(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_signal *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* fill in later */
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref)
+{
+	struct signal_struct *signal;
+	int ret = 0;
+
+	signal = ckpt_obj_fetch(ctx, signal_objref, CKPT_OBJ_SIGNAL);
+	if (!IS_ERR(signal)) {
+		/*
+		 * signal_struct is already shared properly as it is
+		 * tied to thread groups. Since thread relationships
+		 * are already restored now, t->signal must match.
+		 */
+		if (signal != current->signal)
+			ret = -EINVAL;
+	} else if (PTR_ERR(signal) == -EINVAL) {
+		/* first time: add to hash and restore our t->signal */
+		ret = ckpt_obj_insert(ctx, current->signal,
+				      signal_objref, CKPT_OBJ_SIGNAL);
+		if (ret >= 0)
+			ret = restore_signal(ctx);
+	} else {
+		ret = PTR_ERR(signal);
+	}
+
+	return ret;
+}
+
+int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_signal_task *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL_TASK);
+	if (!h)
+		return -ENOMEM;
+
+	fill_sigset(&h->blocked, &t->blocked);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int restore_task_signal(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_signal_task *h;
+	sigset_t blocked;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL_TASK);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	load_sigset(&blocked, &h->blocked);
+	/* silently remove SIGKILL, SIGSTOP */
+	sigdelset(&blocked, SIGKILL);
+	sigdelset(&blocked, SIGSTOP);
+
+	sigprocmask(SIG_SETMASK, &blocked, NULL);
+	recalc_sigpending();
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index c0e549e..ec98a43 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -253,6 +253,12 @@ extern int ckpt_collect_sighand(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_sighand(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_sighand(struct ckpt_ctx *ctx);
 
+extern int checkpoint_obj_signal(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref);
+
+extern int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_task_signal(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3d3a105..ee949b5 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -88,6 +88,8 @@ enum {
 	CKPT_HDR_IPC_SEM,
 
 	CKPT_HDR_SIGHAND = 601,
+	CKPT_HDR_SIGNAL,
+	CKPT_HDR_SIGNAL_TASK,
 
 	CKPT_HDR_TAIL = 9001,
 
@@ -116,6 +118,7 @@ enum obj_type {
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
 	CKPT_OBJ_SIGHAND,
+	CKPT_OBJ_SIGNAL,
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_IPC_NS,
@@ -210,7 +213,6 @@ struct ckpt_hdr_task {
 	__u32 compat_robust_futex_list; /* a compat __user ptr */
 	__u32 robust_futex_head_len;
 	__u64 robust_futex_list; /* a __user ptr */
-
 } __attribute__((aligned(8)));
 
 /* Posix capabilities */
@@ -305,6 +307,7 @@ struct ckpt_hdr_task_objs {
 	__s32 files_objref;
 	__s32 mm_objref;
 	__s32 sighand_objref;
+	__s32 signal_objref;
 } __attribute__((aligned(8)));
 
 /* restart blocks */
@@ -437,6 +440,15 @@ struct ckpt_hdr_sighand {
 	struct ckpt_sigaction action[0];
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_signal {
+	struct ckpt_hdr h;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_signal_task {
+	struct ckpt_hdr h;
+	struct ckpt_sigset blocked;
+} __attribute__((aligned(8)));
+
 /* ipc commons */
 struct ckpt_hdr_ipcns {
 	struct ckpt_hdr h;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 63/80] c/r: [signal 1/4] blocked and template for shared signals
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

This patch adds checkpoint/restart of blocked signals mask
(t->blocked) and a template for shared signals (t->signal).

Because t->signal sharing is tied to threads, we ensure proper sharing
of t->signal (struct signal_struct) for threads only.

Access to t->signal is protected by locking t->sighand->siglock.
Therefore, the usual checkpoint_obj() invoking the callback
checkpoint_signal(ctx, signal) is insufficient because the task
pointer is unavailable.

Instead, handling of t->signal sharing is explicit using helpers
like ckpt_obj_lookup_add(), ckpt_obj_fetch() and ckpt_obj_insert().
The actual state is saved (if needed) _after_ the task_objs data.

To prevent tasks from handling restored signals during restart,
set their mask to block all signals and only restore the original
mask at the very end (before the last sync point).

Introduce per-task pointer 'ckpt_data' to temporarily store data
for restore actions that are deferred to the end (like restoring
the signal block mask).

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Louis Rilling <Louis.Rilling@kerlabs.com>
---
 checkpoint/objhash.c           |    7 +++
 checkpoint/process.c           |   64 ++++++++++++++++++++++++-
 checkpoint/signal.c            |  104 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h     |    6 ++
 include/linux/checkpoint_hdr.h |   14 +++++-
 5 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index b4034dc..bf2f761 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -289,6 +289,13 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_sighand,
 		.restore = restore_sighand,
 	},
+	/* signal object */
+	{
+		.obj_name = "SIGNAL",
+		.obj_type = CKPT_OBJ_SIGNAL,
+		.ref_drop = obj_no_drop,
+		.ref_grab = obj_no_grab,
+	},
 	/* ns object */
 	{
 		.obj_name = "NSPROXY",
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 56f33dd..e596e2a 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -182,7 +182,8 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	int files_objref;
 	int mm_objref;
 	int sighand_objref;
-	int ret;
+	int signal_objref;
+	int first, ret;
 
 	/*
 	 * Shared objects may have dependencies among them: task->mm
@@ -222,14 +223,36 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return sighand_objref;
 	}
 
+	/*
+	 * Handle t->signal differently because the checkpoint method
+	 * for t->signal needs access to owning task_struct to access
+	 * t->sighand (to lock/unlock). First explicitly determine if
+	 * need to save, and only below invoke checkpoint_obj_signal()
+	 * if needed.
+	 */
+	signal_objref = ckpt_obj_lookup_add(ctx, t->signal,
+					    CKPT_OBJ_SIGNAL, &first);
+	ckpt_debug("signal: objref %d\n", signal_objref);
+	if (signal_objref < 0)
+		return signal_objref;
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
 	if (!h)
 		return -ENOMEM;
 	h->files_objref = files_objref;
 	h->mm_objref = mm_objref;
 	h->sighand_objref = sighand_objref;
+	h->signal_objref = signal_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	/* actually save t->signal, if need to */
+	if (first)
+		ret = checkpoint_obj_signal(ctx, t);
+	if (ret < 0)
+		ckpt_write_err(ctx, "TE", "signal_struct", ret);
 
 	return ret;
 }
@@ -375,6 +398,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task_signal(ctx, t);
+	ckpt_debug("task-signal %d\n", ret);
  out:
 	ctx->tsk = NULL;
 	return ret;
@@ -551,6 +578,11 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 
 	ret = restore_obj_sighand(ctx, h->sighand_objref);
 	ckpt_debug("sighand: ret %d (%p)\n", ret, current->sighand);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_obj_signal(ctx, h->signal_objref);
+	ckpt_debug("signal: ret %d (%p)\n", ret, current->signal);
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
@@ -688,11 +720,37 @@ int restore_restart_block(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* pre_restore_task - prepare the task for restore */
+static int pre_restore_task(struct ckpt_ctx *ctx)
+{
+	sigset_t sigset;
+
+	/*
+	 * Block task's signals to avoid interruptions due to signals,
+	 * say, from restored timers, file descriptors etc. Signals
+	 * will be unblocked when restore completes.
+	 *
+	 * NOTE: tasks with file descriptors set to send a SIGKILL as
+	 * i/o notification may fail the restart if a signal occurs
+	 * before that task completed its restore. FIX ?
+	 */
+	sigfillset(&sigset);
+	sigdelset(&sigset, SIGKILL);
+	sigdelset(&sigset, SIGSTOP);
+	sigprocmask(SIG_SETMASK, &sigset, NULL);
+
+	return 0;
+}
+
 /* read the entire state of the current task */
 int restore_task(struct ckpt_ctx *ctx)
 {
 	int ret;
 
+	ret = pre_restore_task(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = restore_task_struct(ctx);
 	ckpt_debug("task %d\n", ret);
 	if (ret < 0)
@@ -720,6 +778,10 @@ int restore_task(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_creds(ctx);
 	ckpt_debug("creds: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_task_signal(ctx);
  out:
 	return ret;
 }
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 1aadadd..3fac75c 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -161,3 +161,107 @@ int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref)
 
 	return 0;
 }
+
+/***********************************************************************
+ * signal checkpoint/restart
+ */
+
+static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_signal *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
+	if (!h)
+		return -ENOMEM;
+
+	/* fill in later */
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int checkpoint_obj_signal(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	BUG_ON(t->flags & PF_EXITING);
+	return checkpoint_signal(ctx, t);
+}
+
+static int restore_signal(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_signal *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* fill in later */
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref)
+{
+	struct signal_struct *signal;
+	int ret = 0;
+
+	signal = ckpt_obj_fetch(ctx, signal_objref, CKPT_OBJ_SIGNAL);
+	if (!IS_ERR(signal)) {
+		/*
+		 * signal_struct is already shared properly as it is
+		 * tied to thread groups. Since thread relationships
+		 * are already restored now, t->signal must match.
+		 */
+		if (signal != current->signal)
+			ret = -EINVAL;
+	} else if (PTR_ERR(signal) == -EINVAL) {
+		/* first timer: add to hash and restore our t->signal */
+		ret = ckpt_obj_insert(ctx, current->signal,
+				      signal_objref, CKPT_OBJ_SIGNAL);
+		if (ret >= 0)
+			ret = restore_signal(ctx);
+	} else {
+		ret = PTR_ERR(signal);
+	}
+
+	return ret;
+}
+
+int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_signal_task *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL_TASK);
+	if (!h)
+		return -ENOMEM;
+
+	fill_sigset(&h->blocked, &t->blocked);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int restore_task_signal(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_signal_task *h;
+	sigset_t blocked;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL_TASK);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	load_sigset(&blocked, &h->blocked);
+	/* silently remove SIGKILL, SIGSTOP */
+	sigdelset(&blocked, SIGKILL);
+	sigdelset(&blocked, SIGSTOP);
+
+	sigprocmask(SIG_SETMASK, &blocked, NULL);
+	recalc_sigpending();
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index c0e549e..ec98a43 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -253,6 +253,12 @@ extern int ckpt_collect_sighand(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_sighand(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_sighand(struct ckpt_ctx *ctx);
 
+extern int checkpoint_obj_signal(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref);
+
+extern int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_task_signal(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3d3a105..ee949b5 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -88,6 +88,8 @@ enum {
 	CKPT_HDR_IPC_SEM,
 
 	CKPT_HDR_SIGHAND = 601,
+	CKPT_HDR_SIGNAL,
+	CKPT_HDR_SIGNAL_TASK,
 
 	CKPT_HDR_TAIL = 9001,
 
@@ -116,6 +118,7 @@ enum obj_type {
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
 	CKPT_OBJ_SIGHAND,
+	CKPT_OBJ_SIGNAL,
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_IPC_NS,
@@ -210,7 +213,6 @@ struct ckpt_hdr_task {
 	__u32 compat_robust_futex_list; /* a compat __user ptr */
 	__u32 robust_futex_head_len;
 	__u64 robust_futex_list; /* a __user ptr */
-
 } __attribute__((aligned(8)));
 
 /* Posix capabilities */
@@ -305,6 +307,7 @@ struct ckpt_hdr_task_objs {
 	__s32 files_objref;
 	__s32 mm_objref;
 	__s32 sighand_objref;
+	__s32 signal_objref;
 } __attribute__((aligned(8)));
 
 /* restart blocks */
@@ -437,6 +440,15 @@ struct ckpt_hdr_sighand {
 	struct ckpt_sigaction action[0];
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_signal {
+	struct ckpt_hdr h;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_signal_task {
+	struct ckpt_hdr h;
+	struct ckpt_sigset blocked;
+} __attribute__((aligned(8)));
+
 /* ipc commons */
 struct ckpt_hdr_ipcns {
 	struct ckpt_hdr h;
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 63/80] c/r: [signal 1/4] blocked and template for shared signals
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

This patch adds checkpoint/restart of the blocked-signals mask
(t->blocked) and a template for shared signals (t->signal).

Because t->signal sharing is tied to threads, we ensure proper sharing
of t->signal (struct signal_struct) for threads only.

Access to t->signal is protected by locking t->sighand->lock.
Therefore, the usual checkpoint_obj() invoking the callback
checkpoint_signal(ctx, signal) is insufficient because the task
pointer is unavailable.

Instead, handling of t->signal sharing is explicit using helpers
like ckpt_obj_lookup_add(), ckpt_obj_fetch() and ckpt_obj_insert().
The actual state is saved (if needed) _after_ the task_objs data.
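
As an illustration of the explicit sharing scheme described above, here
is a minimal userspace sketch in C. The demo_* names and the fixed-size
array are hypothetical stand-ins for the object hash; this is not the
code behind ckpt_obj_lookup_add(), ckpt_obj_fetch() or ckpt_obj_insert().
It only shows the idea: the first task that references an object gets
first == 1 and must save the state, later tasks only record the objref.

```c
#include <stddef.h>

#define DEMO_MAX_OBJS 64

/* Hypothetical stand-in for the checkpoint object hash. */
struct demo_objhash {
	const void *ptr[DEMO_MAX_OBJS];	/* slot k holds objref k + 1 */
	int count;			/* used on the checkpoint side only */
};

/*
 * Checkpoint side: return the objref for ptr, adding it if it has not
 * been seen yet. *first is set to 1 only when the object was just added,
 * which tells the caller that it still has to write the object's state.
 */
static int demo_lookup_add(struct demo_objhash *h, const void *ptr, int *first)
{
	int i;

	for (i = 0; i < h->count; i++) {
		if (h->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (h->count == DEMO_MAX_OBJS)
		return -1;
	h->ptr[h->count++] = ptr;
	*first = 1;
	return h->count;
}

/* Restart side: look up a previously inserted objref, or NULL if unknown. */
static const void *demo_fetch(const struct demo_objhash *h, int objref)
{
	if (objref < 1 || objref > DEMO_MAX_OBJS)
		return NULL;
	return h->ptr[objref - 1];
}

/* Restart side: bind objref to ptr the first time that objref is read. */
static int demo_insert(struct demo_objhash *h, const void *ptr, int objref)
{
	if (objref < 1 || objref > DEMO_MAX_OBJS || h->ptr[objref - 1])
		return -1;
	h->ptr[objref - 1] = ptr;
	return 0;
}
```

In this sketch a checkpoint-side caller writes the object's state only
when 'first' comes back set. On the restart side a failed fetch means the
objref is being seen for the first time, so the caller inserts its own
object and then reads the saved state.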

To prevent tasks from handling restored signals during restart,
set their mask to block all signals and only restore the original
mask at the very end (before the last sync point).
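
The block-then-restore sequence can be tried out from userspace. The
sketch below is only an analogue built on POSIX sigprocmask(); the
demo_* helpers are hypothetical names and do not exist in the patch.
It mirrors what pre_restore_task() and restore_task_signal() do in the
diff: block everything except SIGKILL and SIGSTOP, then later install
the saved mask with those two signals silently removed.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Block every signal that can be blocked. SIGKILL and SIGSTOP are
 * dropped explicitly to mirror pre_restore_task(); the kernel would
 * refuse to block them in any case.
 */
static int demo_block_all(void)
{
	sigset_t all;

	sigfillset(&all);
	sigdelset(&all, SIGKILL);
	sigdelset(&all, SIGSTOP);
	return sigprocmask(SIG_SETMASK, &all, NULL);
}

/*
 * Install the mask recorded at checkpoint time, after silently removing
 * SIGKILL and SIGSTOP as restore_task_signal() does.
 */
static int demo_restore_mask(const sigset_t *saved)
{
	sigset_t mask = *saved;

	sigdelset(&mask, SIGKILL);
	sigdelset(&mask, SIGSTOP);
	return sigprocmask(SIG_SETMASK, &mask, NULL);
}

/* Return 1 if signo is currently blocked, 0 if not, -1 on error. */
static int demo_is_blocked(int signo)
{
	sigset_t cur;

	/* A NULL new set leaves the mask unchanged and only reports it. */
	if (sigprocmask(SIG_SETMASK, NULL, &cur) < 0)
		return -1;
	return sigismember(&cur, signo);
}
```

A caller would invoke demo_block_all() before restoring any other state
and demo_restore_mask() as the last step, so that signals raised by
restored objects stay pending until the task is fully rebuilt.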

Introduce a per-task pointer 'ckpt_data' to temporarily store data
for restore actions that are deferred to the end (like restoring
the signal block mask).

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Louis Rilling <Louis.Rilling@kerlabs.com>
---
 checkpoint/objhash.c           |    7 +++
 checkpoint/process.c           |   64 ++++++++++++++++++++++++-
 checkpoint/signal.c            |  104 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h     |    6 ++
 include/linux/checkpoint_hdr.h |   14 +++++-
 5 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index b4034dc..bf2f761 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -289,6 +289,13 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_sighand,
 		.restore = restore_sighand,
 	},
+	/* signal object */
+	{
+		.obj_name = "SIGNAL",
+		.obj_type = CKPT_OBJ_SIGNAL,
+		.ref_drop = obj_no_drop,
+		.ref_grab = obj_no_grab,
+	},
 	/* ns object */
 	{
 		.obj_name = "NSPROXY",
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 56f33dd..e596e2a 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -182,7 +182,8 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	int files_objref;
 	int mm_objref;
 	int sighand_objref;
-	int ret;
+	int signal_objref;
+	int first, ret;
 
 	/*
 	 * Shared objects may have dependencies among them: task->mm
@@ -222,14 +223,36 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return sighand_objref;
 	}
 
+	/*
+	 * Handle t->signal differently because the checkpoint method
+	 * for t->signal needs access to owning task_struct to access
+	 * t->sighand (to lock/unlock). First explicitly determine if
+	 * need to save, and only below invoke checkpoint_obj_signal()
+	 * if needed.
+	 */
+	signal_objref = ckpt_obj_lookup_add(ctx, t->signal,
+					    CKPT_OBJ_SIGNAL, &first);
+	ckpt_debug("signal: objref %d\n", signal_objref);
+	if (signal_objref < 0)
+		return signal_objref;
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
 	if (!h)
 		return -ENOMEM;
 	h->files_objref = files_objref;
 	h->mm_objref = mm_objref;
 	h->sighand_objref = sighand_objref;
+	h->signal_objref = signal_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	/* actually save t->signal, if need to */
+	if (first)
+		ret = checkpoint_obj_signal(ctx, t);
+	if (ret < 0)
+		ckpt_write_err(ctx, "TE", "signal_struct", ret);
 
 	return ret;
 }
@@ -375,6 +398,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task_signal(ctx, t);
+	ckpt_debug("task-signal %d\n", ret);
  out:
 	ctx->tsk = NULL;
 	return ret;
@@ -551,6 +578,11 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 
 	ret = restore_obj_sighand(ctx, h->sighand_objref);
 	ckpt_debug("sighand: ret %d (%p)\n", ret, current->sighand);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_obj_signal(ctx, h->signal_objref);
+	ckpt_debug("signal: ret %d (%p)\n", ret, current->signal);
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
@@ -688,11 +720,37 @@ int restore_restart_block(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* pre_restore_task - prepare the task for restore */
+static int pre_restore_task(struct ckpt_ctx *ctx)
+{
+	sigset_t sigset;
+
+	/*
+	 * Block task's signals to avoid interruptions due to signals,
+	 * say, from restored timers, file descriptors etc. Signals
+	 * will be unblocked when restore completes.
+	 *
+	 * NOTE: tasks with file descriptors set to send a SIGKILL as
+	 * i/o notification may fail the restart if a signal occurs
+	 * before that task completed its restore. FIX ?
+	 */
+	sigfillset(&sigset);
+	sigdelset(&sigset, SIGKILL);
+	sigdelset(&sigset, SIGSTOP);
+	sigprocmask(SIG_SETMASK, &sigset, NULL);
+
+	return 0;
+}
+
 /* read the entire state of the current task */
 int restore_task(struct ckpt_ctx *ctx)
 {
 	int ret;
 
+	ret = pre_restore_task(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = restore_task_struct(ctx);
 	ckpt_debug("task %d\n", ret);
 	if (ret < 0)
@@ -720,6 +778,10 @@ int restore_task(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_creds(ctx);
 	ckpt_debug("creds: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+
+	ret = restore_task_signal(ctx);
  out:
 	return ret;
 }
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 1aadadd..3fac75c 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -161,3 +161,107 @@ int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref)
 
 	return 0;
 }
+
+/***********************************************************************
+ * signal checkpoint/restart
+ */
+
+static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_signal *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
+	if (!h)
+		return -ENOMEM;
+
+	/* fill in later */
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int checkpoint_obj_signal(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	BUG_ON(t->flags & PF_EXITING);
+	return checkpoint_signal(ctx, t);
+}
+
+static int restore_signal(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_signal *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* fill in later */
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref)
+{
+	struct signal_struct *signal;
+	int ret = 0;
+
+	signal = ckpt_obj_fetch(ctx, signal_objref, CKPT_OBJ_SIGNAL);
+	if (!IS_ERR(signal)) {
+		/*
+		 * signal_struct is already shared properly as it is
+		 * tied to thread groups. Since thread relationships
+		 * are already restored now, t->signal must match.
+		 */
+		if (signal != current->signal)
+			ret = -EINVAL;
+	} else if (PTR_ERR(signal) == -EINVAL) {
+		/* first timer: add to hash and restore our t->signal */
+		ret = ckpt_obj_insert(ctx, current->signal,
+				      signal_objref, CKPT_OBJ_SIGNAL);
+		if (ret >= 0)
+			ret = restore_signal(ctx);
+	} else {
+		ret = PTR_ERR(signal);
+	}
+
+	return ret;
+}
+
+int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_signal_task *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL_TASK);
+	if (!h)
+		return -ENOMEM;
+
+	fill_sigset(&h->blocked, &t->blocked);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+int restore_task_signal(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_signal_task *h;
+	sigset_t blocked;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL_TASK);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	load_sigset(&blocked, &h->blocked);
+	/* silently remove SIGKILL, SIGSTOP */
+	sigdelset(&blocked, SIGKILL);
+	sigdelset(&blocked, SIGSTOP);
+
+	sigprocmask(SIG_SETMASK, &blocked, NULL);
+	recalc_sigpending();
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index c0e549e..ec98a43 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -253,6 +253,12 @@ extern int ckpt_collect_sighand(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_sighand(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_sighand(struct ckpt_ctx *ctx);
 
+extern int checkpoint_obj_signal(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref);
+
+extern int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_task_signal(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3d3a105..ee949b5 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -88,6 +88,8 @@ enum {
 	CKPT_HDR_IPC_SEM,
 
 	CKPT_HDR_SIGHAND = 601,
+	CKPT_HDR_SIGNAL,
+	CKPT_HDR_SIGNAL_TASK,
 
 	CKPT_HDR_TAIL = 9001,
 
@@ -116,6 +118,7 @@ enum obj_type {
 	CKPT_OBJ_FILE,
 	CKPT_OBJ_MM,
 	CKPT_OBJ_SIGHAND,
+	CKPT_OBJ_SIGNAL,
 	CKPT_OBJ_NS,
 	CKPT_OBJ_UTS_NS,
 	CKPT_OBJ_IPC_NS,
@@ -210,7 +213,6 @@ struct ckpt_hdr_task {
 	__u32 compat_robust_futex_list; /* a compat __user ptr */
 	__u32 robust_futex_head_len;
 	__u64 robust_futex_list; /* a __user ptr */
-
 } __attribute__((aligned(8)));
 
 /* Posix capabilities */
@@ -305,6 +307,7 @@ struct ckpt_hdr_task_objs {
 	__s32 files_objref;
 	__s32 mm_objref;
 	__s32 sighand_objref;
+	__s32 signal_objref;
 } __attribute__((aligned(8)));
 
 /* restart blocks */
@@ -437,6 +440,15 @@ struct ckpt_hdr_sighand {
 	struct ckpt_sigaction action[0];
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_signal {
+	struct ckpt_hdr h;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_signal_task {
+	struct ckpt_hdr h;
+	struct ckpt_sigset blocked;
+} __attribute__((aligned(8)));
+
 /* ipc commons */
 struct ckpt_hdr_ipcns {
 	struct ckpt_hdr h;
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 64/80] c/r: [signal 2/4] checkpoint/restart of rlimit
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (62 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 63/80] c/r: [signal 1/4] blocked and template for shared signals Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 65/80] c/r: [signal 3/4] pending signals (private, shared) Oren Laadan
                     ` (17 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

This patch adds checkpoint and restart of the rlimit information
that is part of the shared signal_struct.
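
For readers who want to see the save/restore round trip in isolation,
here is a small userspace sketch. It is an assumption-laden analogue,
not the patch's code: it uses the getrlimit()/setrlimit() system calls
where the patch reads t->signal->rlim and calls do_setrlimit(), and the
demo_* names are hypothetical. The record layout follows struct
ckpt_rlimit from the diff (two 64-bit fields), so the saved values do
not depend on the width of rlim_t.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/resource.h>

/* Fixed-width image record, modelled on struct ckpt_rlimit in the patch. */
struct demo_rlimit {
	uint64_t rlim_cur;
	uint64_t rlim_max;
};

/* Save one limit of the calling process into the image record. */
static int demo_save_rlimit(int resource, struct demo_rlimit *out)
{
	struct rlimit rl;

	if (getrlimit(resource, &rl) < 0)
		return -1;
	out->rlim_cur = (uint64_t)rl.rlim_cur;
	out->rlim_max = (uint64_t)rl.rlim_max;
	return 0;
}

/* Apply a saved record; reject one whose soft limit exceeds the hard one. */
static int demo_restore_rlimit(int resource, const struct demo_rlimit *in)
{
	struct rlimit rl;

	if (in->rlim_cur > in->rlim_max)
		return -1;
	rl.rlim_cur = (rlim_t)in->rlim_cur;
	rl.rlim_max = (rlim_t)in->rlim_max;
	return setrlimit(resource, &rl);
}
```

A full restore would loop over every resource index, as restore_signal()
does in the diff, and stop at the first failure.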

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Louis Rilling <Louis.Rilling@kerlabs.com>
---
 checkpoint/checkpoint.c        |    2 ++
 checkpoint/restart.c           |    3 +++
 checkpoint/signal.c            |   27 +++++++++++++++++++++++----
 include/linux/checkpoint_hdr.h |   17 +++++++++++++++++
 include/linux/resource.h       |    4 ++++
 kernel/sys.c                   |   36 +++++++++++++++++++++++-------------
 6 files changed, 72 insertions(+), 17 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 3460c03..ae79df7 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -297,6 +297,8 @@ static void fill_kernel_const(struct ckpt_const *h)
 	h->uts_version_len = sizeof(uts->version);
 	h->uts_machine_len = sizeof(uts->machine);
 	h->uts_domainname_len = sizeof(uts->domainname);
+	/* rlimit */
+	h->rlimit_nlimits = RLIM_NLIMITS;
 }
 
 /* write the checkpoint header */
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 73c4e72..340698a 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -399,6 +399,9 @@ static int check_kernel_const(struct ckpt_const *h)
 		return -EINVAL;
 	if (h->uts_domainname_len != sizeof(uts->domainname))
 		return -EINVAL;
+	/* rlimit */
+	if (h->rlimit_nlimits != RLIM_NLIMITS)
+		return -EINVAL;
 
 	return 0;
 }
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 3fac75c..04013ef 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -14,6 +14,7 @@
 #include <linux/sched.h>
 #include <linux/signal.h>
 #include <linux/errno.h>
+#include <linux/resource.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -169,13 +170,22 @@ int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref)
 static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_signal *h;
+	struct signal_struct *signal;
+	struct rlimit *rlim;
-	int ret;
+	int i, ret;
 
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
 	if (!h)
 		return -ENOMEM;
 
-	/* fill in later */
+	signal = t->signal;
+	rlim = signal->rlim;
+
+	/* rlimit */
+	for (i = 0; i < RLIM_NLIMITS; i++) {
+		h->rlim[i].rlim_cur = rlim[i].rlim_cur;
+		h->rlim[i].rlim_max = rlim[i].rlim_max;
+	}
 
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
@@ -191,15 +201,24 @@ int checkpoint_obj_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 static int restore_signal(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_signal *h;
+	struct rlimit rlim;
+	int i, ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
 	if (IS_ERR(h))
 		return PTR_ERR(h);
 
-	/* fill in later */
-
+	/* rlimit */
+	for (i = 0; i < RLIM_NLIMITS; i++) {
+		rlim.rlim_cur = h->rlim[i].rlim_cur;
+		rlim.rlim_max = h->rlim[i].rlim_max;
+		ret = do_setrlimit(i, &rlim);
+		if (ret < 0)
+			break;
+	}
+ out:
 	ckpt_hdr_put(ctx, h);
-	return 0;
+	return ret;
 }
 
 int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ee949b5..203b4ee 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -144,6 +144,8 @@ struct ckpt_const {
 	__u16 uts_version_len;
 	__u16 uts_machine_len;
 	__u16 uts_domainname_len;
+	/* rlimit */
+	__u16 rlimit_nlimits;
 } __attribute__((aligned(8)));
 
 /* checkpoint image header */
@@ -440,8 +442,23 @@ struct ckpt_hdr_sighand {
 	struct ckpt_sigaction action[0];
 } __attribute__((aligned(8)));
 
+struct ckpt_rlimit {
+	__u64 rlim_cur;
+	__u64 rlim_max;
+} __attribute__((aligned(8)));
+
+/* cannot include <linux/resource.h> from userspace, so define: */
+#define CKPT_RLIM_NLIMITS  16
+#ifdef __KERNEL__
+#include <linux/resource.h>
+#if CKPT_RLIM_NLIMITS != RLIM_NLIMITS
+#error CKPT_RLIM_NLIMITS size is wrong per asm-generic/resource.h
+#endif
+#endif
+
 struct ckpt_hdr_signal {
 	struct ckpt_hdr h;
+	struct ckpt_rlimit rlim[CKPT_RLIM_NLIMITS];
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_signal_task {
diff --git a/include/linux/resource.h b/include/linux/resource.h
index 40fc7e6..87e1bf3 100644
--- a/include/linux/resource.h
+++ b/include/linux/resource.h
@@ -72,4 +72,8 @@ struct rlimit {
 
 int getrusage(struct task_struct *p, int who, struct rusage __user *ru);
 
+#ifdef __KERNEL__
+extern int do_setrlimit(unsigned int resource, struct rlimit *rlim);
+#endif
+
 #endif
diff --git a/kernel/sys.c b/kernel/sys.c
index da4f9e0..0979a3f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1144,40 +1144,39 @@ SYSCALL_DEFINE2(old_getrlimit, unsigned int, resource,
 
 #endif
 
-SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
+int do_setrlimit(unsigned int resource, struct rlimit *new_rlim)
 {
-	struct rlimit new_rlim, *old_rlim;
+	struct rlimit *old_rlim;
 	int retval;
 
 	if (resource >= RLIM_NLIMITS)
 		return -EINVAL;
-	if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
-		return -EFAULT;
-	if (new_rlim.rlim_cur > new_rlim.rlim_max)
+	if (new_rlim->rlim_cur > new_rlim->rlim_max)
 		return -EINVAL;
+
 	old_rlim = current->signal->rlim + resource;
-	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
+	if ((new_rlim->rlim_max > old_rlim->rlim_max) &&
 	    !capable(CAP_SYS_RESOURCE))
 		return -EPERM;
-	if (resource == RLIMIT_NOFILE && new_rlim.rlim_max > sysctl_nr_open)
+	if (resource == RLIMIT_NOFILE && new_rlim->rlim_max > sysctl_nr_open)
 		return -EPERM;
 
-	retval = security_task_setrlimit(resource, &new_rlim);
+	retval = security_task_setrlimit(resource, new_rlim);
 	if (retval)
 		return retval;
 
-	if (resource == RLIMIT_CPU && new_rlim.rlim_cur == 0) {
+	if (resource == RLIMIT_CPU && new_rlim->rlim_cur == 0) {
 		/*
 		 * The caller is asking for an immediate RLIMIT_CPU
 		 * expiry.  But we use the zero value to mean "it was
 		 * never set".  So let's cheat and make it one second
 		 * instead
 		 */
-		new_rlim.rlim_cur = 1;
+		new_rlim->rlim_cur = 1;
 	}
 
 	task_lock(current->group_leader);
-	*old_rlim = new_rlim;
+	*old_rlim = *new_rlim;
 	task_unlock(current->group_leader);
 
 	if (resource != RLIMIT_CPU)
@@ -1189,14 +1188,25 @@ SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
 	 * very long-standing error, and fixing it now risks breakage of
 	 * applications, so we live with it
 	 */
-	if (new_rlim.rlim_cur == RLIM_INFINITY)
+	if (new_rlim->rlim_cur == RLIM_INFINITY)
 		goto out;
 
-	update_rlimit_cpu(new_rlim.rlim_cur);
+	update_rlimit_cpu(new_rlim->rlim_cur);
 out:
 	return 0;
 }
 
+SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
+{
+	struct rlimit new_rlim;
+
+	if (resource >= RLIM_NLIMITS)
+		return -EINVAL;
+	if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
+		return -EFAULT;
+	return do_setrlimit(resource, &new_rlim);
+}
+
 /*
  * It would make sense to put struct rusage in the task_struct,
  * except that would make the task_struct be *really big*.  After
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 64/80] c/r: [signal 2/4] checkpoint/restart of rlimit
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

This patch adds checkpoint and restart of the rlimit information
that is part of the shared signal_struct.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Louis Rilling <Louis.Rilling@kerlabs.com>
---
 checkpoint/checkpoint.c        |    2 ++
 checkpoint/restart.c           |    3 +++
 checkpoint/signal.c            |   27 +++++++++++++++++++++++----
 include/linux/checkpoint_hdr.h |   17 +++++++++++++++++
 include/linux/resource.h       |    4 ++++
 kernel/sys.c                   |   36 +++++++++++++++++++++++-------------
 6 files changed, 72 insertions(+), 17 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 3460c03..ae79df7 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -297,6 +297,8 @@ static void fill_kernel_const(struct ckpt_const *h)
 	h->uts_version_len = sizeof(uts->version);
 	h->uts_machine_len = sizeof(uts->machine);
 	h->uts_domainname_len = sizeof(uts->domainname);
+	/* rlimit */
+	h->rlimit_nlimits = RLIM_NLIMITS;
 }
 
 /* write the checkpoint header */
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 73c4e72..340698a 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -399,6 +399,9 @@ static int check_kernel_const(struct ckpt_const *h)
 		return -EINVAL;
 	if (h->uts_domainname_len != sizeof(uts->domainname))
 		return -EINVAL;
+	/* rlimit */
+	if (h->rlimit_nlimits != RLIM_NLIMITS)
+		return -EINVAL;
 
 	return 0;
 }
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 3fac75c..04013ef 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -14,6 +14,7 @@
 #include <linux/sched.h>
 #include <linux/signal.h>
 #include <linux/errno.h>
+#include <linux/resource.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -169,13 +170,22 @@ int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref)
 static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_signal *h;
+	struct signal_struct *signal;
+	struct rlimit *rlim;
 	int ret;
 
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
 	if (!h)
 		return -ENOMEM;
 
-	/* fill in later */
+	signal = t->signal;
+	rlim = signal->rlim;
+
+	/* rlimit */
+	for (i = 0; i < RLIM_NLIMITS; i++) {
+		h->rlim[i].rlim_cur = rlim[i].rlim_cur;
+		h->rlim[i].rlim_max = rlim[i].rlim_max;
+	}
 
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
@@ -191,15 +201,24 @@ int checkpoint_obj_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 static int restore_signal(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_signal *h;
+	struct rlimit rlim;
+	int i, ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
 	if (IS_ERR(h))
 		return PTR_ERR(h);
 
-	/* fill in later */
-
+	/* rlimit */
+	for (i = 0; i < RLIM_NLIMITS; i++) {
+		rlim.rlim_cur = h->rlim[i].rlim_cur;
+		rlim.rlim_max = h->rlim[i].rlim_max;
+		ret = do_setrlimit(i, &rlim);
+		if (ret < 0)
+			break;
+	}
+ out:
 	ckpt_hdr_put(ctx, h);
-	return 0;
+	return ret;
 }
 
 int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ee949b5..203b4ee 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -144,6 +144,8 @@ struct ckpt_const {
 	__u16 uts_version_len;
 	__u16 uts_machine_len;
 	__u16 uts_domainname_len;
+	/* rlimit */
+	__u16 rlimit_nlimits;
 } __attribute__((aligned(8)));
 
 /* checkpoint image header */
@@ -440,8 +442,23 @@ struct ckpt_hdr_sighand {
 	struct ckpt_sigaction action[0];
 } __attribute__((aligned(8)));
 
+struct ckpt_rlimit {
+	__u64 rlim_cur;
+	__u64 rlim_max;
+} __attribute__((aligned(8)));
+
+/* cannot include <linux/resource.h> from userspace, so define: */
+#define CKPT_RLIM_NLIMITS  16
+#ifdef __KERNEL__
+#include <linux/resource.h>
+#if CKPT_RLIM_NLIMITS != RLIM_NLIMITS
+#error CKPT_RLIM_NLIMITS is wrong per asm-generic/resource.h
+#endif
+#endif
+
 struct ckpt_hdr_signal {
 	struct ckpt_hdr h;
+	struct ckpt_rlimit rlim[CKPT_RLIM_NLIMITS];
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_signal_task {
diff --git a/include/linux/resource.h b/include/linux/resource.h
index 40fc7e6..87e1bf3 100644
--- a/include/linux/resource.h
+++ b/include/linux/resource.h
@@ -72,4 +72,8 @@ struct rlimit {
 
 int getrusage(struct task_struct *p, int who, struct rusage __user *ru);
 
+#ifdef __KERNEL__
+extern int do_setrlimit(unsigned int resource, struct rlimit *rlim);
+#endif
+
 #endif
diff --git a/kernel/sys.c b/kernel/sys.c
index da4f9e0..0979a3f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1144,40 +1144,39 @@ SYSCALL_DEFINE2(old_getrlimit, unsigned int, resource,
 
 #endif
 
-SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
+int do_setrlimit(unsigned int resource, struct rlimit *new_rlim)
 {
-	struct rlimit new_rlim, *old_rlim;
+	struct rlimit *old_rlim;
 	int retval;
 
 	if (resource >= RLIM_NLIMITS)
 		return -EINVAL;
-	if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
-		return -EFAULT;
-	if (new_rlim.rlim_cur > new_rlim.rlim_max)
+	if (new_rlim->rlim_cur > new_rlim->rlim_max)
 		return -EINVAL;
+
 	old_rlim = current->signal->rlim + resource;
-	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
+	if ((new_rlim->rlim_max > old_rlim->rlim_max) &&
 	    !capable(CAP_SYS_RESOURCE))
 		return -EPERM;
-	if (resource == RLIMIT_NOFILE && new_rlim.rlim_max > sysctl_nr_open)
+	if (resource == RLIMIT_NOFILE && new_rlim->rlim_max > sysctl_nr_open)
 		return -EPERM;
 
-	retval = security_task_setrlimit(resource, &new_rlim);
+	retval = security_task_setrlimit(resource, new_rlim);
 	if (retval)
 		return retval;
 
-	if (resource == RLIMIT_CPU && new_rlim.rlim_cur == 0) {
+	if (resource == RLIMIT_CPU && new_rlim->rlim_cur == 0) {
 		/*
 		 * The caller is asking for an immediate RLIMIT_CPU
 		 * expiry.  But we use the zero value to mean "it was
 		 * never set".  So let's cheat and make it one second
 		 * instead
 		 */
-		new_rlim.rlim_cur = 1;
+		new_rlim->rlim_cur = 1;
 	}
 
 	task_lock(current->group_leader);
-	*old_rlim = new_rlim;
+	*old_rlim = *new_rlim;
 	task_unlock(current->group_leader);
 
 	if (resource != RLIMIT_CPU)
@@ -1189,14 +1188,25 @@ SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
 	 * very long-standing error, and fixing it now risks breakage of
 	 * applications, so we live with it
 	 */
-	if (new_rlim.rlim_cur == RLIM_INFINITY)
+	if (new_rlim->rlim_cur == RLIM_INFINITY)
 		goto out;
 
-	update_rlimit_cpu(new_rlim.rlim_cur);
+	update_rlimit_cpu(new_rlim->rlim_cur);
 out:
 	return 0;
 }
 
+SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
+{
+	struct rlimit new_rlim;
+
+	if (resource >= RLIM_NLIMITS)
+		return -EINVAL;
+	if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
+		return -EFAULT;
+	return do_setrlimit(resource, &new_rlim);
+}
+
 /*
  * It would make sense to put struct rusage in the task_struct,
  * except that would make the task_struct be *really big*.  After
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 65/80] c/r: [signal 3/4] pending signals (private, shared)
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (63 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 64/80] c/r: [signal 2/4] checkpoint/restart of rlimit Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 66/80] c/r: [signal 4/4] support for real/virt/prof itimers Oren Laadan
                     ` (16 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

This patch adds checkpoint and restart of pending signal queues
(struct sigpending): both the per-task t->pending and the shared
(per-thread-group) t->signal->shared_pending.

To checkpoint pending signals (private/shared) we first detach the
signal queue (and copy the mask) to a separate struct sigpending.
This separate structure can be iterated through without locking.

Once the state is saved, we re-attach (prepend) the original signal
queue back to the original struct sigpending.

Signals that arrive(d) in the meantime will be suitably queued after
these (for real-time signals). Repeated non-realtime signals will not
be queued because they will already be marked in the pending mask,
which remains as is. This is the expected behavior of non-realtime
signals.
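
The detach/re-attach scheme described above can be sketched in
userspace roughly as follows. This is a minimal model, not the kernel
code: the list helpers are simplified stand-ins for list_add_tail(),
list_splice() and list_splice_init(), and struct sig and
queue_to_array() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Append @item at the tail of the list headed by @head. */
static void list_append(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Insert every entry of @src at the front of @dst; @src is left stale. */
static void list_splice_front(struct list_head *src, struct list_head *dst)
{
	struct list_head *first = src->next, *last = src->prev;

	if (list_is_empty(src))
		return;
	first->prev = dst;
	last->next = dst->next;
	dst->next->prev = last;
	dst->next = first;
}

/* Same as above, but also reinitialise @src so it can be reused. */
static void list_splice_front_init(struct list_head *src,
				   struct list_head *dst)
{
	list_splice_front(src, dst);
	list_init(src);
}

/* Hypothetical queued signal; the link is the first member so a list
 * pointer can be cast back to the enclosing record. */
struct sig {
	struct list_head link;
	int signo;
};

/* Copy up to @max signal numbers, in queue order, into @out. */
static int queue_to_array(const struct list_head *head, int *out, int max)
{
	const struct list_head *p;
	int n = 0;

	assert(out != NULL);
	for (p = head->next; p != head && n < max; p = p->next)
		out[n++] = ((const struct sig *)p)->signo;
	return n;
}
```

Detaching with list_splice_front_init() leaves the live queue empty
while the borrowed copy is walked; a signal appended in the meantime
ends up behind the borrowed entries once list_splice_front() puts them
back, which is the ordering the description relies on.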

Changelog [v4]:
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
Changelog [v3]:
  - [Dan Smith] Sanity check for number of pending signals in buffer
Changelog [v2]:
  - Validate si_errno from checkpoint image
Changelog [v1]:
  - Fix compilation warnings
  - [Louis Rilling] Remove SIGQUEUE_PREALLOC flag from queued signals
  - [Louis Rilling] Fail if task has posix-timers or SI_TIMER signal
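
For illustration, the v3 sanity check on the number of pending signals
amounts to comparing the record length from the image against the
length implied by nr_pending. The sketch below is a standalone model
with a hypothetical helper name; the sizes are passed in rather than
taken from the real structures, and the 64-bit arithmetic is an extra
precaution of this sketch, not something the patch itself does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return 1 if a sigpending record of @len bytes
 * holds exactly one header of @hdr_size bytes followed by @nr_pending
 * entries of @rec_size bytes each, and 0 otherwise. The product is
 * computed in 64 bits so a huge nr_pending cannot wrap around and
 * appear to match a short buffer.
 */
static int sigpending_len_ok(uint32_t len, uint32_t nr_pending,
			     size_t hdr_size, size_t rec_size)
{
	uint64_t want;

	assert(rec_size > 0 && rec_size <= UINT32_MAX);
	assert(hdr_size <= UINT32_MAX);
	want = (uint64_t)nr_pending * rec_size + hdr_size;
	return want == len;
}
```

A restore path would run such a check before walking the entries, so
that a corrupt or malicious image cannot make the loop read past the
buffer that was actually received.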

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Louis Rilling <Louis.Rilling@kerlabs.com>
---
 checkpoint/signal.c            |  277 +++++++++++++++++++++++++++++++++++++++-
 include/linux/checkpoint_hdr.h |   23 ++++
 2 files changed, 298 insertions(+), 2 deletions(-)

diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 04013ef..27e0f10 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -167,12 +167,156 @@ int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref)
  * signal checkpoint/restart
  */
 
+static void fill_siginfo(struct ckpt_siginfo *si, siginfo_t *info)
+{
+	si->signo = info->si_signo;
+	si->_errno = info->si_errno;
+	si->code = info->si_code;
+
+	/* TODO: convert info->si_uid to uid_objref */
+
+	switch(info->si_code & __SI_MASK) {
+	case __SI_TIMER:
+		si->pid = info->si_tid;
+		si->uid = info->si_overrun;
+		si->sigval_int = info->si_int;
+		si->utime = info->si_sys_private;
+		break;
+	case __SI_POLL:
+		si->pid = info->si_band;
+		si->sigval_int = info->si_fd;
+		break;
+	case __SI_FAULT:
+		si->sigval_ptr = (unsigned long) info->si_addr;
+#ifdef __ARCH_SI_TRAPNO
+		si->sigval_int = info->si_trapno;
+#endif
+		break;
+	case __SI_CHLD:
+		si->pid = info->si_pid;
+		si->uid = info->si_uid;
+		si->sigval_int = info->si_status;
+		si->stime = info->si_stime;
+		si->utime = info->si_utime;
+		break;
+	case __SI_KILL:
+	case __SI_RT:
+	case __SI_MESGQ:
+		si->pid = info->si_pid;
+		si->uid = info->si_uid;
+		si->sigval_ptr = (unsigned long) info->si_ptr;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static int load_siginfo(siginfo_t *info, struct ckpt_siginfo *si)
+{
+	if (!valid_signal(si->signo))
+		return -EINVAL;
+	if (!ckpt_validate_errno(si->_errno))
+		return -EINVAL;
+
+	info->si_signo = si->signo;
+	info->si_errno = si->_errno;
+	info->si_code = si->code;
+
+	/* TODO: validate remaining signal fields */
+
+	switch(info->si_code & __SI_MASK) {
+	case __SI_TIMER:
+		info->si_tid = si->pid;
+		info->si_overrun = si->uid;
+		info->si_int = si->sigval_int;
+		info->si_sys_private = si->utime;
+		break;
+	case __SI_POLL:
+		info->si_band = si->pid;
+		info->si_fd = si->sigval_int;
+		break;
+	case __SI_FAULT:
+		info->si_addr = (void __user *) (unsigned long) si->sigval_ptr;
+#ifdef __ARCH_SI_TRAPNO
+		info->si_trapno = si->sigval_int;
+#endif
+		break;
+	case __SI_CHLD:
+		info->si_pid = si->pid;
+		info->si_uid = si->uid;
+		info->si_status = si->sigval_int;
+		info->si_stime = si->stime;
+		info->si_utime = si->utime;
+		break;
+	case __SI_KILL:
+	case __SI_RT:
+	case __SI_MESGQ:
+		info->si_pid = si->pid;
+		info->si_uid = si->uid;
+		info->si_ptr = (void __user *) (unsigned long) si->sigval_ptr;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * To checkpoint pending signals (private/shared) the caller moves the
+ * signal queue (and copies the mask) to a separate struct sigpending,
+ * therefore we can iterate through it without locking.
+ * After we return, the caller re-attaches (prepends) the original
+ * signal queue to the original struct sigpending. Thus, signals that
+ * arrive(d) in the meantime will be suitably queued after these.
+ * Finally, repeated non-realtime signals will not be queued because
+ * they will already be marked in the pending mask, that remains as is.
+ * This is the expected behavior of non-realtime signals.
+ */
+static int checkpoint_sigpending(struct ckpt_ctx *ctx,
+				 struct sigpending *pending)
+{
+	struct ckpt_hdr_sigpending *h;
+	struct ckpt_siginfo *si;
+	struct sigqueue *q;
+	int nr_pending = 0;
+	int ret;
+
+	list_for_each_entry(q, &pending->list, list) {
+		/* TODO: remove after adding support for posix-timers */
+		if ((q->info.si_code & __SI_MASK) == __SI_TIMER) {
+			ckpt_write_err(ctx, "TE", "signal SI_TIMER", -ENOTSUPP);
+			return -ENOTSUPP;
+		}
+		nr_pending++;
+	}
+
+	h = ckpt_hdr_get_type(ctx, nr_pending * sizeof(*si) + sizeof(*h),
+			      CKPT_HDR_SIGPENDING);
+	if (!h)
+		return -ENOMEM;
+
+	h->nr_pending = nr_pending;
+	fill_sigset(&h->signal, &pending->signal);
+
+	si = h->siginfo;
+	list_for_each_entry(q, &pending->list, list)
+		fill_siginfo(si++, &q->info);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_signal *h;
 	struct signal_struct *signal;
+	struct sigpending shared_pending;
 	struct rlimit *rlim;
-	int ret;
+	unsigned long flags;
+	int i, ret;
 
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
 	if (!h)
@@ -181,13 +325,45 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 	signal = t->signal;
 	rlim = signal->rlim;
 
+	INIT_LIST_HEAD(&shared_pending.list);
+
+	/* temporarily borrow signal queue - see checkpoint_sigpending() */
+	if (!lock_task_sighand(t, &flags)) {
+		pr_warning("c/r: [%d] without sighand\n", task_pid_vnr(t));
+		ret = -EBUSY;
+		goto out;
+	}
+
+	/* TODO: remove after adding support for posix-timers */
+	if (!list_empty(&signal->posix_timers)) {
+		ckpt_write_err(ctx, "TEP", "posix-timers\n", -ENOTSUPP, signal);
+		unlock_task_sighand(t, &flags);
+		ret = -ENOTSUPP;
+		goto out;
+	}
+
+	list_splice_init(&signal->shared_pending.list, &shared_pending.list);
+	shared_pending.signal = signal->shared_pending.signal;
+
 	/* rlimit */
 	for (i = 0; i < RLIM_NLIMITS; i++) {
 		h->rlim[i].rlim_cur = rlim[i].rlim_cur;
 		h->rlim[i].rlim_max = rlim[i].rlim_max;
 	}
+	unlock_task_sighand(t, &flags);
 
 	ret = ckpt_write_obj(ctx, &h->h);
+	if (!ret)
+		ret = checkpoint_sigpending(ctx, &shared_pending);
+
+	/* return the borrowed queue */
+	if (!lock_task_sighand(t, &flags)) {
+		pr_warning("c/r: [%d] sighand disappeared\n", task_pid_vnr(t));
+		goto out;
+	}
+	list_splice(&shared_pending.list, &signal->shared_pending.list);
+	unlock_task_sighand(t, &flags);
+ out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
@@ -198,9 +374,55 @@ int checkpoint_obj_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 	return checkpoint_signal(ctx, t);
 }
 
+static int restore_sigpending(struct ckpt_ctx *ctx, struct sigpending *pending)
+{
+	struct ckpt_hdr_sigpending *h;
+	struct ckpt_siginfo *si;
+	struct sigqueue *q;
+	int ret = 0;
+
+	h = ckpt_read_buf_type(ctx, 0, CKPT_HDR_SIGPENDING);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	if (h->h.len != h->nr_pending * sizeof(*si) + sizeof(*h)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	INIT_LIST_HEAD(&pending->list);
+	load_sigset(&pending->signal, &h->signal);
+
+	si = h->siginfo;
+	while (h->nr_pending--) {
+		q = sigqueue_alloc();
+		if (!q) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		ret = load_siginfo(&q->info, si++);
+		if (ret < 0) {
+			sigqueue_free(q);
+			break;
+		}
+
+		q->flags &= ~SIGQUEUE_PREALLOC;
+		list_add_tail(&q->list, &pending->list);
+	}
+
+	if (ret < 0)
+		flush_sigqueue(pending);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 static int restore_signal(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_signal *h;
+	struct sigpending new_pending;
+	struct sigpending *pending;
 	struct rlimit rlim;
 	int i, ret;
 
@@ -214,8 +436,19 @@ static int restore_signal(struct ckpt_ctx *ctx)
 		rlim.rlim_max = h->rlim[i].rlim_max;
 		ret = do_setrlimit(i, &rlim);
 		if (ret < 0)
-			break;
+			goto out;
 	}
+
+	ret = restore_sigpending(ctx, &new_pending);
+	if (ret < 0)
+		goto out;
+
+	spin_lock_irq(&current->sighand->siglock);
+	pending = &current->signal->shared_pending;
+	flush_sigqueue(pending);
+	pending->signal = new_pending.signal;
+	list_splice_init(&new_pending.list, &pending->list);
+	spin_unlock_irq(&current->sighand->siglock);
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
@@ -251,8 +484,34 @@ int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref)
 int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_signal_task *h;
+	struct sigpending pending;
+	unsigned long flags;
 	int ret;
 
+	INIT_LIST_HEAD(&pending.list);
+
+	/* temporarily borrow signal queue - see checkpoint_sigpending() */
+	if (!lock_task_sighand(t, &flags)) {
+		ckpt_write_err(ctx, "TE", "sighand missing", -EBUSY);
+		return -EBUSY;
+	}
+	list_splice_init(&t->pending.list, &pending.list);
+	pending.signal = t->pending.signal;
+	unlock_task_sighand(t, &flags);
+
+	ret = checkpoint_sigpending(ctx, &pending);
+
+	/* re-attach the borrowed queue */
+	if (!lock_task_sighand(t, &flags)) {
+		ckpt_write_err(ctx, "TE", "sighand missing", -EBUSY);
+		return -EBUSY;
+	}
+	list_splice(&pending.list, &t->pending.list);
+	unlock_task_sighand(t, &flags);
+
+	if (ret < 0)
+		return ret;
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL_TASK);
 	if (!h)
 		return -ENOMEM;
@@ -267,7 +526,21 @@ int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 int restore_task_signal(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_signal_task *h;
+	struct sigpending new_pending;
+	struct sigpending *pending;
 	sigset_t blocked;
+	int ret;
+
+	ret = restore_sigpending(ctx, &new_pending);
+	if (ret < 0)
+		return ret;
+
+	spin_lock_irq(&current->sighand->siglock);
+	pending = &current->pending;
+	flush_sigqueue(pending);
+	pending->signal = new_pending.signal;
+	list_splice_init(&new_pending.list, &pending->list);
+	spin_unlock_irq(&current->sighand->siglock);
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL_TASK);
 	if (IS_ERR(h))
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 203b4ee..fd2836e 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -90,6 +90,7 @@ enum {
 	CKPT_HDR_SIGHAND = 601,
 	CKPT_HDR_SIGNAL,
 	CKPT_HDR_SIGNAL_TASK,
+	CKPT_HDR_SIGPENDING,
 
 	CKPT_HDR_TAIL = 9001,
 
@@ -442,6 +443,28 @@ struct ckpt_hdr_sighand {
 	struct ckpt_sigaction action[0];
 } __attribute__((aligned(8)));
 
+#ifndef HAVE_ARCH_SIGINFO_T
+struct ckpt_siginfo {
+	__u32 signo;
+	__u32 _errno;
+	__u32 code;
+
+	__u32 pid;
+	__s32 uid;
+	__u32 sigval_int;
+	__u64 sigval_ptr;
+	__u64 utime;
+	__u64 stime;
+} __attribute__((aligned(8)));
+#endif
+
+struct ckpt_hdr_sigpending {
+	struct ckpt_hdr h;
+	__u32 nr_pending;
+	struct ckpt_sigset signal;
+	struct ckpt_siginfo siginfo[0];
+} __attribute__((aligned(8)));
+
 struct ckpt_rlimit {
 	__u64 rlim_cur;
 	__u64 rlim_max;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 65/80] c/r: [signal 3/4] pending signals (private, shared)
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

This patch adds checkpoint and restart of pending signal queues
(struct sigpending): both the per-task t->pending and the shared
(per-thread-group) t->signal->shared_pending.

To checkpoint pending signals (private/shared) we first detach the
signal queue (and copy the mask) to a separate struct sigpending.
This separate structure can be iterated through without locking.

Once the state is saved, we re-attach (prepend) the original signal
queue back to the original struct sigpending.

Signals that arrive(d) in the meantime will be suitably queued after
these (for real-time signals). Repeated non-realtime signals will not
be queued because they will already be marked in the pending mask,
which remains as is. This is the expected behavior of non-realtime
signals.
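
The point about repeated non-realtime signals can be illustrated with
a small model of a pending set: a bitmask plus a queue. This is a
simplified sketch under stated assumptions, not the kernel's signal
delivery code; struct pending_model, post_signal() and the MODEL_*
constants are hypothetical, with 32 assumed as the first real-time
signal number.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SIGRTMIN	32	/* assumed first real-time signal number */
#define MODEL_QUEUE_MAX	16	/* arbitrary capacity for this model */

struct pending_model {
	uint64_t mask;			/* bit (signo - 1) set: signo pending */
	int queue[MODEL_QUEUE_MAX];	/* queued signal numbers, in order */
	int nr_queued;
};

/*
 * Post @signo to the pending set. A standard (non-realtime) signal whose
 * mask bit is already set is coalesced and not queued again; real-time
 * signals are queued every time. Returns 1 if an entry was queued and
 * 0 if the signal was coalesced or the model's queue is full.
 */
static int post_signal(struct pending_model *p, int signo)
{
	uint64_t bit;

	assert(signo >= 1 && signo <= 64);
	bit = UINT64_C(1) << (signo - 1);

	if (signo < MODEL_SIGRTMIN && (p->mask & bit))
		return 0;

	p->mask |= bit;
	if (p->nr_queued == MODEL_QUEUE_MAX)
		return 0;
	p->queue[p->nr_queued++] = signo;
	return 1;
}
```

While the queue is borrowed for checkpointing, the live list is empty
but the mask still carries the bits of the borrowed entries, so a
repeated standard signal is dropped just as it would be with the
queue in place.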

Changelog [v4]:
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
Changelog [v3]:
  - [Dan Smith] Sanity check for number of pending signals in buffer
Changelog [v2]:
  - Validate si_errno from checkpoint image
Changelog [v1]:
  - Fix compilation warnings
  - [Louis Rilling] Remove SIGQUEUE_PREALLOC flag from queued signals
  - [Louis Rilling] Fail if task has posix-timers or SI_TIMER signal

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Louis Rilling <Louis.Rilling@kerlabs.com>
---
 checkpoint/signal.c            |  277 +++++++++++++++++++++++++++++++++++++++-
 include/linux/checkpoint_hdr.h |   23 ++++
 2 files changed, 298 insertions(+), 2 deletions(-)

diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 04013ef..27e0f10 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -167,12 +167,156 @@ int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref)
  * signal checkpoint/restart
  */
 
+static void fill_siginfo(struct ckpt_siginfo *si, siginfo_t *info)
+{
+	si->signo = info->si_signo;
+	si->_errno = info->si_errno;
+	si->code = info->si_code;
+
+	/* TODO: convert info->si_uid to uid_objref */
+
+	switch(info->si_code & __SI_MASK) {
+	case __SI_TIMER:
+		si->pid = info->si_tid;
+		si->uid = info->si_overrun;
+		si->sigval_int = info->si_int;
+		si->utime = info->si_sys_private;
+		break;
+	case __SI_POLL:
+		si->pid = info->si_band;
+		si->sigval_int = info->si_fd;
+		break;
+	case __SI_FAULT:
+		si->sigval_ptr = (unsigned long) info->si_addr;
+#ifdef __ARCH_SI_TRAPNO
+		si->sigval_int = info->si_trapno;
+#endif
+		break;
+	case __SI_CHLD:
+		si->pid = info->si_pid;
+		si->uid = info->si_uid;
+		si->sigval_int = info->si_status;
+		si->stime = info->si_stime;
+		si->utime = info->si_utime;
+		break;
+	case __SI_KILL:
+	case __SI_RT:
+	case __SI_MESGQ:
+		si->pid = info->si_pid;
+		si->uid = info->si_uid;
+		si->sigval_ptr = (unsigned long) info->si_ptr;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static int load_siginfo(siginfo_t *info, struct ckpt_siginfo *si)
+{
+	if (!valid_signal(si->signo))
+		return -EINVAL;
+	if (!ckpt_validate_errno(si->_errno))
+		return -EINVAL;
+
+	info->si_signo = si->signo;
+	info->si_errno = si->_errno;
+	info->si_code = si->code;
+
+	/* TODO: validate remaining signal fields */
+
+	switch(info->si_code & __SI_MASK) {
+	case __SI_TIMER:
+		info->si_tid = si->pid;
+		info->si_overrun = si->uid;
+		info->si_int = si->sigval_int;
+		info->si_sys_private = si->utime;
+		break;
+	case __SI_POLL:
+		info->si_band = si->pid;
+		info->si_fd = si->sigval_int;
+		break;
+	case __SI_FAULT:
+		info->si_addr = (void __user *) (unsigned long) si->sigval_ptr;
+#ifdef __ARCH_SI_TRAPNO
+		info->si_trapno = si->sigval_int;
+#endif
+		break;
+	case __SI_CHLD:
+		info->si_pid = si->pid;
+		info->si_uid = si->uid;
+		info->si_status = si->sigval_int;
+		info->si_stime = si->stime;
+		info->si_utime = si->utime;
+		break;
+	case __SI_KILL:
+	case __SI_RT:
+	case __SI_MESGQ:
+		info->si_pid = si->pid;
+		info->si_uid = si->uid;
+		info->si_ptr = (void __user *) (unsigned long) si->sigval_ptr;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * To checkpoint pending signals (private/shared) the caller moves the
+ * signal queue (and copies the mask) to a separate struct sigpending,
+ * therefore we can iterate through it without locking.
+ * After we return, the caller re-attaches (prepends) the original
+ * signal queue to the original struct sigpending. Thus, signals that
+ * arrive(d) in the meantime will be suitably queued after these.
+ * Finally, repeated non-realtime signals will not be queued because
+ * they will already be marked in the pending mask, that remains as is.
+ * This is the expected behavior of non-realtime signals.
+ */
+static int checkpoint_sigpending(struct ckpt_ctx *ctx,
+				 struct sigpending *pending)
+{
+	struct ckpt_hdr_sigpending *h;
+	struct ckpt_siginfo *si;
+	struct sigqueue *q;
+	int nr_pending = 0;
+	int ret;
+
+	list_for_each_entry(q, &pending->list, list) {
+		/* TODO: remove after adding support for posix-timers */
+		if ((q->info.si_code & __SI_MASK) == __SI_TIMER) {
+			ckpt_write_err(ctx, "TE", "signal SI_TIMER", -ENOTSUPP);
+			return -ENOTSUPP;
+		}
+		nr_pending++;
+	}
+
+	h = ckpt_hdr_get_type(ctx, nr_pending * sizeof(*si) + sizeof(*h),
+			      CKPT_HDR_SIGPENDING);
+	if (!h)
+		return -ENOMEM;
+
+	h->nr_pending = nr_pending;
+	fill_sigset(&h->signal, &pending->signal);
+
+	si = h->siginfo;
+	list_for_each_entry(q, &pending->list, list)
+		fill_siginfo(si++, &q->info);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_signal *h;
 	struct signal_struct *signal;
+	struct sigpending shared_pending;
 	struct rlimit *rlim;
-	int ret;
+	unsigned long flags;
+	int i, ret;
 
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
 	if (!h)
@@ -181,13 +325,45 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 	signal = t->signal;
 	rlim = signal->rlim;
 
+	INIT_LIST_HEAD(&shared_pending.list);
+
+	/* temporarily borrow signal queue - see checkpoint_sigpending() */
+	if (!lock_task_sighand(t, &flags)) {
+		pr_warning("c/r: [%d] without sighand\n", task_pid_vnr(t));
+		ret = -EBUSY;
+		goto out;
+	}
+
+	/* TODO: remove after adding support for posix-timers */
+	if (!list_empty(&signal->posix_timers)) {
+		ckpt_write_err(ctx, "TEP", "posix-timers\n", -ENOTSUPP, signal);
+		unlock_task_sighand(t, &flags);
+		ret = -ENOTSUPP;
+		goto out;
+	}
+
+	list_splice_init(&signal->shared_pending.list, &shared_pending.list);
+	shared_pending.signal = signal->shared_pending.signal;
+
 	/* rlimit */
 	for (i = 0; i < RLIM_NLIMITS; i++) {
 		h->rlim[i].rlim_cur = rlim[i].rlim_cur;
 		h->rlim[i].rlim_max = rlim[i].rlim_max;
 	}
+	unlock_task_sighand(t, &flags);
 
 	ret = ckpt_write_obj(ctx, &h->h);
+	if (!ret)
+		ret = checkpoint_sigpending(ctx, &shared_pending);
+
+	/* return the borrowed queue */
+	if (!lock_task_sighand(t, &flags)) {
+		pr_warning("c/r: [%d] sighand disappeared\n", task_pid_vnr(t));
+		goto out;
+	}
+	list_splice(&shared_pending.list, &signal->shared_pending.list);
+	unlock_task_sighand(t, &flags);
+ out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
 }
@@ -198,9 +374,55 @@ int checkpoint_obj_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 	return checkpoint_signal(ctx, t);
 }
 
+static int restore_sigpending(struct ckpt_ctx *ctx, struct sigpending *pending)
+{
+	struct ckpt_hdr_sigpending *h;
+	struct ckpt_siginfo *si;
+	struct sigqueue *q;
+	int ret = 0;
+
+	h = ckpt_read_buf_type(ctx, 0, CKPT_HDR_SIGPENDING);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	if (h->h.len != h->nr_pending * sizeof(*si) + sizeof(*h)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	INIT_LIST_HEAD(&pending->list);
+	load_sigset(&pending->signal, &h->signal);
+
+	si = h->siginfo;
+	while (h->nr_pending--) {
+		q = sigqueue_alloc();
+		if (!q) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		ret = load_siginfo(&q->info, si++);
+		if (ret < 0) {
+			sigqueue_free(q);
+			break;
+		}
+
+		q->flags &= ~SIGQUEUE_PREALLOC;
+		list_add_tail(&q->list, &pending->list);
+	}
+
+	if (ret < 0)
+		flush_sigqueue(pending);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 static int restore_signal(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_signal *h;
+	struct sigpending new_pending;
+	struct sigpending *pending;
 	struct rlimit rlim;
 	int i, ret;
 
@@ -214,8 +436,19 @@ static int restore_signal(struct ckpt_ctx *ctx)
 		rlim.rlim_max = h->rlim[i].rlim_max;
 		ret = do_setrlimit(i, &rlim);
 		if (ret < 0)
-			break;
+			goto out;
 	}
+
+	ret = restore_sigpending(ctx, &new_pending);
+	if (ret < 0)
+		goto out;
+
+	spin_lock_irq(&current->sighand->siglock);
+	pending = &current->signal->shared_pending;
+	flush_sigqueue(pending);
+	pending->signal = new_pending.signal;
+	list_splice_init(&new_pending.list, &pending->list);
+	spin_unlock_irq(&current->sighand->siglock);
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
@@ -251,8 +484,34 @@ int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref)
 int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct ckpt_hdr_signal_task *h;
+	struct sigpending pending;
+	unsigned long flags;
 	int ret;
 
+	INIT_LIST_HEAD(&pending.list);
+
+	/* temporarily borrow signal queue - see checkpoint_sigpending() */
+	if (!lock_task_sighand(t, &flags)) {
+		ckpt_write_err(ctx, "TE", "sighand missing", -EBUSY);
+		return -EBUSY;
+	}
+	list_splice_init(&t->pending.list, &pending.list);
+	pending.signal = t->pending.signal;
+	unlock_task_sighand(t, &flags);
+
+	ret = checkpoint_sigpending(ctx, &pending);
+
+	/* re-attach the borrowed queue */
+	if (!lock_task_sighand(t, &flags)) {
+		ckpt_write_err(ctx, "TE", "sighand missing", -EBUSY);
+		return -EBUSY;
+	}
+	list_splice(&pending.list, &t->pending.list);
+	unlock_task_sighand(t, &flags);
+
+	if (ret < 0)
+		return ret;
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL_TASK);
 	if (!h)
 		return -ENOMEM;
@@ -267,7 +526,21 @@ int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 int restore_task_signal(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_signal_task *h;
+	struct sigpending new_pending;
+	struct sigpending *pending;
 	sigset_t blocked;
+	int ret;
+
+	ret = restore_sigpending(ctx, &new_pending);
+	if (ret < 0)
+		return ret;
+
+	spin_lock_irq(&current->sighand->siglock);
+	pending = &current->pending;
+	flush_sigqueue(pending);
+	pending->signal = new_pending.signal;
+	list_splice_init(&new_pending.list, &pending->list);
+	spin_unlock_irq(&current->sighand->siglock);
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL_TASK);
 	if (IS_ERR(h))
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 203b4ee..fd2836e 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -90,6 +90,7 @@ enum {
 	CKPT_HDR_SIGHAND = 601,
 	CKPT_HDR_SIGNAL,
 	CKPT_HDR_SIGNAL_TASK,
+	CKPT_HDR_SIGPENDING,
 
 	CKPT_HDR_TAIL = 9001,
 
@@ -442,6 +443,28 @@ struct ckpt_hdr_sighand {
 	struct ckpt_sigaction action[0];
 } __attribute__((aligned(8)));
 
+#ifndef HAVE_ARCH_SIGINFO_T
+struct ckpt_siginfo {
+	__u32 signo;
+	__u32 _errno;
+	__u32 code;
+
+	__u32 pid;
+	__s32 uid;
+	__u32 sigval_int;
+	__u64 sigval_ptr;
+	__u64 utime;
+	__u64 stime;
+} __attribute__((aligned(8)));
+#endif
+
+struct ckpt_hdr_sigpending {
+	struct ckpt_hdr h;
+	__u32 nr_pending;
+	struct ckpt_sigset signal;
+	struct ckpt_siginfo siginfo[0];
+} __attribute__((aligned(8)));
+
 struct ckpt_rlimit {
 	__u64 rlim_cur;
 	__u64 rlim_max;
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 66/80] c/r: [signal 4/4] support for real/virt/prof itimers
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (64 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 65/80] c/r: [signal 3/4] pending signals (private, shared) Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 67/80] Expose may_setuid() in user.h and add may_setgid() (v2) Oren Laadan
                     ` (15 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

This patch adds support for real/virt/prof itimers.
Expiry and the interval values are both saved in nanoseconds.

Changelog[v1]:
  - [Louis Rilling] Fix saving of signal->it_real_incr if not expired
  - Fix restoring of signal->it_real_incr if expire is zero
  - Save virt/prof expire relative to process accumulated time
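
As a rough userspace illustration of how the saved values are derived
(not the kernel code itself): struct tv, tv_to_ns() and ns_to_tv() are
hypothetical stand-ins for struct timeval, timeval_to_ns() and
ns_to_timeval(), and the helper names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL
#define NSEC_PER_USEC	1000ULL

/* Hypothetical stand-in for struct timeval. */
struct tv {
	int64_t sec;
	int64_t usec;
};

static uint64_t tv_to_ns(struct tv t)
{
	return (uint64_t)t.sec * NSEC_PER_SEC +
	       (uint64_t)t.usec * NSEC_PER_USEC;
}

static struct tv ns_to_tv(uint64_t ns)
{
	struct tv t;

	t.sec = (int64_t)(ns / NSEC_PER_SEC);
	t.usec = (int64_t)((ns % NSEC_PER_SEC) / NSEC_PER_USEC);
	return t;
}

/*
 * Real timer: an inactive timer is saved as 0; an active timer whose
 * remaining time is already <= 0 is saved as one microsecond so that
 * it still fires after restart.
 */
static uint64_t real_value_ns(int active, int64_t remaining_ns)
{
	if (!active)
		return 0;
	if (remaining_ns <= 0)
		return NSEC_PER_USEC;
	return (uint64_t)remaining_ns;
}

/*
 * Virt/prof timers: 0 means disarmed; otherwise save the expiry
 * relative to the CPU time already consumed. This sketch assumes an
 * armed timer has not yet passed its expiry.
 */
static uint64_t relative_expiry_ns(uint64_t expires_ns, uint64_t consumed_ns)
{
	if (expires_ns == 0)
		return 0;
	assert(expires_ns >= consumed_ns);
	return expires_ns - consumed_ns;
}
```

Saving the remaining time rather than the absolute expiry is what lets
the restarted process, whose accumulated CPU time starts over, re-arm
the timers with the same effective deadline.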

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Louis Rilling <Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org>
---
 checkpoint/signal.c            |   86 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    6 +++
 include/linux/posix-timers.h   |    9 ++++
 kernel/posix-cpu-timers.c      |    9 ----
 4 files changed, 101 insertions(+), 9 deletions(-)

diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 27e0f10..5ff0734 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -15,6 +15,8 @@
 #include <linux/signal.h>
 #include <linux/errno.h>
 #include <linux/resource.h>
+#include <linux/timer.h>
+#include <linux/posix-timers.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -315,6 +317,8 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 	struct signal_struct *signal;
 	struct sigpending shared_pending;
 	struct rlimit *rlim;
+	struct timeval tval;
+	cputime_t cputime;
 	unsigned long flags;
 	int i, ret;
 
@@ -350,6 +354,50 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 		h->rlim[i].rlim_cur = rlim[i].rlim_cur;
 		h->rlim[i].rlim_max = rlim[i].rlim_max;
 	}
+
+	/* real/virt/prof itimers */
+	if (hrtimer_active(&signal->real_timer)) {
+		/* For an active timer compute the time delta */
+		ktime_t delta = hrtimer_get_remaining(&signal->real_timer);
+		/*
+		 * If the timer expired after the test above, then
+		 * set the expire to the minimum possible (because by
+		 * now the pending signals have been saved already, but
+		 * the signal from this very expiry won't be sent before
+		 * we release t->sighand->siglock).
+		 */
+		ckpt_debug("active ! %lld\n", delta.tv64);
+		if (delta.tv64 <= 0)
+			delta.tv64 = NSEC_PER_USEC;
+		h->it_real_value = ktime_to_ns(delta);
+	} else {
+		/*
+		 * Timer is inactive; if @it_real_incr is 0 the timer
+		 * will not be re-armed. Because we hold siglock, if
+		 * @it_real_incr > 0, the timer must have just expired
+		 * but not yet re-armed, and we have a SIGALRM pending
+		 * - that will trigger timer re-arm after restart.
+		 */
+		h->it_real_value = 0;
+	}
+	h->it_real_incr = ktime_to_ns(signal->it_real_incr);
+
+	cputime = signal->it_virt_expires;
+	if (!cputime_eq(cputime, cputime_zero))
+		cputime = cputime_sub(signal->it_virt_expires, virt_ticks(t));
+	cputime_to_timeval(cputime, &tval);
+	h->it_virt_value = timeval_to_ns(&tval);
+	cputime_to_timeval(signal->it_virt_incr, &tval);
+	h->it_virt_incr = timeval_to_ns(&tval);
+
+	cputime = signal->it_prof_expires;
+	if (!cputime_eq(cputime, cputime_zero))
+		cputime = cputime_sub(signal->it_prof_expires, prof_ticks(t));
+	cputime_to_timeval(cputime, &tval);
+	h->it_prof_value = timeval_to_ns(&tval);
+	cputime_to_timeval(signal->it_prof_incr, &tval);
+	h->it_prof_incr = timeval_to_ns(&tval);
+
 	unlock_task_sighand(t, &flags);
 
 	ret = ckpt_write_obj(ctx, &h->h);
@@ -423,6 +471,7 @@ static int restore_signal(struct ckpt_ctx *ctx)
 	struct ckpt_hdr_signal *h;
 	struct sigpending new_pending;
 	struct sigpending *pending;
+	struct itimerval itimer;
 	struct rlimit rlim;
 	int i, ret;
 
@@ -443,12 +492,49 @@ static int restore_signal(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 
+	/*
+	 * Reset real/virt/prof itimers (in case they were set), to
+	 * prevent unwanted signals after flushing current signals
+	 * and before restoring the original real/virt/prof itimers.
+	 */
+	itimer.it_value = (struct timeval) { .tv_sec = 0, .tv_usec = 0 };
+	itimer.it_interval =  (struct timeval) { .tv_sec = 0, .tv_usec = 0 };
+	do_setitimer(ITIMER_REAL, &itimer, NULL);
+	do_setitimer(ITIMER_VIRTUAL, &itimer, NULL);
+	do_setitimer(ITIMER_PROF, &itimer, NULL);
+
 	spin_lock_irq(&current->sighand->siglock);
 	pending = &current->signal->shared_pending;
 	flush_sigqueue(pending);
 	pending->signal = new_pending.signal;
 	list_splice_init(&new_pending.list, &pending->list);
 	spin_unlock_irq(&current->sighand->siglock);
+
+	/* real/virt/prof itimers */
+	itimer.it_value = ns_to_timeval(h->it_real_value);
+	itimer.it_interval = ns_to_timeval(h->it_real_incr);
+	ret = do_setitimer(ITIMER_REAL, &itimer, NULL);
+	if (ret < 0)
+		goto out;
+	/*
+	 * If expire is 0 but incr > 0 then we have a SIGALRM pending.
+	 * It should re-arm the timer when handled. But do_setitimer()
+	 * above already ignored @it_real_incr because @it_real_value
+	 * was zero. So we set it manually. (This is safe against
+	 * malicious input, because in the worst case it will generate
+	 * an unexpected SIGALRM to this process).
+	 */
+	if (!h->it_real_value && h->it_real_incr)
+		current->signal->it_real_incr = ns_to_ktime(h->it_real_incr);
+
+	itimer.it_value = ns_to_timeval(h->it_virt_value);
+	itimer.it_interval = ns_to_timeval(h->it_virt_incr);
+	ret = do_setitimer(ITIMER_VIRTUAL, &itimer, NULL);
+	if (ret < 0)
+		goto out;
+	itimer.it_value = ns_to_timeval(h->it_prof_value);
+	itimer.it_interval = ns_to_timeval(h->it_prof_incr);
+	ret = do_setitimer(ITIMER_PROF, &itimer, NULL);
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index fd2836e..e4dfbd7 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -482,6 +482,12 @@ struct ckpt_rlimit {
 struct ckpt_hdr_signal {
 	struct ckpt_hdr h;
 	struct ckpt_rlimit rlim[CKPT_RLIM_NLIMITS];
+	__u64 it_real_value;
+	__u64 it_real_incr;
+	__u64 it_virt_value;
+	__u64 it_virt_incr;
+	__u64 it_prof_value;
+	__u64 it_prof_incr;
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_signal_task {
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index d0d6a66..7dd69c3 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -125,4 +125,13 @@ void update_rlimit_cpu(unsigned long rlim_new);
 
 int invalid_clockid(const clockid_t which_clock);
 
+static inline cputime_t prof_ticks(struct task_struct *p)
+{
+	return cputime_add(p->utime, p->stime);
+}
+static inline cputime_t virt_ticks(struct task_struct *p)
+{
+	return p->utime;
+}
+
 #endif
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index e33a21c..a3491e6 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -167,15 +167,6 @@ static void bump_cpu_timer(struct k_itimer *timer,
 	}
 }
 
-static inline cputime_t prof_ticks(struct task_struct *p)
-{
-	return cputime_add(p->utime, p->stime);
-}
-static inline cputime_t virt_ticks(struct task_struct *p)
-{
-	return p->utime;
-}
-
 int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
 {
 	int error = check_clock(which_clock);
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 66/80] c/r: [signal 4/4] support for real/virt/prof itimers
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

This patch adds support for real/virt/prof itimers.
Expiry and the interval values are both saved in nanoseconds.

Changelog[v1]:
  - [Louis Rilling] Fix saving of signal->it_real_incr if not expired
  - Fix restoring of signal->it_real_incr if expire is zero
  - Save virt/prof expire relative to process accumulated time

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Louis Rilling <Louis.Rilling@kerlabs.com>
---
 checkpoint/signal.c            |   86 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    6 +++
 include/linux/posix-timers.h   |    9 ++++
 kernel/posix-cpu-timers.c      |    9 ----
 4 files changed, 101 insertions(+), 9 deletions(-)

diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 27e0f10..5ff0734 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -15,6 +15,8 @@
 #include <linux/signal.h>
 #include <linux/errno.h>
 #include <linux/resource.h>
+#include <linux/timer.h>
+#include <linux/posix-timers.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -315,6 +317,8 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 	struct signal_struct *signal;
 	struct sigpending shared_pending;
 	struct rlimit *rlim;
+	struct timeval tval;
+	cputime_t cputime;
 	unsigned long flags;
 	int i, ret;
 
@@ -350,6 +354,50 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 		h->rlim[i].rlim_cur = rlim[i].rlim_cur;
 		h->rlim[i].rlim_max = rlim[i].rlim_max;
 	}
+
+	/* real/virt/prof itimers */
+	if (hrtimer_active(&signal->real_timer)) {
+		/* For an active timer compute the time delta */
+		ktime_t delta = hrtimer_get_remaining(&signal->real_timer);
+		/*
+		 * If the timer expired after the the test above, then
+		 * set the expire to the minimum possible (because by
+		 * now the pending signal have been saved already, but
+		 * the signal from this very expiry won't be sent before
+		 * we release t->sighand->siglock).
+		 */
+		ckpt_debug("active ! %lld\n", delta.tv64);
+		if (delta.tv64 <= 0)
+			delta.tv64 = NSEC_PER_USEC;
+		h->it_real_value = ktime_to_ns(delta);
+	} else {
+		/*
+		 * Timer is inactive; if @it_real_incr is 0 the timer
+		 * will not be re-armed. Beacuse we hold siglock, if
+		 * @it_real_incr > 0, the timer must have just expired
+		 * but not yet re-armed, and we have a SIGALRM pending
+		 * - that will trigger timer re-arm after restart.
+		 */
+		h->it_real_value = 0;
+	}
+	h->it_real_incr = ktime_to_ns(signal->it_real_incr);
+
+	cputime = signal->it_virt_expires;
+	if (!cputime_eq(cputime, cputime_zero))
+		cputime = cputime_sub(signal->it_virt_expires, virt_ticks(t));
+	cputime_to_timeval(cputime, &tval);
+	h->it_virt_value = timeval_to_ns(&tval);
+	cputime_to_timeval(signal->it_virt_incr, &tval);
+	h->it_virt_incr = timeval_to_ns(&tval);
+
+	cputime = signal->it_prof_expires;
+	if (!cputime_eq(cputime, cputime_zero))
+		cputime = cputime_sub(signal->it_prof_expires, prof_ticks(t));
+	cputime_to_timeval(cputime, &tval);
+	h->it_prof_value = timeval_to_ns(&tval);
+	cputime_to_timeval(signal->it_prof_incr, &tval);
+	h->it_prof_incr = timeval_to_ns(&tval);
+
 	unlock_task_sighand(t, &flags);
 
 	ret = ckpt_write_obj(ctx, &h->h);
@@ -423,6 +471,7 @@ static int restore_signal(struct ckpt_ctx *ctx)
 	struct ckpt_hdr_signal *h;
 	struct sigpending new_pending;
 	struct sigpending *pending;
+	struct itimerval itimer;
 	struct rlimit rlim;
 	int i, ret;
 
@@ -443,12 +492,49 @@ static int restore_signal(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 
+	/*
+	 * Reset real/virt/prof itimer (in case they were set), to
+	 * prevent unwanted signals after flushing current signals
+	 * and before restoring original real/virt/prof itimer.
+	 */
+	itimer.it_value = (struct timeval) { .tv_sec = 0, .tv_usec = 0 };
+	itimer.it_interval =  (struct timeval) { .tv_sec = 0, .tv_usec = 0 };
+	do_setitimer(ITIMER_REAL, &itimer, NULL);
+	do_setitimer(ITIMER_VIRTUAL, &itimer, NULL);
+	do_setitimer(ITIMER_PROF, &itimer, NULL);
+
 	spin_lock_irq(&current->sighand->siglock);
 	pending = &current->signal->shared_pending;
 	flush_sigqueue(pending);
 	pending->signal = new_pending.signal;
 	list_splice_init(&new_pending.list, &pending->list);
 	spin_unlock_irq(&current->sighand->siglock);
+
+	/* real/virt/prof itimers */
+	itimer.it_value = ns_to_timeval(h->it_real_value);
+	itimer.it_interval = ns_to_timeval(h->it_real_incr);
+	ret = do_setitimer(ITIMER_REAL, &itimer, NULL);
+	if (ret < 0)
+		goto out;
+	/*
+	 * If expire is 0 but incr > 0 then we have a SIGALRM pending.
+	 * It should re-arm the timer when handled. But do_setitimer()
+	 * above already ignored @it_real_incr because @it_real_value
+	 * that was zero. So we set it manually. (This is safe against
+	 * malicious input, because in the worst case will generate an
+	 * unexpected SIGALRM to this process).
+	 */
+	if (!h->it_real_value && h->it_real_incr)
+		current->signal->it_real_incr = ns_to_ktime(h->it_real_incr);
+
+	itimer.it_value = ns_to_timeval(h->it_virt_value);
+	itimer.it_interval = ns_to_timeval(h->it_virt_incr);
+	ret = do_setitimer(ITIMER_VIRTUAL, &itimer, NULL);
+	if (ret < 0)
+		goto out;
+	itimer.it_value = ns_to_timeval(h->it_prof_value);
+	itimer.it_interval = ns_to_timeval(h->it_prof_incr);
+	ret = do_setitimer(ITIMER_PROF, &itimer, NULL);
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index fd2836e..e4dfbd7 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -482,6 +482,12 @@ struct ckpt_rlimit {
 struct ckpt_hdr_signal {
 	struct ckpt_hdr h;
 	struct ckpt_rlimit rlim[CKPT_RLIM_NLIMITS];
+	__u64 it_real_value;
+	__u64 it_real_incr;
+	__u64 it_virt_value;
+	__u64 it_virt_incr;
+	__u64 it_prof_value;
+	__u64 it_prof_incr;
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_signal_task {
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index d0d6a66..7dd69c3 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -125,4 +125,13 @@ void update_rlimit_cpu(unsigned long rlim_new);
 
 int invalid_clockid(const clockid_t which_clock);
 
+static inline cputime_t prof_ticks(struct task_struct *p)
+{
+	return cputime_add(p->utime, p->stime);
+}
+static inline cputime_t virt_ticks(struct task_struct *p)
+{
+	return p->utime;
+}
+
 #endif
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index e33a21c..a3491e6 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -167,15 +167,6 @@ static void bump_cpu_timer(struct k_itimer *timer,
 	}
 }
 
-static inline cputime_t prof_ticks(struct task_struct *p)
-{
-	return cputime_add(p->utime, p->stime);
-}
-static inline cputime_t virt_ticks(struct task_struct *p)
-{
-	return p->utime;
-}
-
 int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
 {
 	int error = check_clock(which_clock);
-- 
1.6.0.4
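
The zero-value special case in the ITIMER_REAL restore path above can be
illustrated with a small userspace model. This is only a sketch: struct
model_itimer, model_setitimer() and model_restore_real() are hypothetical
names, not kernel code, and the model assumes (as the patch comment states)
that setting a zero value also clears the interval.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-process ITIMER_REAL state. */
struct model_itimer {
	uint64_t value_ns;	/* time until next expiry; 0 means disarmed */
	uint64_t incr_ns;	/* reload interval; 0 means one-shot */
};

/*
 * Models the behaviour described in the patch comment: a zero value
 * disarms the timer and the interval is dropped along with it.
 */
static void model_setitimer(struct model_itimer *t,
			    uint64_t value_ns, uint64_t incr_ns)
{
	t->value_ns = value_ns;
	t->incr_ns = value_ns ? incr_ns : 0;
}

/*
 * Restore path: value == 0 with incr != 0 means a SIGALRM was pending
 * at checkpoint time, so the interval is put back by hand to let the
 * timer re-arm once that signal is handled.
 */
static void model_restore_real(struct model_itimer *t,
			       uint64_t value_ns, uint64_t incr_ns)
{
	model_setitimer(t, value_ns, incr_ns);
	if (!value_ns && incr_ns)
		t->incr_ns = incr_ns;
}
```

Calling model_restore_real(&t, 0, 500) leaves the timer disarmed but keeps
the 500 ns interval, whereas model_setitimer(&t, 0, 500) alone would lose it.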

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 67/80] Expose may_setuid() in user.h and add may_setgid() (v2)
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Dan Smith

From: Dan Smith <danms@us.ibm.com>

Make may_setuid() available outside kernel/user.c and add a matching
may_setgid() helper, so that other code can use them.

Changes in v2:
 - Avoid checking the groupinfo in ctx->realcred against the current in
   may_setgid()

Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 include/linux/user.h |    9 +++++++++
 kernel/user.c        |   13 ++++++++++++-
 2 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/user.h b/include/linux/user.h
index 68daf84..c231e9c 100644
--- a/include/linux/user.h
+++ b/include/linux/user.h
@@ -1 +1,10 @@
+#ifndef _LINUX_USER_H
+#define _LINUX_USER_H
+
 #include <asm/user.h>
+#include <linux/sched.h>
+
+extern int may_setuid(struct user_namespace *ns, uid_t uid);
+extern int may_setgid(gid_t gid);
+
+#endif
diff --git a/kernel/user.c b/kernel/user.c
index a535ed6..a78fde7 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -604,7 +604,7 @@ int checkpoint_user(struct ckpt_ctx *ctx, void *ptr)
 	return do_checkpoint_user(ctx, (struct user_struct *) ptr);
 }
 
-static int may_setuid(struct user_namespace *ns, uid_t uid)
+int may_setuid(struct user_namespace *ns, uid_t uid)
 {
 	/*
 	 * this next check will one day become
@@ -631,6 +631,17 @@ static int may_setuid(struct user_namespace *ns, uid_t uid)
 	return 0;
 }
 
+int may_setgid(gid_t gid)
+{
+	if (capable(CAP_SETGID))
+		return 1;
+
+	if (in_egroup_p(gid))
+		return 1;
+
+	return 0;
+}
+
 static struct user_struct *do_restore_user(struct ckpt_ctx *ctx)
 {
 	struct user_struct *u;
-- 
1.6.0.4
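
For readers unfamiliar with in_egroup_p(), the group-membership half of
may_setgid() has a close userspace analogue. The sketch below is not kernel
code: user_in_egroup() is a hypothetical helper built on the POSIX
getegid() and getgroups() calls, and it deliberately leaves out the
CAP_SETGID check that may_setgid() performs first.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if gid is the caller's effective gid or one of its
 * supplementary groups, 0 if it is not, and -1 if the group list
 * could not be read.
 */
static int user_in_egroup(gid_t gid)
{
	int i, n, found = 0;
	gid_t *groups;

	if (gid == getegid())
		return 1;

	n = getgroups(0, NULL);	/* size 0: only report the count */
	if (n < 0)
		return -1;
	if (n == 0)
		return 0;

	groups = malloc((size_t)n * sizeof(*groups));
	if (!groups)
		return -1;

	n = getgroups(n, groups);
	for (i = 0; i < n; i++) {
		if (groups[i] == gid) {
			found = 1;
			break;
		}
	}
	free(groups);
	return n < 0 ? -1 : found;
}
```

A restore path would use such a check to refuse a gid from the checkpoint
image that the restarting task could not have acquired on its own.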


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 68/80] Add common socket helpers to unify the security hooks
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Dan Smith, netdev

From: Dan Smith <danms@us.ibm.com>

This moves the meat out of the bind(), getsockname(), and getpeername() syscalls
into helper functions that perform the matching security_socket_*() hook and
then the sock->ops->call().  This allows a unification of this behavior between
the syscalls and the pending socket restart logic.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dan Smith <danms@us.ibm.com>
Cc: netdev@vger.kernel.org
---
 include/net/sock.h |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 net/socket.c       |   29 ++++++-----------------------
 2 files changed, 54 insertions(+), 23 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 950409d..12530bf 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1578,6 +1578,54 @@ extern void sock_enable_timestamp(struct sock *sk, int flag);
 extern int sock_get_timestamp(struct sock *, struct timeval __user *);
 extern int sock_get_timestampns(struct sock *, struct timespec __user *);
 
+/* bind() helper shared between any callers needing to perform a bind on
+ * behalf of userspace (syscall and restart) with the security hooks.
+ */
+static inline int sock_bind(struct socket *sock,
+			    struct sockaddr *addr,
+			    int addr_len)
+{
+	int err;
+
+	err = security_socket_bind(sock, addr, addr_len);
+	if (err)
+		return err;
+	else
+		return sock->ops->bind(sock, addr, addr_len);
+}
+
+/* getname() helper shared between any callers needing to perform a getname on
+ * behalf of userspace (syscall and restart) with the security hooks.
+ */
+static inline int sock_getname(struct socket *sock,
+			       struct sockaddr *addr,
+			       int *addr_len)
+{
+	int err;
+
+	err = security_socket_getsockname(sock);
+	if (err)
+		return err;
+	else
+		return sock->ops->getname(sock, addr, addr_len, 0);
+}
+
+/* getpeer() helper shared between any callers needing to perform a getpeer on
+ * behalf of userspace (syscall and restart) with the security hooks.
+ */
+static inline int sock_getpeer(struct socket *sock,
+			       struct sockaddr *addr,
+			       int *addr_len)
+{
+	int err;
+
+	err = security_socket_getpeername(sock);
+	if (err)
+		return err;
+	else
+		return sock->ops->getname(sock, addr, addr_len, 1);
+}
+
 /* 
  *	Enable debug/info messages 
  */
diff --git a/net/socket.c b/net/socket.c
index 6d47165..63c4498 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1414,15 +1414,10 @@ SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)
 	sock = sockfd_lookup_light(fd, &err, &fput_needed);
 	if (sock) {
 		err = move_addr_to_kernel(umyaddr, addrlen, (struct sockaddr *)&address);
-		if (err >= 0) {
-			err = security_socket_bind(sock,
-						   (struct sockaddr *)&address,
-						   addrlen);
-			if (!err)
-				err = sock->ops->bind(sock,
-						      (struct sockaddr *)
-						      &address, addrlen);
-		}
+		if (err >= 0)
+			err = sock_bind(sock,
+					(struct sockaddr *)&address,
+					addrlen);
 		fput_light(sock->file, fput_needed);
 	}
 	return err;
@@ -1610,11 +1605,7 @@ SYSCALL_DEFINE3(getsockname, int, fd, struct sockaddr __user *, usockaddr,
 	if (!sock)
 		goto out;
 
-	err = security_socket_getsockname(sock);
-	if (err)
-		goto out_put;
-
-	err = sock->ops->getname(sock, (struct sockaddr *)&address, &len, 0);
+	err = sock_getname(sock, (struct sockaddr *)&address, &len);
 	if (err)
 		goto out_put;
 	err = move_addr_to_user((struct sockaddr *)&address, len, usockaddr, usockaddr_len);
@@ -1639,15 +1630,7 @@ SYSCALL_DEFINE3(getpeername, int, fd, struct sockaddr __user *, usockaddr,
 
 	sock = sockfd_lookup_light(fd, &err, &fput_needed);
 	if (sock != NULL) {
-		err = security_socket_getpeername(sock);
-		if (err) {
-			fput_light(sock->file, fput_needed);
-			return err;
-		}
-
-		err =
-		    sock->ops->getname(sock, (struct sockaddr *)&address, &len,
-				       1);
+		err = sock_getpeer(sock, (struct sockaddr *)&address, &len);
 		if (!err)
 			err = move_addr_to_user((struct sockaddr *)&address, len, usockaddr,
 						usockaddr_len);
-- 
1.6.0.4
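
The shape shared by sock_bind(), sock_getname() and sock_getpeer() (run the
security hook, and call the protocol op only if the hook allows it) can be
shown with a small userspace model. All names below (model_sock,
model_security_bind(), model_sock_bind(), and so on) are hypothetical
stand-ins, not the kernel's types or hooks.

```c
#include <errno.h>

struct model_sock;

struct model_ops {
	int (*bind)(struct model_sock *sock, const char *addr);
};

struct model_sock {
	const struct model_ops *ops;
	int bound;	/* set by the protocol's bind op */
	int deny_bind;	/* makes the model security hook refuse */
};

/* Stand-in for the security hook: 0 allows, negative errno denies. */
static int model_security_bind(struct model_sock *sock, const char *addr)
{
	(void)addr;
	return sock->deny_bind ? -EACCES : 0;
}

/* Stand-in for a protocol's bind op. */
static int model_proto_bind(struct model_sock *sock, const char *addr)
{
	(void)addr;
	sock->bound = 1;
	return 0;
}

static const struct model_ops model_unix_ops = {
	.bind = model_proto_bind,
};

/* Same shape as sock_bind(): hook first, then the protocol op. */
static int model_sock_bind(struct model_sock *sock, const char *addr)
{
	int err = model_security_bind(sock, addr);

	if (err)
		return err;
	return sock->ops->bind(sock, addr);
}
```

Because every caller goes through the one helper, a restart path that binds
a socket on behalf of userspace cannot skip the hook by accident.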


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 69/80] c/r: introduce checkpoint/restore methods to struct proto_ops
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

This adds new 'proto_ops' methods for checkpointing and restoring
sockets. This allows the checkpoint/restart code to compile nicely
when, e.g., AF_UNIX sockets are selected as a module.

It also adds a method for 'collecting' a socket for leak-detection
during full-container checkpoint. This is useful for those sockets
that hold references to other "collectable" objects. Two examples are
AF_UNIX buffers, which reference the socket of origin, and sockets that
have file descriptors in-transit.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/net.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 4fc2ffd..b99f350 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -147,6 +147,9 @@ struct sockaddr;
 struct msghdr;
 struct module;
 
+struct ckpt_ctx;
+struct ckpt_hdr_socket;
+
 struct proto_ops {
 	int		family;
 	struct module	*owner;
@@ -191,6 +194,12 @@ struct proto_ops {
 				      int offset, size_t size, int flags);
 	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
 				       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
+	int		(*checkpoint)(struct ckpt_ctx *ctx,
+				      struct socket *sock);
+	int		(*collect)(struct ckpt_ctx *ctx,
+				   struct socket *sock);
+	int		(*restore)(struct ckpt_ctx *ctx, struct socket *sock,
+				   struct ckpt_hdr_socket *h);
 };
 
 struct net_proto_family {
-- 
1.6.0.4
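
One way a caller might dispatch through such optional methods is sketched
below as a userspace model. The names (model_proto_ops,
model_sock_checkpoint(), and so on) are hypothetical, and the NULL check
returning -ENOSYS is an assumption made for illustration; this patch only
adds the method slots and does not show how the c/r core calls them.

```c
#include <errno.h>
#include <stddef.h>

struct model_ctx {
	int written;	/* counts objects written to the image */
};

struct model_socket;

struct model_proto_ops {
	int family;
	/* Optional: left NULL by families without c/r support. */
	int (*checkpoint)(struct model_ctx *ctx, struct model_socket *sock);
};

struct model_socket {
	const struct model_proto_ops *ops;
};

static int model_unix_checkpoint(struct model_ctx *ctx,
				 struct model_socket *sock)
{
	(void)sock;
	ctx->written++;
	return 0;
}

/* A family that implements the method, and one that does not. */
static const struct model_proto_ops model_unix_ops = {
	.family = 1,
	.checkpoint = model_unix_checkpoint,
};
static const struct model_proto_ops model_other_ops = {
	.family = 2,
};

/* Generic code never names a family; it only follows the pointer. */
static int model_sock_checkpoint(struct model_ctx *ctx,
				 struct model_socket *sock)
{
	if (!sock->ops->checkpoint)
		return -ENOSYS;
	return sock->ops->checkpoint(ctx, sock);
}
```

Since the generic code references no family-specific symbol, it links the
same way whether the family is built in or loaded as a module.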


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 70/80] c/r: Add AF_UNIX support (v12)
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (68 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 69/80] c/r: introduce checkpoint/restore methods to struct proto_ops Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 71/80] c/r: [pty 1/2] allow allocation of desired pty slave Oren Laadan
                     ` (11 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Alexey Dobriyan,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds,
	netdev-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar, Dan Smith,
	Pavel Emelyanov

From: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

This patch adds basic checkpoint/restart support for AF_UNIX sockets.  It
has been tested with single and multiple processes, and with data in flight
at the time of checkpoint.  It supports socketpair()s as well as path-based
and abstract sockets.

Changes in v12:
  - Collect sockets for leak-detection
  - Adjust socket reference count during leak detection phase

Changes in v11:
  - Create a struct socket for orphan socket during checkpoint
  - Make sockets proper objhash objects and use checkpoint_obj() on them
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - Remove struct timeval from socket header
  - Save and restore UNIX socket peer credentials
  - Set socket flags on restore using sock_setsockopt() where possible
  - Fail on the TIMESTAMPING_* flags for the moment (with a TODO)
  - Remove other explicit flag checks that are no longer copied blindly
  - Changed functions/variables names to follow existing conventions
  - Use proto_ops->{checkpoint,restore} methods for af_unix
  - Cleanup sock_file_restore()/sock_file_checkpoint()
  - Make ckpt_hdr_socket be part of ckpt_hdr_file_socket
  - Fold do_sock_file_checkpoint() into sock_file_checkpoint()
  - Fold do_sock_file_restore() into sock_file_restore()
  - Move sock_file_{checkpoint,restore} to net/checkpoint.c
  - Properly define sock_file_{checkpoint,restore} in header file
  - sock_file_restore() now calls restore_file_common()

Changes in v10:
  - Moved header structure definitions back to checkpoint_hdr.h
  - Moved AF_UNIX checkpoint/restart code to net/unix/checkpoint.c
  - Make sock_unix_*() functions only compile if CONFIG_UNIX=y
  - Add TODO for CONFIG_UNIX=m case

Changes in v9:
  - Fix double-free of skb's in the list and target holding queue in the
    error path of sock_copy_buffers()
  - Adjust use of ckpt_read_string() to match new signature

Changes in v8:
  - Fix stale dev_alloc_skb() from before the conversion to skb_clone()
  - Fix a couple of broken error paths
  - Fix memory leak of kvec.iov_base on successful return from sendmsg()
  - Fix condition for deciding when to run sock_cptrst_verify()
  - Fix buffer queue copy algorithm to hold the lock during walk(s)
  - Log the errno when either getname() or getpeer() fails
  - Add comments about ancillary messages in the UNIX queue
  - Add TODO comments for credential restore and flags via setsockopt()
  - Add TODO comment about strangely-connected dgram sockets and the use
    of sendmsg(peer)

Changes in v7:
  - Fix failure to free iov_base in error path of sock_read_buffer()
  - Change sock_read_buffer() to use _ckpt_read_obj_type() to get the
    header length and then use ckpt_kread() directly to read the payload
  - Change sock_read_buffers() to sock_unix_read_buffers() and break out
    some common functionality to better accommodate the subsequent INET
    patch
  - Generalize sock_unix_getnames() into sock_getnames() so INET can use it
  - Change skb_morph() to skb_clone() which uses the more common path and
    still avoids the copy
  - Add check to validate the socket type before creating socket
    on restore
  - Comment the CAP_NET_ADMIN override in sock_read_buffer_hdr
  - Strengthen the comment about priming the buffer limits
  - Change the objhash functions to deny direct checkpoint of sockets and
    remove the reference counting function
  - Change SOCKET_BUFFERS to SOCKET_QUEUE
  - Change this,peer objrefs to signed integers
  - Remove names from internal socket structures
  - Fix handling of sock_copy_buffers() result
  - Use ckpt_fill_fname() instead of d_path() for writing CWD
  - Use sock_getname() and sock_getpeer() for proper security hookage
  - Return -ENOSYS for unsupported socket families in checkpoint and restart
  - Use sock_setsockopt() and sock_getsockopt() where possible to save and
    restore socket option values
  - Check for SOCK_DESTROY flag in the global verify function because none
    of our supported socket types use it
  - Check for SOCK_USE_WRITE_QUEUE in AF_UNIX restore function because
    that flag should not be used on such a socket
  - Check socket state in UNIX restart path to validate the subset of valid
    values

Changes in v6:
  - Moved the socket addresses to the per-type header
  - Eliminated the HASCWD flag
  - Remove use of ckpt_write_err() in restart paths
  - Change the order in which buffers are read so that we can set the
    socket's limit equal to the size of the image's buffers (if appropriate)
    and then restore the original values afterwards.
  - Use the ckpt_validate_errno() helper
  - Add a check to make sure that we didn't restore a (UNIX) socket with
    any skb's in the send buffer
  - Fix up sock_unix_join() to not leave addr uninitialized for socketpair
  - Remove inclusion of checkpoint_hdr.h in the socket files
  - Make sock_unix_write_cwd() use ckpt_write_string() and use the new
    ckpt_read_string() for reading the cwd
  - Use the restored realcred credentials in sock_unix_join()
  - Fix error path of the chdir_and_bind
  - Change the algorithm for reloading the socket buffers to use sendmsg()
    on the socket's peer for better accounting
  - For DGRAM sockets, check the backlog value against the system max
    to avoid letting a restart bypass the overloaded queue length
  - Use sock_bind() instead of sock->ops->bind() to gain the security hook
  - Change "restart" to "restore" in some of the function names

Changes in v5:
  - Change laddr and raddr buffers in socket header to be long enough
    for INET6 addresses
  - Place socket.c and sock.h function definitions inside #ifdef
    CONFIG_CHECKPOINT
  - Add explicit check in sock_unix_makeaddr() to refuse if the
    checkpoint image specifies an addr length of 0
  - Split sock_unix_restart() into a few pieces to facilitate:
  - Changed behavior of the unix restore code so that unlinked LISTEN
    sockets don't do a bind()...unlink()
  - Save the base path of a bound socket's path so that we can chdir()
    to the base before bind() if it is a relative path
  - Call bind() for any socket that is not established but has a
    non-zero-length local address
  - Enforce the current sysctl limit on socket buffer size during restart
    unless the user holds CAP_NET_ADMIN
  - Unlink a path-based socket before calling bind()

Changes in v4:
  - Changed the signedness of rcvlowat, rcvtimeo, sndtimeo, and backlog
    to match their struct sock definitions.  This should avoid issues
    with sign extension.
  - Add a sock_cptrst_verify() function to be run at restore time to
    validate several of the values in the checkpoint image against
    limits, flag masks, etc.
  - Write an error string with ckpt_write_err() in the obscure cases
  - Don't write socket buffers for listen sockets
  - Sanity check address lengths before we agree to allocate memory
  - Check the result of inserting the peer object in the objhash on
    restart
  - Check return value of sock_cptrst() on restart
  - Change logic in remote getname() phase of checkpoint to not fail for
    closed (et al) sockets
  - Eliminate the memory copy while reading socket buffers on restart

Changes in v3:
  - Move sock_file_checkpoint() above sock_file_restore()
  - Change __sock_file_*() functions to do_sock_file_*()
  - Adjust some of the struct cr_hdr_socket alignment
  - Improve the sock_copy_buffers() algorithm to avoid locking the source
    queue for the entire operation
  - Fix alignment in the socket header struct(s)
  - Move the per-protocol structure (ckpt_hdr_socket_un) out of the
    common socket header and read/write it separately
  - Fix missing call to sock_cptrst() in restore path
  - Break out the socket joining into another function
  - Fix failure to restore the socket address thus fixing getname()
  - Check the state values on restart
  - Fix case of state being TCP_CLOSE, which allows dgram sockets to be
    properly connected (if appropriate) to their peer and maintain the
    sockaddr for getname() operation
  - Fix restoring a listening socket that has been unlink()'d
  - Fix checkpointing sockets with an in-flight FD-passing SKB.  Fail
    with EBUSY.
  - Fix checkpointing listening sockets with an unaccepted connection.
    Fail with EBUSY.
  - Changed 'un' to 'unix' in function and structure names

Changes in v2:
  - Change GFP_KERNEL to GFP_ATOMIC in sock_copy_buffers() (this seems
    to be rather common in other uses of skb_copy())
  - Move the ckpt_hdr_socket structure definition to linux/socket.h
  - Fix whitespace issue
  - Move sock_file_checkpoint() to net/socket.c for symmetry

Cc: Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
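
Note on the buffer-size handling in sock_cptrst_bufopts(): on Linux,
setting SO_RCVBUF or SO_SNDBUF stores twice the requested value, and
getsockopt() reports the doubled figure. The checkpoint side therefore
halves what it reads, unless the value sits at the kernel minimum, so that
the restore-time setsockopt() lands on the original size. The userspace
sketch below illustrates both halves under stated assumptions:
rcvbuf_after_set() and rcvbuf_to_save() are hypothetical names, and the
floor argument stands in for the kernel-internal SOCK_MIN_RCVBUF, which is
not visible to userspace.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Set SO_RCVBUF on a fresh AF_UNIX socket and return the value that
 * getsockopt() reports afterwards, or -1 on any error.
 */
static int rcvbuf_after_set(int requested)
{
	int fd, val = requested, got = -1;
	socklen_t len = sizeof(got);

	assert(requested > 0);

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val)) < 0 ||
	    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len) < 0)
		got = -1;

	close(fd);
	return got;
}

/*
 * Value to store in the image so that a later setsockopt() reproduces
 * 'reported'. A value equal to the floor is kept as-is, mirroring the
 * SOCK_MIN_RCVBUF test in the patch.
 */
static int rcvbuf_to_save(int reported, int floor)
{
	return reported == floor ? reported : reported >> 1;
}
```

Calling rcvbuf_to_save(rcvbuf_after_set(n), floor) is expected to give back
n for requests between the floor and the rmem_max sysctl; outside that
range the kernel clamps the value and the round trip is not exact.
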
 checkpoint/files.c             |    7 +
 checkpoint/objhash.c           |   69 ++++
 include/linux/checkpoint.h     |    7 +
 include/linux/checkpoint_hdr.h |   87 +++++
 include/linux/net.h            |    2 +
 include/net/af_unix.h          |   14 +
 include/net/sock.h             |   12 +
 net/Makefile                   |    2 +
 net/checkpoint.c               |  752 ++++++++++++++++++++++++++++++++++++++++
 net/socket.c                   |    6 +-
 net/unix/Makefile              |    1 +
 net/unix/af_unix.c             |    9 +
 net/unix/checkpoint.c          |  634 +++++++++++++++++++++++++++++++++
 13 files changed, 1601 insertions(+), 1 deletions(-)
 create mode 100644 net/checkpoint.c
 create mode 100644 net/unix/checkpoint.c
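
For reviewers of sock_copy_buffers() below: it counts the queue under the
lock, drops the lock to allocate, then walks the queue a second time under
the lock and returns -EAGAIN if the length changed in between. The
userspace sketch here shows the same two-pass pattern on a plain linked
list. Every name in it (struct node, struct queue, queue_snapshot()) is
hypothetical, and a C11 atomic_flag spinlock stands in for the skb queue
lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct node {
	struct node *next;
	uint32_t len;
};

struct queue {
	atomic_flag lock;
	struct node *head;
};

static void q_lock(struct queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock,
						 memory_order_acquire))
		; /* spin */
}

static void q_unlock(struct queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/*
 * Copy the length of every queued node into a newly allocated array.
 * Returns the entry count, -ENOMEM if allocation fails, or -EAGAIN if
 * the queue length changed between the two passes.
 */
static int queue_snapshot(struct queue *q, uint32_t **out,
			  uint32_t *total_bytes)
{
	struct node *n;
	uint32_t *copy;
	int count1 = 0, count2 = 0;

	assert(q && out && total_bytes);
	*out = NULL;
	*total_bytes = 0;

	/* Pass 1: count the entries while holding the lock. */
	q_lock(q);
	for (n = q->head; n; n = n->next)
		count1++;
	q_unlock(q);

	/* Allocate with the lock dropped. */
	copy = calloc(count1 ? count1 : 1, sizeof(*copy));
	if (!copy)
		return -ENOMEM;

	/* Pass 2: copy under the lock and stop if the queue grew. */
	q_lock(q);
	for (n = q->head; n; n = n->next) {
		if (++count2 > count1)
			break;
		copy[count2 - 1] = n->len;
		*total_bytes += n->len;
	}
	q_unlock(q);

	if (count1 != count2) {
		free(copy);
		*total_bytes = 0;
		return -EAGAIN;
	}

	*out = copy;
	return count1;
}
```

The caller owns the returned array and frees it. On -EAGAIN the caller can
simply retry, since nothing from the failed attempt is left behind.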

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 1de89d6..058bc0e 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -22,6 +22,7 @@
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <net/sock.h>
 
 
 /**************************************************************************
@@ -591,6 +592,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_FIFO,
 		.restore = fifo_file_restore,
 	},
+	/* socket */
+	{
+		.file_name = "SOCKET",
+		.file_type = CKPT_FILE_SOCKET,
+		.restore = sock_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index bf2f761..0978060 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -20,6 +20,7 @@
 #include <linux/user_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <net/sock.h>
 
 struct ckpt_obj;
 struct ckpt_obj_ops;
@@ -234,6 +235,40 @@ static void obj_groupinfo_drop(void *ptr, int lastref)
 	put_group_info((struct group_info *) ptr);
 }
 
+static int obj_sock_grab(void *ptr)
+{
+	sock_hold((struct sock *) ptr);
+	return 0;
+}
+
+static void obj_sock_drop(void *ptr, int lastref)
+{
+	struct sock *sk = (struct sock *) ptr;
+
+	/*
+	 * Sockets created during restart are graft()ed, i.e. have a
+	 * valid @sk->sk_socket. Because only an fput() results in the
+	 * necessary sock_release(), we may leak the struct socket of
+	 * sockets that were not attached to a file. Therefore, if
+	 * @lastref is set, we hereby invoke sock_release() on sockets
+	 * that we have put into the objhash but were never attached
+	 * to a file.
+	 */
+	if (lastref && sk->sk_socket && !sk->sk_socket->file) {
+		struct socket *sock = sk->sk_socket;
+		sock_orphan(sk);
+		sock->sk = NULL;
+		sock_release(sock);
+	}
+
+	sock_put((struct sock *) ptr);
+}
+
+static int obj_sock_users(void *ptr)
+{
+	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -362,6 +397,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_groupinfo,
 		.restore = restore_groupinfo,
 	},
+	/* sock object */
+	{
+		.obj_name = "SOCKET",
+		.obj_type = CKPT_OBJ_SOCK,
+		.ref_drop = obj_sock_drop,
+		.ref_grab = obj_sock_grab,
+		.ref_users = obj_sock_users,
+		.checkpoint = checkpoint_sock,
+		.restore = restore_sock,
+	},
 };
 
 
@@ -751,6 +796,26 @@ static void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment)
  */
 
 /**
+ * obj_sock_adjust_users - remove implicit reference on DEAD sockets
+ * @obj: CKPT_OBJ_SOCK object to adjust
+ *
+ * Sockets that have been disconnected from their struct file have
+ * a reference count one less than normal sockets.  The objhash's
+ * assumption of such a reference is therefore incorrect, so we correct
+ * it here.
+ */
+static inline void obj_sock_adjust_users(struct ckpt_obj *obj)
+{
+	struct sock *sk = (struct sock *)obj->ptr;
+
+	if (sock_flag(sk, SOCK_DEAD)) {
+		obj->users--;
+		ckpt_debug("Adjusting SOCK %i count to %i\n",
+			   obj->objref, obj->users);
+	}
+}
+
+/**
  * ckpt_obj_contained - test if shared objects are contained in checkpoint
  * @ctx: checkpoint context
  *
@@ -773,6 +838,10 @@ int ckpt_obj_contained(struct ckpt_ctx *ctx)
 	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
 		if (!obj->ops->ref_users)
 			continue;
+
+		if (obj->ops->obj_type == CKPT_OBJ_SOCK)
+			obj_sock_adjust_users(obj);
+
 		if (obj->ops->ref_users(obj->ptr) != obj->users) {
 			ckpt_debug("usage leak: %s\n", obj->ops->obj_name);
 			ckpt_write_err(ctx, "OP", "leak: usage (%d != %d (%s)",
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ec98a43..92a21b2 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -29,6 +29,7 @@
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
+#include <net/sock.h>
 
 /* ckpt_ctx: kflags */
 #define CKPT_CTX_CHECKPOINT_BIT		0
@@ -77,6 +78,12 @@ extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 extern char *ckpt_fill_fname(struct path *path, struct path *root,
 			     char *buf, int *len);
 
+/* socket functions */
+extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
+			      struct socket *socket,
+			      struct sockaddr *loc, unsigned *loc_len,
+			      struct sockaddr *rem, unsigned *rem_len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index e4dfbd7..ac16c59 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -12,6 +12,14 @@
 
 #include <linux/types.h>
 
+#ifdef __KERNEL__
+#include <linux/socket.h>
+#include <linux/un.h>
+#else
+#include <sys/socket.h>
+#include <sys/un.h>
+#endif
+
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
@@ -92,6 +100,11 @@ enum {
 	CKPT_HDR_SIGNAL_TASK,
 	CKPT_HDR_SIGPENDING,
 
+	CKPT_HDR_SOCKET = 701,
+	CKPT_HDR_SOCKET_QUEUE,
+	CKPT_HDR_SOCKET_BUFFER,
+	CKPT_HDR_SOCKET_UNIX,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -127,6 +140,7 @@ enum obj_type {
 	CKPT_OBJ_CRED,
 	CKPT_OBJ_USER,
 	CKPT_OBJ_GROUPINFO,
+	CKPT_OBJ_SOCK,
 	CKPT_OBJ_MAX
 };
 
@@ -353,6 +367,7 @@ enum file_type {
 	CKPT_FILE_GENERIC,
 	CKPT_FILE_PIPE,
 	CKPT_FILE_FIFO,
+	CKPT_FILE_SOCKET,
 	CKPT_FILE_MAX
 };
 
@@ -376,6 +391,78 @@ struct ckpt_hdr_file_pipe {
 	__s32 pipe_objref;
 } __attribute__((aligned(8)));
 
+/* socket */
+struct ckpt_hdr_socket {
+	struct ckpt_hdr h;
+
+	struct { /* struct socket */
+		__u64 flags;
+		__u8 state;
+	} socket __attribute__ ((aligned(8)));
+
+	struct { /* struct sock_common */
+		__u32 bound_dev_if;
+		__u32 reuse;
+		__u16 family;
+		__u8 state;
+	} sock_common __attribute__ ((aligned(8)));
+
+	struct { /* struct sock */
+		__s64 rcvlowat;
+		__u64 flags;
+
+		__s64 rcvtimeo;
+		__s64 sndtimeo;
+
+		__u32 err;
+		__u32 err_soft;
+		__u32 priority;
+		__s32 rcvbuf;
+		__s32 sndbuf;
+		__u16 type;
+		__s16 backlog;
+
+		__u8 protocol;
+		__u8 state;
+		__u8 shutdown;
+		__u8 userlocks;
+		__u8 no_check;
+
+		struct linger linger;
+	} sock __attribute__ ((aligned(8)));
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket_queue {
+	struct ckpt_hdr h;
+	__u32 skb_count;
+	__u32 total_bytes;
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket_buffer {
+	struct ckpt_hdr h;
+	__s32 sk_objref;
+	__s32 pr_objref;
+};
+
+#define CKPT_UNIX_LINKED 1
+struct ckpt_hdr_socket_unix {
+	struct ckpt_hdr h;
+	__s32 this;
+	__s32 peer;
+	__u32 peercred_uid;
+	__u32 peercred_gid;
+	__u32 flags;
+	__u32 laddr_len;
+	__u32 raddr_len;
+	struct sockaddr_un laddr;
+	struct sockaddr_un raddr;
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_file_socket {
+	struct ckpt_hdr_file common;
+	__s32 sock_objref;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/net.h b/include/linux/net.h
index b99f350..d1ce6eb 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -232,6 +232,8 @@ extern int   	     sock_sendmsg(struct socket *sock, struct msghdr *msg,
 				  size_t len);
 extern int	     sock_recvmsg(struct socket *sock, struct msghdr *msg,
 				  size_t size, int flags);
+extern int	     sock_attach_fd(struct socket *sock, struct file *file,
+				    int flags);
 extern int 	     sock_map_fd(struct socket *sock, int flags);
 extern struct socket *sockfd_lookup(int fd, int *err);
 #define		     sockfd_put(sock) fput(sock->file)
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 1614d78..e42a714 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -68,4 +68,18 @@ static inline int unix_sysctl_register(struct net *net) { return 0; }
 static inline void unix_sysctl_unregister(struct net *net) {}
 #endif
 #endif
+
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_socket;
+extern int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
+extern int unix_restore(struct ckpt_ctx *ctx, struct socket *sock,
+			struct ckpt_hdr_socket *h);
+extern int unix_collect(struct ckpt_ctx *ctx, struct socket *sock);
+
+#else
+#define unix_checkpoint NULL
+#define unix_restore NULL
+#endif /* CONFIG_CHECKPOINT */
+
 #endif
diff --git a/include/net/sock.h b/include/net/sock.h
index 12530bf..ec351f9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1646,4 +1646,16 @@ extern int sysctl_optmem_max;
 extern __u32 sysctl_wmem_default;
 extern __u32 sysctl_rmem_default;
 
+#ifdef CONFIG_CHECKPOINT
+/* Checkpoint/Restart Functions */
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern int checkpoint_sock(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_sock(struct ckpt_ctx *ctx);
+extern int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+extern struct file *sock_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *h);
+extern int sock_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#endif
+
 #endif	/* _SOCK_H */
diff --git a/net/Makefile b/net/Makefile
index ba324ae..91d12fe 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -66,3 +66,5 @@ ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_SYSCTL)		+= sysctl_net.o
 endif
 obj-$(CONFIG_WIMAX)		+= wimax/
+
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/net/checkpoint.c b/net/checkpoint.c
new file mode 100644
index 0000000..a11ec7a
--- /dev/null
+++ b/net/checkpoint.c
@@ -0,0 +1,752 @@
+/*
+ *  Copyright 2009 IBM Corporation
+ *
+ *  Author(s): Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
+ *             Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include <linux/socket.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/syscalls.h>
+#include <linux/sched.h>
+#include <linux/fs_struct.h>
+
+#include <net/af_unix.h>
+#include <net/tcp_states.h>
+
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+struct dq_buffers {
+	struct ckpt_ctx *ctx;
+	struct sock *sk;
+};
+
+static int sock_copy_buffers(struct sk_buff_head *from,
+			     struct sk_buff_head *to,
+			     uint32_t *total_bytes)
+{
+	int count1 = 0;
+	int count2 = 0;
+	int i;
+	struct sk_buff *skb;
+	struct sk_buff **skbs;
+
+	*total_bytes = 0;
+
+	spin_lock(&from->lock);
+	skb_queue_walk(from, skb)
+		count1++;
+	spin_unlock(&from->lock);
+
+	skbs = kzalloc(sizeof(*skbs) * count1, GFP_KERNEL);
+	if (!skbs)
+		return -ENOMEM;
+
+	for (i = 0; i < count1;  i++) {
+		skbs[i] = dev_alloc_skb(0);
+		if (!skbs[i])
+			goto err;
+	}
+
+	i = 0;
+	spin_lock(&from->lock);
+	skb_queue_walk(from, skb) {
+		if (++count2 > count1)
+			break; /* The queue changed as we read it */
+
+		skb_morph(skbs[i], skb);
+		skbs[i]->sk = skb->sk;
+		skb_queue_tail(to, skbs[i]);
+
+		*total_bytes += skb->len;
+		i++;
+	}
+	spin_unlock(&from->lock);
+
+	if (count1 != count2)
+		goto err;
+
+	kfree(skbs);
+
+	return count1;
+ err:
+	while (skb_dequeue(to))
+		; /* Pull all the buffers out of the queue */
+	for (i = 0; i < count1; i++)
+		kfree_skb(skbs[i]);
+	kfree(skbs);
+
+	return -EAGAIN;
+}
+
+static int __sock_write_buffers(struct ckpt_ctx *ctx,
+				struct sk_buff_head *queue,
+				int dst_objref)
+{
+	struct sk_buff *skb;
+
+	skb_queue_walk(queue, skb) {
+		struct ckpt_hdr_socket_buffer *h;
+		int ret = 0;
+
+		/* FIXME: This could be a false positive for non-unix
+		 *        buffers, so add a type check here in the
+		 *        future
+		 */
+		if (UNIXCB(skb).fp) {
+			ckpt_write_err(ctx, "TE", "af_unix: pass fd", -EBUSY);
+			return -EBUSY;
+		}
+
+		/* The other ancillary messages are always present
+		 * unlike descriptors.  Even though we can't detect
+		 * them and fail the checkpoint, we're not at risk
+		 * because we don't save out (or restore) the control
+		 * information contained in the skb.
+		 */
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+		if (!h)
+			return -ENOMEM;
+
+		BUG_ON(!skb->sk);
+		ret = checkpoint_obj(ctx, skb->sk, CKPT_OBJ_SOCK);
+		if (ret < 0)
+			goto end;
+		h->sk_objref = ret;
+		h->pr_objref = dst_objref;
+
+		ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+		if (ret < 0)
+			goto end;
+
+		ret = ckpt_write_obj_type(ctx, skb->data, skb->len,
+					  CKPT_HDR_BUFFER);
+	end:
+		ckpt_hdr_put(ctx, h);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int sock_write_buffers(struct ckpt_ctx *ctx,
+			      struct sk_buff_head *queue,
+			      int dst_objref)
+{
+	struct ckpt_hdr_socket_queue *h;
+	struct sk_buff_head tmpq;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
+	if (!h)
+		return -ENOMEM;
+
+	skb_queue_head_init(&tmpq);
+
+	ret = sock_copy_buffers(queue, &tmpq, &h->total_bytes);
+	if (ret < 0)
+		goto out;
+
+	h->skb_count = ret;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (!ret)
+		ret = __sock_write_buffers(ctx, &tmpq, dst_objref);
+
+ out:
+	ckpt_hdr_put(ctx, h);
+	__skb_queue_purge(&tmpq);
+
+	return ret;
+}
+
+int sock_deferred_write_buffers(void *data)
+{
+	struct dq_buffers *dq = (struct dq_buffers *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	int ret;
+	int dst_objref;
+
+	dst_objref = ckpt_obj_lookup(ctx, dq->sk, CKPT_OBJ_SOCK);
+	if (dst_objref < 0) {
+		ckpt_write_err(ctx, "TE", "socket: owner gone?", dst_objref);
+		return dst_objref;
+	}
+
+	ret = sock_write_buffers(ctx, &dq->sk->sk_receive_queue, dst_objref);
+	ckpt_debug("write recv buffers: %i\n", ret);
+	if (ret < 0)
+		return ret;
+
+	ret = sock_write_buffers(ctx, &dq->sk->sk_write_queue, dst_objref);
+	ckpt_debug("write send buffers: %i\n", ret);
+
+	return ret;
+}
+
+int sock_defer_write_buffers(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct dq_buffers dq;
+
+	dq.ctx = ctx;
+	dq.sk = sk;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      sock_deferred_write_buffers, NULL);
+}
+
+int ckpt_sock_getnames(struct ckpt_ctx *ctx, struct socket *sock,
+		       struct sockaddr *loc, unsigned *loc_len,
+		       struct sockaddr *rem, unsigned *rem_len)
+{
+	int ret;
+
+	ret = sock_getname(sock, loc, loc_len);
+	if (ret) {
+		ckpt_write_err(ctx, "TEP", "socket: getname local", ret, sock);
+		return -EINVAL;
+	}
+
+	ret = sock_getpeer(sock, rem, rem_len);
+	if (ret) {
+		if ((sock->sk->sk_type != SOCK_DGRAM) &&
+		    (sock->sk->sk_state == TCP_ESTABLISHED)) {
+			ckpt_write_err(ctx, "TEP", "socket: getname peer",
+				       ret, sock);
+			return -EINVAL;
+		}
+		*rem_len = 0;
+	}
+
+	return 0;
+}
+
+static int sock_cptrst_verify(struct ckpt_hdr_socket *h)
+{
+	uint8_t userlocks_mask = SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK |
+		                 SOCK_BINDADDR_LOCK | SOCK_BINDPORT_LOCK;
+
+	if (h->sock.shutdown & ~SHUTDOWN_MASK)
+		return -EINVAL;
+	if (h->sock.userlocks & ~userlocks_mask)
+		return -EINVAL;
+	if (!ckpt_validate_errno(h->sock.err))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int sock_cptrst_opt(int op, struct socket *sock,
+			   int optname, char *opt, int len)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	if (op == CKPT_CPT)
+		ret = sock_getsockopt(sock, SOL_SOCKET, optname, opt, &len);
+	else
+		ret = sock_setsockopt(sock, SOL_SOCKET, optname, opt, len);
+
+	set_fs(fs);
+
+	return ret;
+}
+
+#define CKPT_COPY_SOPT(op, sk, name, opt) \
+	sock_cptrst_opt(op, sk->sk_socket, name, (char *)opt, sizeof(*opt))
+
+static int sock_cptrst_bufopts(int op, struct sock *sk,
+			       struct ckpt_hdr_socket *h)
+{
+	if (CKPT_COPY_SOPT(op, sk, SO_RCVBUF, &h->sock.rcvbuf))
+		if ((op == CKPT_RST) &&
+		    CKPT_COPY_SOPT(op, sk, SO_RCVBUFFORCE, &h->sock.rcvbuf)) {
+			ckpt_debug("Failed to set SO_RCVBUF");
+			return -EINVAL;
+		}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_SNDBUF, &h->sock.sndbuf))
+		if ((op == CKPT_RST) &&
+		    CKPT_COPY_SOPT(op, sk, SO_SNDBUFFORCE, &h->sock.sndbuf)) {
+			ckpt_debug("Failed to set SO_SNDBUF");
+			return -EINVAL;
+		}
+
+	/* It's silly that we have to fight ourselves here, but
+	 * sock_setsockopt() doubles the initial value, so divide here
+	 * to store the user's value and avoid doubling on restart
+	 */
+	if ((op == CKPT_CPT) && (h->sock.rcvbuf != SOCK_MIN_RCVBUF))
+		h->sock.rcvbuf >>= 1;
+
+	if ((op == CKPT_CPT) && (h->sock.sndbuf != SOCK_MIN_SNDBUF))
+		h->sock.sndbuf >>= 1;
+
+	return 0;
+}
+
+struct sock_flag_mapping {
+	int opt;
+	int flag;
+};
+
+struct sock_flag_mapping sk_flag_map[] = {
+	{SO_OOBINLINE, SOCK_URGINLINE},
+	{SO_KEEPALIVE, SOCK_KEEPOPEN},
+	{SO_BROADCAST, SOCK_BROADCAST},
+	{SO_TIMESTAMP, SOCK_RCVTSTAMP},
+	{SO_TIMESTAMPNS, SOCK_RCVTSTAMPNS},
+	{SO_DEBUG, SOCK_DBG},
+	{SO_DONTROUTE, SOCK_LOCALROUTE},
+};
+
+struct sock_flag_mapping sock_flag_map[] = {
+	{SO_PASSCRED, SOCK_PASSCRED},
+};
+
+static int sock_restore_flag(struct socket *sock,
+			     unsigned long *flags,
+			     int flag,
+			     int option)
+{
+	int v = 1;
+	int ret = 0;
+
+	if (test_and_clear_bit(flag, flags))
+		ret = sock_setsockopt(sock, SOL_SOCKET, option,
+				      (char *)&v, sizeof(v));
+
+	return ret;
+}
+
+
+static int sock_restore_flags(struct socket *sock, struct ckpt_hdr_socket *h)
+{
+	unsigned long sk_flags = h->sock.flags;
+	unsigned long sock_flags = h->socket.flags;
+	int ret;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(sk_flag_map); i++) {
+		int opt = sk_flag_map[i].opt;
+		int flag = sk_flag_map[i].flag;
+		ret = sock_restore_flag(sock, &sk_flags, flag, opt);
+		if (ret) {
+			ckpt_debug("Failed to set skopt %i: %i\n", opt, ret);
+			return ret;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(sock_flag_map); i++) {
+		int opt = sock_flag_map[i].opt;
+		int flag = sock_flag_map[i].flag;
+		ret = sock_restore_flag(sock, &sock_flags, flag, opt);
+		if (ret) {
+			ckpt_debug("Failed to set sockopt %i: %i\n", opt, ret);
+			return ret;
+		}
+	}
+
+	/* TODO: Handle SOCK_TIMESTAMPING_* flags */
+	if (test_bit(SOCK_TIMESTAMPING_TX_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_TX_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RX_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RX_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RAW_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_SYS_HARDWARE, &sk_flags)) {
+		ckpt_debug("SOF_TIMESTAMPING_* flags are not supported\n");
+		return -ENOSYS;
+	}
+
+	if (test_and_clear_bit(SOCK_DEAD, &sk_flags))
+		sock_set_flag(sock->sk, SOCK_DEAD);
+
+
+	/* Anything that is still set in the flags that isn't part of
+	 * our protocol's default set, indicates an error
+	 */
+	if (sk_flags & ~sock->sk->sk_flags) {
+		ckpt_debug("Unhandled sock flags: %lx\n", sk_flags);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int sock_copy_timeval(int op, struct sock *sk,
+			     int sockopt, __s64 *saved)
+{
+	struct timeval tv;
+
+	if (op == CKPT_CPT) {
+		if (CKPT_COPY_SOPT(op, sk, sockopt, &tv))
+			return -EINVAL;
+		*saved = timeval_to_ns(&tv);
+	} else {
+		tv = ns_to_timeval(*saved);
+		if (CKPT_COPY_SOPT(op, sk, sockopt, &tv))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int sock_cptrst(struct ckpt_ctx *ctx, struct sock *sk,
+		       struct ckpt_hdr_socket *h, int op)
+{
+	if (sk->sk_socket) {
+		CKPT_COPY(op, h->socket.state, sk->sk_socket->state);
+	}
+
+	CKPT_COPY(op, h->sock_common.bound_dev_if, sk->sk_bound_dev_if);
+	CKPT_COPY(op, h->sock_common.family, sk->sk_family);
+
+	CKPT_COPY(op, h->sock.shutdown, sk->sk_shutdown);
+	CKPT_COPY(op, h->sock.userlocks, sk->sk_userlocks);
+	CKPT_COPY(op, h->sock.no_check, sk->sk_no_check);
+	CKPT_COPY(op, h->sock.protocol, sk->sk_protocol);
+	CKPT_COPY(op, h->sock.err, sk->sk_err);
+	CKPT_COPY(op, h->sock.err_soft, sk->sk_err_soft);
+	CKPT_COPY(op, h->sock.type, sk->sk_type);
+	CKPT_COPY(op, h->sock.state, sk->sk_state);
+	CKPT_COPY(op, h->sock.backlog, sk->sk_max_ack_backlog);
+
+	if (sock_cptrst_bufopts(op, sk, h))
+		return -EINVAL;
+
+	if (CKPT_COPY_SOPT(op, sk, SO_REUSEADDR, &h->sock_common.reuse)) {
+		ckpt_debug("Failed to set SO_REUSEADDR");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_PRIORITY, &h->sock.priority)) {
+		ckpt_debug("Failed to set SO_PRIORITY");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_RCVLOWAT, &h->sock.rcvlowat)) {
+		ckpt_debug("Failed to set SO_RCVLOWAT");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_LINGER, &h->sock.linger)) {
+		ckpt_debug("Failed to set SO_LINGER");
+		return -EINVAL;
+	}
+
+	if (sock_copy_timeval(op, sk, SO_SNDTIMEO, &h->sock.sndtimeo)) {
+		ckpt_debug("Failed to set SO_SNDTIMEO");
+		return -EINVAL;
+	}
+
+	if (sock_copy_timeval(op, sk, SO_RCVTIMEO, &h->sock.rcvtimeo)) {
+		ckpt_debug("Failed to set SO_RCVTIMEO");
+		return -EINVAL;
+	}
+
+	if (op == CKPT_CPT) {
+		h->sock.flags = sk->sk_flags;
+		h->socket.flags = sk->sk_socket->flags;
+	} else {
+		int ret;
+		mm_segment_t old_fs;
+
+		old_fs = get_fs();
+		set_fs(KERNEL_DS);
+		ret = sock_restore_flags(sk->sk_socket, h);
+		set_fs(old_fs);
+		if (ret)
+			return ret;
+	}
+
+	if ((h->socket.state == SS_CONNECTED) &&
+	    (h->sock.state != TCP_ESTABLISHED)) {
+		ckpt_debug("socket/sock in inconsistent state: %i/%i",
+			   h->socket.state, h->sock.state);
+		return -EINVAL;
+	} else if ((h->sock.state < TCP_ESTABLISHED) ||
+		   (h->sock.state >= TCP_MAX_STATES)) {
+		ckpt_debug("sock in invalid state: %i", h->sock.state);
+		return -EINVAL;
+	} else if ((h->socket.state < SS_FREE) ||
+		   (h->socket.state > SS_DISCONNECTING)) {
+		ckpt_debug("socket in invalid state: %i",
+			   h->socket.state);
+		return -EINVAL;
+	}
+
+	if (op == CKPT_RST)
+		return sock_cptrst_verify(h);
+	else
+		return 0;
+}
+
+static int __do_sock_checkpoint(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct socket *sock = sk->sk_socket;
+	struct ckpt_hdr_socket *h;
+	int ret;
+
+	if (!sock->ops->checkpoint) {
+		ckpt_write_err(ctx, "TEVP", "socket: proto_ops",
+			       -ENOSYS, sock->ops, sock);
+		return -ENOSYS;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (!h)
+		return -ENOMEM;
+
+	/* part I: common to all sockets */
+	ret = sock_cptrst(ctx, sk, h, CKPT_CPT);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	/* part II: per socket type state */
+	ret = sock->ops->checkpoint(ctx, sock);
+	if (ret < 0)
+		goto out;
+
+	/* part III: socket buffers */
+	if ((sk->sk_state != TCP_LISTEN) && (!sock_flag(sk, SOCK_DEAD)))
+		ret = sock_defer_write_buffers(ctx, sk);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int do_sock_checkpoint(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct socket *sock;
+	int ret;
+
+	if (sk->sk_socket)
+		return __do_sock_checkpoint(ctx, sk);
+
+	/* Temporarily adopt this orphan socket */
+	ret = sock_create(sk->sk_family, sk->sk_type, 0, &sock);
+	if (ret < 0)
+		return ret;
+	sock_graft(sk, sock);
+
+	ret = __do_sock_checkpoint(ctx, sk);
+
+	sock_orphan(sk);
+	sock->sk = NULL;
+	sock_release(sock);
+
+	return ret;
+}
+
+int checkpoint_sock(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_sock_checkpoint(ctx, (struct sock *)ptr);
+}
+
+int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_socket *h;
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_SOCKET;
+
+	h->sock_objref = checkpoint_obj(ctx, sk, CKPT_OBJ_SOCK);
+	if (h->sock_objref < 0) {
+		ret = h->sock_objref;
+		goto out;
+	}
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int sock_collect_skbs(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
+{
+	struct sk_buff_head tmpq;
+	struct sk_buff *skb;
+	int ret = 0;
+	uint32_t bytes;
+
+	skb_queue_head_init(&tmpq);
+
+	ret = sock_copy_buffers(queue, &tmpq, &bytes);
+	if (ret < 0)
+		return ret;
+
+	skb_queue_walk(&tmpq, skb) {
+		/* Socket buffers do not maintain a ref count on their
+		 * owning sock because they're counted in sock_wmem_alloc.
+		 * So, we only need to collect sockets from the queue that
+		 * won't be collected any other way (i.e. DEAD sockets that
+		 * are hanging around only because they're waiting for us
+		 * to process their skb).
+		 */
+
+		if (!ckpt_obj_lookup(ctx, skb->sk, CKPT_OBJ_SOCK) &&
+		    sock_flag(skb->sk, SOCK_DEAD)) {
+			ret = ckpt_obj_collect(ctx, skb->sk, CKPT_OBJ_SOCK);
+			if (ret < 0)
+				break;
+		}
+	}
+
+	__skb_queue_purge(&tmpq);
+
+	return ret;
+}
+
+int sock_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	int ret;
+
+	ret = sock_collect_skbs(ctx, &sk->sk_write_queue);
+	if (ret < 0)
+		return ret;
+
+	ret = sock_collect_skbs(ctx, &sk->sk_receive_queue);
+	if (ret < 0)
+		return ret;
+
+	ret = ckpt_obj_collect(ctx, sk, CKPT_OBJ_SOCK);
+	if (ret < 0)
+		return ret;
+
+	if (sock->ops->collect)
+		ret = sock->ops->collect(ctx, sock);
+
+	return ret;
+}
+
+static struct file *sock_alloc_attach_fd(struct socket *sock)
+{
+	struct file *file;
+	int err;
+
+	file = get_empty_filp();
+	if (!file)
+		return ERR_PTR(-ENOMEM);
+
+	err = sock_attach_fd(sock, file, 0);
+	if (err < 0) {
+		put_filp(file);
+		return ERR_PTR(err);
+	}
+
+	/* Since objhash assumes the initial reference for a socket,
+	 * we bump it here for this descriptor, unlike other places in
+	 * the socket code which assume the descriptor is the owner.
+	 */
+	sock_hold(sock->sk);
+
+	return file;
+}
+
+struct sock *do_sock_restore(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_socket *h;
+	struct socket *sock;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	/* silently clear flags, e.g. SOCK_NONBLOCK or SOCK_CLOEXEC */
+	h->sock.type &= SOCK_TYPE_MASK;
+
+	ret = sock_create(h->sock_common.family, h->sock.type, 0, &sock);
+	if (ret < 0)
+		goto err;
+
+	if (!sock->ops->restore) {
+		ckpt_debug("proto_ops lacks restore: %pS\n", sock->ops);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	/*
+	 * part II: per socket type state
+	 * (also takes care of part III: socket buffer)
+	 */
+	ret = sock->ops->restore(ctx, sock, h);
+	if (ret < 0)
+		goto err;
+
+	/* part I: common to all sockets */
+	ret = sock_cptrst(ctx, sock->sk, h, CKPT_RST);
+	if (ret < 0)
+		goto err;
+
+	ckpt_hdr_put(ctx, h);
+	return sock->sk;
+ err:
+	ckpt_hdr_put(ctx, h);
+	sock_release(sock);
+	return ERR_PTR(ret);
+}
+
+void *restore_sock(struct ckpt_ctx *ctx)
+{
+	return do_sock_restore(ctx);
+}
+
+struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_socket *h = (struct ckpt_hdr_file_socket *)ptr;
+	struct sock *sk;
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE || ptr->f_type != CKPT_FILE_SOCKET)
+		return ERR_PTR(-EINVAL);
+
+	sk = ckpt_obj_fetch(ctx, h->sock_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk))
+		return ERR_PTR(PTR_ERR(sk));
+
+	file = sock_alloc_attach_fd(sk->sk_socket);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		return ERR_PTR(ret);
+	}
+
+	return file;
+}
diff --git a/net/socket.c b/net/socket.c
index 63c4498..0a4d539 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -140,6 +140,10 @@ static const struct file_operations socket_file_ops = {
 	.sendpage =	sock_sendpage,
 	.splice_write = generic_splice_sendpage,
 	.splice_read =	sock_splice_read,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint =   sock_file_checkpoint,
+	.collect = sock_file_collect,
+#endif
 };
 
 /*
@@ -368,7 +372,7 @@ static int sock_alloc_fd(struct file **filep, int flags)
 	return fd;
 }
 
-static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
+int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
diff --git a/net/unix/Makefile b/net/unix/Makefile
index b852a2b..fbff1e6 100644
--- a/net/unix/Makefile
+++ b/net/unix/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_UNIX)	+= unix.o
 
 unix-y			:= af_unix.o garbage.o
 unix-$(CONFIG_SYSCTL)	+= sysctl_net_unix.o
+unix-$(CONFIG_CHECKPOINT) += checkpoint.o
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index fc3ebb9..b3d4f16 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -523,6 +523,9 @@ static const struct proto_ops unix_stream_ops = {
 	.recvmsg =	unix_stream_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static const struct proto_ops unix_dgram_ops = {
@@ -544,6 +547,9 @@ static const struct proto_ops unix_dgram_ops = {
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static const struct proto_ops unix_seqpacket_ops = {
@@ -565,6 +571,9 @@ static const struct proto_ops unix_seqpacket_ops = {
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static struct proto unix_proto = {
diff --git a/net/unix/checkpoint.c b/net/unix/checkpoint.c
new file mode 100644
index 0000000..8b7cb22
--- /dev/null
+++ b/net/unix/checkpoint.c
@@ -0,0 +1,634 @@
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/fs_struct.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/user.h>
+#include <net/af_unix.h>
+#include <net/tcp_states.h>
+
+struct dq_join {
+	struct ckpt_ctx *ctx;
+	int src_objref;
+	int dst_objref;
+};
+
+struct dq_buffers {
+	struct ckpt_ctx *ctx;
+	int sk_objref; /* objref of the socket these buffers belong to */
+};
+
+#define UNIX_ADDR_EMPTY(a) (a <= sizeof(short))
+
+static inline int unix_need_cwd(struct sockaddr_un *addr, unsigned long len)
+{
+	return (!UNIX_ADDR_EMPTY(len)) &&
+		addr->sun_path[0] &&
+		(addr->sun_path[0] != '/');
+}
+
+static int unix_join(struct sock *src, struct sock *dst)
+{
+	if (unix_sk(src)->peer != NULL)
+		return 0; /* We're second */
+
+	sock_hold(dst);
+	unix_sk(src)->peer = dst;
+
+	return 0;
+
+}
+
+static int unix_deferred_join(void *data)
+{
+	struct dq_join *dq = (struct dq_join *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	struct sock *src;
+	struct sock *dst;
+
+	src = ckpt_obj_fetch(ctx, dq->src_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(src)) {
+		ckpt_debug("Missing src sock ref %i\n", dq->src_objref);
+		return PTR_ERR(src);
+	}
+
+	dst = ckpt_obj_fetch(ctx, dq->dst_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(dst)) {
+		ckpt_debug("Missing dst sock ref %i\n", dq->dst_objref);
+		return PTR_ERR(dst);
+	}
+
+	return unix_join(src, dst);
+}
+
+static int unix_defer_join(struct ckpt_ctx *ctx,
+			   int src_objref,
+			   int dst_objref)
+{
+	struct dq_join dq;
+
+	dq.ctx = ctx;
+	dq.src_objref = src_objref;
+	dq.dst_objref = dst_objref;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      unix_deferred_join, NULL);
+}
+
+static int unix_write_cwd(struct ckpt_ctx *ctx,
+			  struct sock *sk, const char *sockpath)
+{
+	struct path path;
+	char *buf;
+	char *fqpath;
+	int offset;
+	int len = PATH_MAX;
+	int ret = -ENOENT;
+
+	buf = kmalloc(len, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	path.dentry = unix_sk(sk)->dentry;
+	path.mnt = unix_sk(sk)->mnt;
+
+	fqpath = ckpt_fill_fname(&path, &ctx->fs_mnt, buf, &len);
+	if (IS_ERR(fqpath)) {
+		ret = PTR_ERR(fqpath);
+		goto out;
+	}
+
+	offset = strlen(fqpath) - strlen(sockpath);
+	if (offset <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	fqpath[offset] = '\0';
+
+	ckpt_debug("writing socket directory: %s\n", fqpath);
+	ret = ckpt_write_string(ctx, fqpath, offset + 1);
+ out:
+	kfree(buf);
+	return ret;
+}
+
+int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock)
+{
+	struct unix_sock *sk = unix_sk(sock->sk);
+	struct ckpt_hdr_socket_unix *un;
+	int new;
+	int ret = -ENOMEM;
+
+	if ((sock->sk->sk_state == TCP_LISTEN) &&
+	    !skb_queue_empty(&sock->sk->sk_receive_queue)) {
+		ckpt_write_err(ctx, "TEP", "af_unix: listen with pending peers",
+			       -EBUSY, sock);
+		return -EBUSY;
+	}
+
+	un = ckpt_hdr_get_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (!un)
+		return -ENOMEM;
+
+	ret = ckpt_sock_getnames(ctx, sock,
+				 (struct sockaddr *)&un->laddr, &un->laddr_len,
+				 (struct sockaddr *)&un->raddr, &un->raddr_len);
+	if (ret)
+		goto out;
+
+	if (sk->dentry && (sk->dentry->d_inode->i_nlink > 0))
+		un->flags |= CKPT_UNIX_LINKED;
+
+	ret = un->this = ckpt_obj_lookup_add(ctx, sk, CKPT_OBJ_SOCK, &new);
+	if (ret < 0)
+		goto out;
+
+	if (sk->peer)
+		un->peer = checkpoint_obj(ctx, sk->peer, CKPT_OBJ_SOCK);
+	else
+		un->peer = 0;
+
+	if (un->peer < 0) {
+		ret = un->peer;
+		goto out;
+	}
+
+	un->peercred_uid = sock->sk->sk_peercred.uid;
+	un->peercred_gid = sock->sk->sk_peercred.gid;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) un);
+	if (ret < 0)
+		goto out;
+
+	if (unix_need_cwd(&un->laddr, un->laddr_len))
+		ret = unix_write_cwd(ctx, sock->sk, un->laddr.sun_path);
+ out:
+	ckpt_hdr_put(ctx, un);
+
+	return ret;
+}
+
+int unix_collect(struct ckpt_ctx *ctx, struct socket *sock)
+{
+	struct unix_sock *sk = unix_sk(sock->sk);
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, sock->sk, CKPT_OBJ_SOCK);
+	if (ret < 0)
+		return ret;
+
+	if (sk->peer)
+		ret = ckpt_obj_collect(ctx, sk->peer, CKPT_OBJ_SOCK);
+
+	return ret < 0 ? ret : 0;
+}
+
+static int sock_read_buffer_sendmsg(struct ckpt_ctx *ctx,
+				    struct sockaddr *addr,
+				    unsigned int addrlen)
+{
+	struct ckpt_hdr_socket_buffer *h;
+	struct sock *sk;
+	struct msghdr msg;
+	struct kvec kvec;
+	uint8_t sock_shutdown;
+	uint8_t peer_shutdown = 0;
+	void *buf = NULL;
+	int sndbuf;
+	int len;
+	int ret = 0;
+
+	memset(&msg, 0, sizeof(msg));
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_BUFFER);
+	if (len < 0) {
+		ret = len;
+		goto out;
+	} else if (len > SKB_MAX_ALLOC) {
+		ckpt_debug("Socket buffer too big (%i > %lu)",
+			   len, SKB_MAX_ALLOC);
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	sk = ckpt_obj_fetch(ctx, h->sk_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk)) {
+		ret = PTR_ERR(sk);
+		goto out;
+	}
+
+	/* If we have neither a destination address nor a peer, the
+	 * buffer header identifies the peer (pr_objref), so join with
+	 * it before sending
+	 */
+	if (!addrlen && !unix_sk(sk)->peer) {
+		struct sock *pr;
+		pr = ckpt_obj_fetch(ctx, h->pr_objref, CKPT_OBJ_SOCK);
+		if (IS_ERR(pr)) {
+			ckpt_debug("Failed to get our peer: %li\n", PTR_ERR(pr));
+			ret = PTR_ERR(pr);
+			goto out;
+		}
+		ret = unix_join(sk, pr);
+		if (ret < 0) {
+			ckpt_debug("Failed to join: %i\n", ret);
+			goto out;
+		}
+	}
+
+	kvec.iov_len = len;
+	buf = kmalloc(len, GFP_KERNEL);
+	kvec.iov_base = buf;
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = ckpt_kread(ctx, kvec.iov_base, len);
+	if (ret < 0)
+		goto out;
+
+	msg.msg_name = addr;
+	msg.msg_namelen = addrlen;
+
+	/* If peer is shutdown, unshutdown it for this process */
+	sock_shutdown = sk->sk_shutdown;
+	sk->sk_shutdown &= ~SHUTDOWN_MASK;
+
+	/* Unshutdown peer too, if necessary */
+	if (unix_sk(sk)->peer) {
+		peer_shutdown = unix_sk(sk)->peer->sk_shutdown;
+		unix_sk(sk)->peer->sk_shutdown &= ~SHUTDOWN_MASK;
+	}
+
+	/* Make sure there's room in the send buffer */
+	sndbuf = sk->sk_sndbuf;
+	if (((sk->sk_sndbuf - atomic_read(&sk->sk_wmem_alloc)) < len) &&
+	    capable(CAP_NET_ADMIN))
+		sk->sk_sndbuf += len;
+	else
+		sk->sk_sndbuf = sysctl_wmem_max;
+
+	ret = kernel_sendmsg(sk->sk_socket, &msg, &kvec, 1, len);
+	ckpt_debug("kernel_sendmsg(%i,%i): %i\n", h->sk_objref, len, ret);
+	if ((ret > 0) && (ret != len))
+		ret = -ENOMEM;
+
+	sk->sk_sndbuf = sndbuf;
+	sk->sk_shutdown = sock_shutdown;
+	if (peer_shutdown)
+		unix_sk(sk)->peer->sk_shutdown = peer_shutdown;
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(buf);
+	return ret;
+}
+
+static int unix_read_buffers(struct ckpt_ctx *ctx,
+			     struct sockaddr *addr,
+			     unsigned int addrlen)
+{
+	struct ckpt_hdr_socket_queue *h;
+	int ret = 0;
+	int i;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	for (i = 0; i < h->skb_count; i++) {
+		ret = sock_read_buffer_sendmsg(ctx, addr, addrlen);
+		ckpt_debug("read_buffer_sendmsg(%i): %i\n", i, ret);
+		if (ret < 0)
+			goto out;
+
+		if (ret > h->total_bytes) {
+			ckpt_debug("Buffers exceeded claim");
+			ret = -EINVAL;
+			goto out;
+		}
+
+		h->total_bytes -= ret;
+		ret = 0;
+	}
+
+	ret = h->skb_count;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int unix_deferred_restore_buffers(void *data)
+{
+	struct dq_buffers *dq = (struct dq_buffers *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	struct sock *sk;
+	struct sockaddr *addr = NULL;
+	unsigned int addrlen = 0;
+	int ret;
+
+	sk = ckpt_obj_fetch(ctx, dq->sk_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk)) {
+		ckpt_debug("Missing sock ref %i\n", dq->sk_objref);
+		return PTR_ERR(sk);
+	}
+
+	if ((sk->sk_type == SOCK_DGRAM) && (unix_sk(sk)->addr != NULL)) {
+		addr = (struct sockaddr *)&unix_sk(sk)->addr->name;
+		addrlen = unix_sk(sk)->addr->len;
+	}
+
+	ret = unix_read_buffers(ctx, addr, addrlen);
+	ckpt_debug("read recv buffers: %i\n", ret);
+	if (ret < 0)
+		return ret;
+
+	ret = unix_read_buffers(ctx, addr, addrlen);
+	ckpt_debug("read send buffers: %i\n", ret);
+	if (ret > 0)
+		ret = -EINVAL; /* No send buffers for UNIX sockets */
+
+	return ret;
+}
+
+static int unix_defer_restore_buffers(struct ckpt_ctx *ctx, int sk_objref)
+{
+	struct dq_buffers dq;
+
+	dq.ctx = ctx;
+	dq.sk_objref = sk_objref;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      unix_deferred_restore_buffers, NULL);
+}
+
+static struct unix_address *unix_makeaddr(struct sockaddr_un *sun_addr,
+					  unsigned len)
+{
+	struct unix_address *addr;
+
+	if (len > sizeof(struct sockaddr_un))
+		return ERR_PTR(-EINVAL);
+
+	addr = kmalloc(sizeof(*addr) + len, GFP_KERNEL);
+	if (!addr)
+		return ERR_PTR(-ENOMEM);
+
+	memcpy(addr->name, sun_addr, len);
+	addr->len = len;
+	atomic_set(&addr->refcnt, 1);
+
+	return addr;
+}
+
+static int unix_restore_connected(struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_socket *h,
+				  struct ckpt_hdr_socket_unix *un,
+				  struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct sockaddr *addr = NULL;
+	unsigned long flags = h->sock.flags;
+	unsigned int addrlen = 0;
+	int dead = test_bit(SOCK_DEAD, &flags);
+	int ret = 0;
+
+
+	if (un->peer == 0) {
+		/* These get propagated to the msghdr, so only set them
+		 * if we're not connected to a peer, else we'll get an error
+		 * when we sendmsg()
+		 */
+		addr = (struct sockaddr *)&un->laddr;
+		addrlen = un->laddr_len;
+	}
+
+	sk->sk_peercred.pid = task_tgid_vnr(current);
+
+	if (may_setuid(ctx->realcred->user->user_ns, un->peercred_uid) &&
+	    may_setgid(un->peercred_gid)) {
+		sk->sk_peercred.uid = un->peercred_uid;
+		sk->sk_peercred.gid = un->peercred_gid;
+	} else {
+		ckpt_debug("peercred %i:%i would require setuid",
+			   un->peercred_uid, un->peercred_gid);
+		return -EPERM;
+	}
+
+	if (!dead && (un->peer > 0)) {
+		ret = unix_defer_join(ctx, un->this, un->peer);
+		ckpt_debug("unix_defer_join: %i\n", ret);
+	}
+
+	if (!dead && !ret)
+		ret = unix_defer_restore_buffers(ctx, un->this);
+
+	return ret;
+}
+
+static int unix_unlink(const char *name)
+{
+	struct path spath;
+	struct path ppath;
+	int ret;
+
+	ret = kern_path(name, 0, &spath);
+	if (ret)
+		return ret;
+
+	ret = kern_path(name, LOOKUP_PARENT, &ppath);
+	if (ret)
+		goto out_s;
+
+	if (!spath.dentry) {
+		ckpt_debug("No dentry found for %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	if (!ppath.dentry || !ppath.dentry->d_inode) {
+		ckpt_debug("No inode for parent of %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	ret = vfs_unlink(ppath.dentry->d_inode, spath.dentry);
+ out_p:
+	path_put(&ppath);
+ out_s:
+	path_put(&spath);
+
+	return ret;
+}
+
+/* Call bind() for socket, optionally changing (temporarily) to @path first
+ * if non-NULL
+ */
+static int unix_chdir_and_bind(struct socket *sock,
+			       const char *path,
+			       struct sockaddr *addr,
+			       unsigned long addrlen)
+{
+	struct sockaddr_un *un = (struct sockaddr_un *)addr;
+	struct path cur = { .mnt = NULL, .dentry = NULL };
+	struct path dir = { .mnt = NULL, .dentry = NULL };
+	int ret;
+
+	if (path) {
+		ckpt_debug("switching to cwd %s for unix bind", path);
+
+		ret = kern_path(path, 0, &dir);
+		if (ret)
+			return ret;
+
+		ret = inode_permission(dir.dentry->d_inode,
+				       MAY_EXEC | MAY_ACCESS);
+		if (ret)
+			goto out;
+
+		write_lock(&current->fs->lock);
+		cur = current->fs->pwd;
+		current->fs->pwd = dir;
+		write_unlock(&current->fs->lock);
+	}
+
+	ret = unix_unlink(un->sun_path);
+	ckpt_debug("unlink(%s): %i\n", un->sun_path, ret);
+	if ((ret == 0) || (ret == -ENOENT))
+		ret = sock_bind(sock, addr, addrlen);
+
+	if (path) {
+		write_lock(&current->fs->lock);
+		current->fs->pwd = cur;
+		write_unlock(&current->fs->lock);
+	}
+ out:
+	if (path)
+		path_put(&dir);
+
+	return ret;
+}
+
+static int unix_fakebind(struct socket *sock,
+			 struct sockaddr_un *addr, unsigned long len)
+{
+	struct unix_address *uaddr;
+
+	uaddr = unix_makeaddr(addr, len);
+	if (IS_ERR(uaddr))
+		return PTR_ERR(uaddr);
+
+	unix_sk(sock->sk)->addr = uaddr;
+
+	return 0;
+}
+
+static int unix_restore_bind(struct ckpt_hdr_socket *h,
+			     struct ckpt_hdr_socket_unix *un,
+			     struct socket *sock,
+			     const char *path)
+{
+	struct sockaddr *addr = (struct sockaddr *)&un->laddr;
+	unsigned long len = un->laddr_len;
+	unsigned long flags = h->sock.flags;
+	int dead = test_bit(SOCK_DEAD, &flags);
+
+	if (dead)
+		return unix_fakebind(sock, &un->laddr, len);
+	else if (!un->laddr.sun_path[0])
+		return sock_bind(sock, addr, len);
+	else if (!(un->flags & CKPT_UNIX_LINKED))
+		return unix_fakebind(sock, &un->laddr, len);
+	else
+		return unix_chdir_and_bind(sock, path, addr, len);
+}
+
+/* Some easy pre-flight checks before we get underway */
+static int unix_precheck(struct socket *sock, struct ckpt_hdr_socket *h)
+{
+	struct net *net = sock_net(sock->sk);
+	unsigned long sk_flags = h->sock.flags;
+
+	if ((h->socket.state == SS_CONNECTING) ||
+	    (h->socket.state == SS_DISCONNECTING) ||
+	    (h->socket.state == SS_FREE)) {
+		ckpt_debug("AF_UNIX socket can't be SS_(DIS)CONNECTING");
+		return -EINVAL;
+	}
+
+	/* AF_UNIX overloads the backlog setting to define the maximum
+	 * queue length for DGRAM sockets.  Make sure we don't let the
+	 * caller exceed that value on restart.
+	 */
+	if ((h->sock.type == SOCK_DGRAM) &&
+	    (h->sock.backlog > net->unx.sysctl_max_dgram_qlen)) {
+		ckpt_debug("DGRAM backlog of %i exceeds system max of %i\n",
+			   h->sock.backlog, net->unx.sysctl_max_dgram_qlen);
+		return -EINVAL;
+	}
+
+	if (test_bit(SOCK_USE_WRITE_QUEUE, &sk_flags)) {
+		ckpt_debug("AF_UNIX socket has SOCK_USE_WRITE_QUEUE set");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int unix_restore(struct ckpt_ctx *ctx, struct socket *sock,
+		 struct ckpt_hdr_socket *h)
+
+{
+	struct ckpt_hdr_socket_unix *un;
+	int ret = -EINVAL;
+	char *cwd = NULL;
+
+	ret = unix_precheck(sock, h);
+	if (ret)
+		return ret;
+
+	un = ckpt_read_obj_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (IS_ERR(un))
+		return PTR_ERR(un);
+	ret = -EINVAL;
+	if (un->peer < 0)
+		goto out;
+
+	if (unix_need_cwd(&un->laddr, un->laddr_len)) {
+		cwd = ckpt_read_string(ctx, PATH_MAX);
+		if (IS_ERR(cwd)) {
+			ret = PTR_ERR(cwd);
+			goto out;
+		}
+	}
+
+	if ((h->sock.state != TCP_ESTABLISHED) &&
+	    !UNIX_ADDR_EMPTY(un->laddr_len)) {
+		ret = unix_restore_bind(h, un, sock, cwd);
+		if (ret)
+			goto out;
+	}
+
+	if ((h->sock.state == TCP_ESTABLISHED) || (h->sock.state == TCP_CLOSE))
+		ret = unix_restore_connected(ctx, h, un, sock);
+	else if (h->sock.state == TCP_LISTEN)
+		ret = sock->ops->listen(sock, h->sock.backlog);
+	else
+		ckpt_debug("unsupported UNIX socket state %i\n", h->sock.state);
+ out:
+	ckpt_hdr_put(ctx, un);
+	kfree(cwd);
+	return ret;
+}
-- 
1.6.0.4


* [PATCH v18 70/80] c/r: Add AF_UNIX support (v12)
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Dan Smith,
	Alexey Dobriyan, netdev, Oren Laadan

From: Dan Smith <danms@us.ibm.com>

This patch adds basic checkpoint/restart support for AF_UNIX sockets.  It
has been tested with single and multiple processes, and with data in flight
at the time of checkpoint.  It supports socketpair()s, path-based, and
abstract sockets.
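
As a rough illustration of those three cases (not part of the patch): the
code tells unnamed, abstract, and path-based addresses apart by the address
length and the first byte of sun_path, which is what UNIX_ADDR_EMPTY() and
unix_need_cwd() test in net/unix/checkpoint.c.  The userspace sketch below
mirrors those tests; the enum and classify_unix_addr() are hypothetical
names used only here.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical labels for illustration; the patch defines no such enum. */
enum unix_addr_kind {
	UNIX_ADDR_UNNAMED,	/* e.g. socketpair(): only sun_family present */
	UNIX_ADDR_ABSTRACT,	/* sun_path[0] == '\0' */
	UNIX_ADDR_ABSOLUTE,	/* sun_path starts with '/' */
	UNIX_ADDR_RELATIVE,	/* relative path: restore needs the saved cwd */
};

static enum unix_addr_kind classify_unix_addr(const struct sockaddr_un *addr,
					      size_t len)
{
	/* Same length test as UNIX_ADDR_EMPTY() in the patch */
	if (len <= sizeof(short))
		return UNIX_ADDR_UNNAMED;
	if (addr->sun_path[0] == '\0')
		return UNIX_ADDR_ABSTRACT;
	if (addr->sun_path[0] == '/')
		return UNIX_ADDR_ABSOLUTE;
	/* The only case in which unix_need_cwd() returns true */
	return UNIX_ADDR_RELATIVE;
}
```

Only the last case makes checkpoint write an extra directory string after
the AF_UNIX header.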

Changes in v12:
  - Collect sockets for leak-detection
  - Adjust socket reference count during leak detection phase

Changes in v11:
  - Create a struct socket for orphan socket during checkpoint
  - Make sockets proper objhash objects and use checkpoint_obj() on them
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - Remove struct timeval from socket header
  - Save and restore UNIX socket peer credentials
  - Set socket flags on restore using sock_setsockopt() where possible
  - Fail on the TIMESTAMPING_* flags for the moment (with a TODO)
  - Remove other explicit flag checks that are no longer copied blindly
  - Changed functions/variables names to follow existing conventions
  - Use proto_ops->{checkpoint,restore} methods for af_unix
  - Cleanup sock_file_restore()/sock_file_checkpoint()
  - Make ckpt_hdr_socket be part of ckpt_hdr_file_socket
  - Fold do_sock_file_checkpoint() into sock_file_checkpoint()
  - Fold do_sock_file_restore() into sock_file_restore()
  - Move sock_file_{checkpoint,restore} to net/checkpoint.c
  - Properly define sock_file_{checkpoint,restore} in header file
  - sock_file_restore() now calls restore_file_common()

Changes in v10:
  - Moved header structure definitions back to checkpoint_hdr.h
  - Moved AF_UNIX checkpoint/restart code to net/unix/checkpoint.c
  - Make sock_unix_*() functions only compile if CONFIG_UNIX=y
  - Add TODO for CONFIG_UNIX=m case

Changes in v9:
  - Fix double-free of skb's in the list and target holding queue in the
    error path of sock_copy_buffers()
  - Adjust use of ckpt_read_string() to match new signature

Changes in v8:
  - Fix stale dev_alloc_skb() from before the conversion to skb_clone()
  - Fix a couple of broken error paths
  - Fix memory leak of kvec.iov_base on successful return from sendmsg()
  - Fix condition for deciding when to run sock_cptrst_verify()
  - Fix buffer queue copy algorithm to hold the lock during walk(s)
  - Log the errno when either getname() or getpeer() fails
  - Add comments about ancillary messages in the UNIX queue
  - Add TODO comments for credential restore and flags via setsockopt()
  - Add TODO comment about strangely-connected dgram sockets and the use
    of sendmsg(peer)

Changes in v7:
  - Fix failure to free iov_base in error path of sock_read_buffer()
  - Change sock_read_buffer() to use _ckpt_read_obj_type() to get the
    header length and then use ckpt_kread() directly to read the payload
  - Change sock_read_buffers() to sock_unix_read_buffers() and break out
    some common functionality to better accommodate the subsequent INET
    patch
  - Generalize sock_unix_getnames() into sock_getnames() so INET can use it
  - Change skb_morph() to skb_clone() which uses the more common path and
    still avoids the copy
  - Add check to validate the socket type before creating socket
    on restore
  - Comment the CAP_NET_ADMIN override in sock_read_buffer_hdr
  - Strengthen the comment about priming the buffer limits
  - Change the objhash functions to deny direct checkpoint of sockets and
    remove the reference counting function
  - Change SOCKET_BUFFERS to SOCKET_QUEUE
  - Change this,peer objrefs to signed integers
  - Remove names from internal socket structures
  - Fix handling of sock_copy_buffers() result
  - Use ckpt_fill_fname() instead of d_path() for writing CWD
  - Use sock_getname() and sock_getpeer() for proper security hookage
  - Return -ENOSYS for unsupported socket families in checkpoint and restart
  - Use sock_setsockopt() and sock_getsockopt() where possible to save and
    restore socket option values
  - Check for SOCK_DESTROY flag in the global verify function because none
    of our supported socket types use it
  - Check for SOCK_USE_WRITE_QUEUE in AF_UNIX restore function because
    that flag should not be used on such a socket
  - Check socket state in UNIX restart path to validate the subset of valid
    values

Changes in v6:
  - Moved the socket addresses to the per-type header
  - Eliminated the HASCWD flag
  - Remove use of ckpt_write_err() in restart paths
  - Change the order in which buffers are read so that we can set the
    socket's limit equal to the size of the image's buffers (if appropriate)
    and then restore the original values afterwards.
  - Use the ckpt_validate_errno() helper
  - Add a check to make sure that we didn't restore a (UNIX) socket with
    any skb's in the send buffer
  - Fix up sock_unix_join() to not leave addr uninitialized for socketpair
  - Remove inclusion of checkpoint_hdr.h in the socket files
  - Make sock_unix_write_cwd() use ckpt_write_string() and use the new
    ckpt_read_string() for reading the cwd
  - Use the restored realcred credentials in sock_unix_join()
  - Fix error path of the chdir_and_bind
  - Change the algorithm for reloading the socket buffers to use sendmsg()
    on the socket's peer for better accounting
  - For DGRAM sockets, check the backlog value against the system max
    to avoid letting a restart bypass the overloaded queue length
  - Use sock_bind() instead of sock->ops->bind() to gain the security hook
  - Change "restart" to "restore" in some of the function names

Changes in v5:
  - Change laddr and raddr buffers in socket header to be long enough
    for INET6 addresses
  - Place socket.c and sock.h function definitions inside #ifdef
    CONFIG_CHECKPOINT
  - Add explicit check in sock_unix_makeaddr() to refuse if the
    checkpoint image specifies an addr length of 0
  - Split sock_unix_restart() into a few pieces to facilitate:
  - Changed behavior of the unix restore code so that unlinked LISTEN
    sockets don't do a bind()...unlink()
  - Save the base path of a bound socket's path so that we can chdir()
    to the base before bind() if it is a relative path
  - Call bind() for any socket that is not established but has a
    non-zero-length local address
  - Enforce the current sysctl limit on socket buffer size during restart
    unless the user holds CAP_NET_ADMIN
  - Unlink a path-based socket before calling bind()
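
To illustrate the base-path item above: unix_write_cwd() obtains the
directory by cutting the socket's relative sun_path off the end of the
fully-qualified path.  A minimal userspace sketch of that step follows;
socket_dir_prefix() is a hypothetical name, and the explicit suffix
comparison is an extra safety check that the patch itself does not perform.

```c
#include <errno.h>
#include <string.h>

/*
 * Truncate @fqpath in place so that only the directory prefix remains.
 * Returns the prefix length, or -EINVAL if @sockpath is empty or is not a
 * proper suffix of @fqpath.
 */
static int socket_dir_prefix(char *fqpath, const char *sockpath)
{
	size_t full = strlen(fqpath);
	size_t rel = strlen(sockpath);

	/* Corresponds to the "offset <= 0" rejection in unix_write_cwd() */
	if (rel == 0 || rel >= full)
		return -EINVAL;

	/* Assumption beyond the patch: verify the tail really matches */
	if (strcmp(fqpath + full - rel, sockpath) != 0)
		return -EINVAL;

	fqpath[full - rel] = '\0';
	return (int)(full - rel);
}
```

For example, "/var/run/app/ctl.sock" with a sun_path of "ctl.sock" leaves
"/var/run/app/", the directory that restore switches to before bind().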

Changes in v4:
  - Changed the signedness of rcvlowat, rcvtimeo, sndtimeo, and backlog
    to match their struct sock definitions.  This should avoid issues
    with sign extension.
  - Add a sock_cptrst_verify() function to be run at restore time to
    validate several of the values in the checkpoint image against
    limits, flag masks, etc.
  - Write an error string with ckpt_write_err() in the obscure cases
  - Don't write socket buffers for listen sockets
  - Sanity check address lengths before we agree to allocate memory
  - Check the result of inserting the peer object in the objhash on
    restart
  - Check return value of sock_cptrst() on restart
  - Change logic in remote getname() phase of checkpoint to not fail for
    closed (et al) sockets
  - Eliminate the memory copy while reading socket buffers on restart

Changes in v3:
  - Move sock_file_checkpoint() above sock_file_restore()
  - Change __sock_file_*() functions to do_sock_file_*()
  - Adjust some of the struct cr_hdr_socket alignment
  - Improve the sock_copy_buffers() algorithm to avoid locking the source
    queue for the entire operation
  - Fix alignment in the socket header struct(s)
  - Move the per-protocol structure (ckpt_hdr_socket_un) out of the
    common socket header and read/write it separately
  - Fix missing call to sock_cptrst() in restore path
  - Break out the socket joining into another function
  - Fix failure to restore the socket address thus fixing getname()
  - Check the state values on restart
  - Fix case of state being TCP_CLOSE, which allows dgram sockets to be
    properly connected (if appropriate) to their peer and maintain the
    sockaddr for getname() operation
  - Fix restoring a listening socket that has been unlink()'d
  - Fix checkpointing sockets with an in-flight FD-passing SKB.  Fail
    with EBUSY.
  - Fix checkpointing listening sockets with an unaccepted connection.
    Fail with EBUSY.
  - Changed 'un' to 'unix' in function and structure names
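
The handling of unlinked and dead sockets mentioned above comes down to a
small decision in unix_restore_bind().  The sketch below restates that
decision in userspace C so the order of the tests is easy to see; the
struct, the enum, and choose_bind_action() are hypothetical stand-ins for
the checkpoint headers and are not taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical outcome labels; the patch calls the helpers directly. */
enum bind_action {
	BIND_FAKE,	/* unix_fakebind(): record the name, touch no fs */
	BIND_PLAIN,	/* sock_bind(): abstract name, no path involved */
	BIND_CHDIR,	/* unix_chdir_and_bind(): unlink, then bind by path */
};

/* Hypothetical summary of the fields the decision depends on. */
struct saved_unix_sock {
	bool dead;		/* SOCK_DEAD was set at checkpoint */
	bool linked;		/* CKPT_UNIX_LINKED was set at checkpoint */
	char first_path_byte;	/* laddr.sun_path[0] */
};

static enum bind_action choose_bind_action(const struct saved_unix_sock *s)
{
	if (s->dead)
		return BIND_FAKE;
	if (s->first_path_byte == '\0')
		return BIND_PLAIN;
	if (!s->linked)
		return BIND_FAKE;
	return BIND_CHDIR;
}
```

A socket whose path was unlink()'d before the checkpoint therefore gets its
address back for getname() without a new filesystem entry being created.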

Changes in v2:
  - Change GFP_KERNEL to GFP_ATOMIC in sock_copy_buffers() (this seems
    to be rather common in other uses of skb_copy())
  - Move the ckpt_hdr_socket structure definition to linux/socket.h
  - Fix whitespace issue
  - Move sock_file_checkpoint() to net/socket.c for symmetry

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: netdev@vger.kernel.org
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/files.c             |    7 +
 checkpoint/objhash.c           |   69 ++++
 include/linux/checkpoint.h     |    7 +
 include/linux/checkpoint_hdr.h |   87 +++++
 include/linux/net.h            |    2 +
 include/net/af_unix.h          |   14 +
 include/net/sock.h             |   12 +
 net/Makefile                   |    2 +
 net/checkpoint.c               |  752 ++++++++++++++++++++++++++++++++++++++++
 net/socket.c                   |    6 +-
 net/unix/Makefile              |    1 +
 net/unix/af_unix.c             |    9 +
 net/unix/checkpoint.c          |  634 +++++++++++++++++++++++++++++++++
 13 files changed, 1601 insertions(+), 1 deletions(-)
 create mode 100644 net/checkpoint.c
 create mode 100644 net/unix/checkpoint.c

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 1de89d6..058bc0e 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -22,6 +22,7 @@
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <net/sock.h>
 
 
 /**************************************************************************
@@ -591,6 +592,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_FIFO,
 		.restore = fifo_file_restore,
 	},
+	/* socket */
+	{
+		.file_name = "SOCKET",
+		.file_type = CKPT_FILE_SOCKET,
+		.restore = sock_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index bf2f761..0978060 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -20,6 +20,7 @@
 #include <linux/user_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <net/sock.h>
 
 struct ckpt_obj;
 struct ckpt_obj_ops;
@@ -234,6 +235,40 @@ static void obj_groupinfo_drop(void *ptr, int lastref)
 	put_group_info((struct group_info *) ptr);
 }
 
+static int obj_sock_grab(void *ptr)
+{
+	sock_hold((struct sock *) ptr);
+	return 0;
+}
+
+static void obj_sock_drop(void *ptr, int lastref)
+{
+	struct sock *sk = (struct sock *) ptr;
+
+	/*
+	 * Sockets created during restart are graft()ed, i.e. have a
+	 * valid @sk->sk_socket. Because only an fput() results in the
+	 * necessary sock_release(), we may leak the struct socket of
+	 * sockets that were not attached to a file. Therefore, if
+	 * @lastref is set, we hereby invoke sock_release() on sockets
+	 * that we have put into the objhash but were never attached
+	 * to a file.
+	 */
+	if (lastref && sk->sk_socket && !sk->sk_socket->file) {
+		struct socket *sock = sk->sk_socket;
+		sock_orphan(sk);
+		sock->sk = NULL;
+		sock_release(sock);
+	}
+
+	sock_put((struct sock *) ptr);
+}
+
+static int obj_sock_users(void *ptr)
+{
+	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -362,6 +397,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_groupinfo,
 		.restore = restore_groupinfo,
 	},
+	/* sock object */
+	{
+		.obj_name = "SOCKET",
+		.obj_type = CKPT_OBJ_SOCK,
+		.ref_drop = obj_sock_drop,
+		.ref_grab = obj_sock_grab,
+		.ref_users = obj_sock_users,
+		.checkpoint = checkpoint_sock,
+		.restore = restore_sock,
+	},
 };
 
 
@@ -751,6 +796,26 @@ static void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment)
  */
 
 /**
+ * obj_sock_adjust_users - remove implicit reference on DEAD sockets
+ * @obj: CKPT_OBJ_SOCK object to adjust
+ *
+ * Sockets that have been disconnected from their struct file have
+ * a reference count one less than normal sockets.  The objhash's
+ * assumption of such a reference is therefore incorrect, so we correct
+ * it here.
+ */
+static inline void obj_sock_adjust_users(struct ckpt_obj *obj)
+{
+	struct sock *sk = (struct sock *)obj->ptr;
+
+	if (sock_flag(sk, SOCK_DEAD)) {
+		obj->users--;
+		ckpt_debug("Adjusting SOCK %i count to %i\n",
+			   obj->objref, obj->users);
+	}
+}
+
+/**
  * ckpt_obj_contained - test if shared objects are contained in checkpoint
  * @ctx: checkpoint context
  *
@@ -773,6 +838,10 @@ int ckpt_obj_contained(struct ckpt_ctx *ctx)
 	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
 		if (!obj->ops->ref_users)
 			continue;
+
+		if (obj->ops->obj_type == CKPT_OBJ_SOCK)
+			obj_sock_adjust_users(obj);
+
 		if (obj->ops->ref_users(obj->ptr) != obj->users) {
 			ckpt_debug("usage leak: %s\n", obj->ops->obj_name);
 			ckpt_write_err(ctx, "OP", "leak: usage (%d != %d (%s)",
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ec98a43..92a21b2 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -29,6 +29,7 @@
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
+#include <net/sock.h>
 
 /* ckpt_ctx: kflags */
 #define CKPT_CTX_CHECKPOINT_BIT		0
@@ -77,6 +78,12 @@ extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 extern char *ckpt_fill_fname(struct path *path, struct path *root,
 			     char *buf, int *len);
 
+/* socket functions */
+extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
+			      struct socket *socket,
+			      struct sockaddr *loc, unsigned *loc_len,
+			      struct sockaddr *rem, unsigned *rem_len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index e4dfbd7..ac16c59 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -12,6 +12,14 @@
 
 #include <linux/types.h>
 
+#ifdef __KERNEL__
+#include <linux/socket.h>
+#include <linux/un.h>
+#else
+#include <sys/socket.h>
+#include <sys/un.h>
+#endif
+
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
@@ -92,6 +100,11 @@ enum {
 	CKPT_HDR_SIGNAL_TASK,
 	CKPT_HDR_SIGPENDING,
 
+	CKPT_HDR_SOCKET = 701,
+	CKPT_HDR_SOCKET_QUEUE,
+	CKPT_HDR_SOCKET_BUFFER,
+	CKPT_HDR_SOCKET_UNIX,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -127,6 +140,7 @@ enum obj_type {
 	CKPT_OBJ_CRED,
 	CKPT_OBJ_USER,
 	CKPT_OBJ_GROUPINFO,
+	CKPT_OBJ_SOCK,
 	CKPT_OBJ_MAX
 };
 
@@ -353,6 +367,7 @@ enum file_type {
 	CKPT_FILE_GENERIC,
 	CKPT_FILE_PIPE,
 	CKPT_FILE_FIFO,
+	CKPT_FILE_SOCKET,
 	CKPT_FILE_MAX
 };
 
@@ -376,6 +391,78 @@ struct ckpt_hdr_file_pipe {
 	__s32 pipe_objref;
 } __attribute__((aligned(8)));
 
+/* socket */
+struct ckpt_hdr_socket {
+	struct ckpt_hdr h;
+
+	struct { /* struct socket */
+		__u64 flags;
+		__u8 state;
+	} socket __attribute__ ((aligned(8)));
+
+	struct { /* struct sock_common */
+		__u32 bound_dev_if;
+		__u32 reuse;
+		__u16 family;
+		__u8 state;
+	} sock_common __attribute__ ((aligned(8)));
+
+	struct { /* struct sock */
+		__s64 rcvlowat;
+		__u64 flags;
+
+		__s64 rcvtimeo;
+		__s64 sndtimeo;
+
+		__u32 err;
+		__u32 err_soft;
+		__u32 priority;
+		__s32 rcvbuf;
+		__s32 sndbuf;
+		__u16 type;
+		__s16 backlog;
+
+		__u8 protocol;
+		__u8 state;
+		__u8 shutdown;
+		__u8 userlocks;
+		__u8 no_check;
+
+		struct linger linger;
+	} sock __attribute__ ((aligned(8)));
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket_queue {
+	struct ckpt_hdr h;
+	__u32 skb_count;
+	__u32 total_bytes;
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket_buffer {
+	struct ckpt_hdr h;
+	__s32 sk_objref;
+	__s32 pr_objref;
+};
+
+#define CKPT_UNIX_LINKED 1
+struct ckpt_hdr_socket_unix {
+	struct ckpt_hdr h;
+	__s32 this;
+	__s32 peer;
+	__u32 peercred_uid;
+	__u32 peercred_gid;
+	__u32 flags;
+	__u32 laddr_len;
+	__u32 raddr_len;
+	struct sockaddr_un laddr;
+	struct sockaddr_un raddr;
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_file_socket {
+	struct ckpt_hdr_file common;
+	__s32 sock_objref;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/net.h b/include/linux/net.h
index b99f350..d1ce6eb 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -232,6 +232,8 @@ extern int   	     sock_sendmsg(struct socket *sock, struct msghdr *msg,
 				  size_t len);
 extern int	     sock_recvmsg(struct socket *sock, struct msghdr *msg,
 				  size_t size, int flags);
+extern int	     sock_attach_fd(struct socket *sock, struct file *file,
+				    int flags);
 extern int 	     sock_map_fd(struct socket *sock, int flags);
 extern struct socket *sockfd_lookup(int fd, int *err);
 #define		     sockfd_put(sock) fput(sock->file)
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 1614d78..e42a714 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -68,4 +68,18 @@ static inline int unix_sysctl_register(struct net *net) { return 0; }
 static inline void unix_sysctl_unregister(struct net *net) {}
 #endif
 #endif
+
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_socket;
+extern int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
+extern int unix_restore(struct ckpt_ctx *ctx, struct socket *sock,
+			struct ckpt_hdr_socket *h);
+extern int unix_collect(struct ckpt_ctx *ctx, struct socket *sock);
+#else
+#define unix_checkpoint NULL
+#define unix_restore NULL
+#define unix_collect NULL
+#endif /* CONFIG_CHECKPOINT */
+
 #endif
diff --git a/include/net/sock.h b/include/net/sock.h
index 12530bf..ec351f9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1646,4 +1646,16 @@ extern int sysctl_optmem_max;
 extern __u32 sysctl_wmem_default;
 extern __u32 sysctl_rmem_default;
 
+#ifdef CONFIG_CHECKPOINT
+/* Checkpoint/Restart Functions */
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern int checkpoint_sock(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_sock(struct ckpt_ctx *ctx);
+extern int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+extern struct file *sock_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *h);
+extern int sock_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#endif
+
 #endif	/* _SOCK_H */
diff --git a/net/Makefile b/net/Makefile
index ba324ae..91d12fe 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -66,3 +66,5 @@ ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_SYSCTL)		+= sysctl_net.o
 endif
 obj-$(CONFIG_WIMAX)		+= wimax/
+
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/net/checkpoint.c b/net/checkpoint.c
new file mode 100644
index 0000000..a11ec7a
--- /dev/null
+++ b/net/checkpoint.c
@@ -0,0 +1,752 @@
+/*
+ *  Copyright 2009 IBM Corporation
+ *
+ *  Author(s): Dan Smith <danms@us.ibm.com>
+ *             Oren Laadan <orenl@cs.columbia.edu>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include <linux/socket.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/syscalls.h>
+#include <linux/sched.h>
+#include <linux/fs_struct.h>
+
+#include <net/af_unix.h>
+#include <net/tcp_states.h>
+
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+struct dq_buffers {
+	struct ckpt_ctx *ctx;
+	struct sock *sk;
+};
+
+static int sock_copy_buffers(struct sk_buff_head *from,
+			     struct sk_buff_head *to,
+			     uint32_t *total_bytes)
+{
+	int count1 = 0;
+	int count2 = 0;
+	int i;
+	struct sk_buff *skb;
+	struct sk_buff **skbs;
+
+	*total_bytes = 0;
+
+	spin_lock(&from->lock);
+	skb_queue_walk(from, skb)
+		count1++;
+	spin_unlock(&from->lock);
+
+	skbs = kzalloc(sizeof(*skbs) * count1, GFP_KERNEL);
+	if (!skbs)
+		return -ENOMEM;
+
+	for (i = 0; i < count1;  i++) {
+		skbs[i] = dev_alloc_skb(0);
+		if (!skbs[i])
+			goto err;
+	}
+
+	i = 0;
+	spin_lock(&from->lock);
+	skb_queue_walk(from, skb) {
+		if (++count2 > count1)
+			break; /* The queue changed as we read it */
+
+		skb_morph(skbs[i], skb);
+		skbs[i]->sk = skb->sk;
+		skb_queue_tail(to, skbs[i]);
+
+		*total_bytes += skb->len;
+		i++;
+	}
+	spin_unlock(&from->lock);
+
+	if (count1 != count2)
+		goto err;
+
+	kfree(skbs);
+
+	return count1;
+ err:
+	while (skb_dequeue(to))
+		; /* Pull all the buffers out of the queue */
+	for (i = 0; i < count1; i++)
+		kfree_skb(skbs[i]);
+	kfree(skbs);
+
+	return -EAGAIN;
+}
+
+static int __sock_write_buffers(struct ckpt_ctx *ctx,
+				struct sk_buff_head *queue,
+				int dst_objref)
+{
+	struct sk_buff *skb;
+
+	skb_queue_walk(queue, skb) {
+		struct ckpt_hdr_socket_buffer *h;
+		int ret = 0;
+
+		/* FIXME: This could be a false positive for non-unix
+		 *        buffers, so add a type check here in the
+		 *        future
+		 */
+		if (UNIXCB(skb).fp) {
+			ckpt_write_err(ctx, "TE", "af_unix: pass fd", -EBUSY);
+			return -EBUSY;
+		}
+
+		/* The other ancillary messages are always present
+		 * unlike descriptors.  Even though we can't detect
+		 * them and fail the checkpoint, we're not at risk
+		 * because we don't save out (or restore) the control
+		 * information contained in the skb.
+		 */
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+		if (!h)
+			return -ENOMEM;
+
+		BUG_ON(!skb->sk);
+		ret = checkpoint_obj(ctx, skb->sk, CKPT_OBJ_SOCK);
+		if (ret < 0)
+			goto end;
+		h->sk_objref = ret;
+		h->pr_objref = dst_objref;
+
+		ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+		if (ret < 0)
+			goto end;
+
+		ret = ckpt_write_obj_type(ctx, skb->data, skb->len,
+					  CKPT_HDR_BUFFER);
+	end:
+		ckpt_hdr_put(ctx, h);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int sock_write_buffers(struct ckpt_ctx *ctx,
+			      struct sk_buff_head *queue,
+			      int dst_objref)
+{
+	struct ckpt_hdr_socket_queue *h;
+	struct sk_buff_head tmpq;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
+	if (!h)
+		return -ENOMEM;
+
+	skb_queue_head_init(&tmpq);
+
+	ret = sock_copy_buffers(queue, &tmpq, &h->total_bytes);
+	if (ret < 0)
+		goto out;
+
+	h->skb_count = ret;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (!ret)
+		ret = __sock_write_buffers(ctx, &tmpq, dst_objref);
+
+ out:
+	ckpt_hdr_put(ctx, h);
+	__skb_queue_purge(&tmpq);
+
+	return ret;
+}
+
+int sock_deferred_write_buffers(void *data)
+{
+	struct dq_buffers *dq = (struct dq_buffers *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	int ret;
+	int dst_objref;
+
+	dst_objref = ckpt_obj_lookup(ctx, dq->sk, CKPT_OBJ_SOCK);
+	if (dst_objref < 0) {
+		ckpt_write_err(ctx, "TE", "socket: owner gone?", dst_objref);
+		return dst_objref;
+	}
+
+	ret = sock_write_buffers(ctx, &dq->sk->sk_receive_queue, dst_objref);
+	ckpt_debug("write recv buffers: %i\n", ret);
+	if (ret < 0)
+		return ret;
+
+	ret = sock_write_buffers(ctx, &dq->sk->sk_write_queue, dst_objref);
+	ckpt_debug("write send buffers: %i\n", ret);
+
+	return ret;
+}
+
+int sock_defer_write_buffers(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct dq_buffers dq;
+
+	dq.ctx = ctx;
+	dq.sk = sk;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      sock_deferred_write_buffers, NULL);
+}
+
+int ckpt_sock_getnames(struct ckpt_ctx *ctx, struct socket *sock,
+		       struct sockaddr *loc, unsigned *loc_len,
+		       struct sockaddr *rem, unsigned *rem_len)
+{
+	int ret;
+
+	ret = sock_getname(sock, loc, loc_len);
+	if (ret) {
+		ckpt_write_err(ctx, "TEP", "socket: getname local", ret, sock);
+		return -EINVAL;
+	}
+
+	ret = sock_getpeer(sock, rem, rem_len);
+	if (ret) {
+		if ((sock->sk->sk_type != SOCK_DGRAM) &&
+		    (sock->sk->sk_state == TCP_ESTABLISHED)) {
+			ckpt_write_err(ctx, "TEP", "socket: getname peer",
+				       ret, sock);
+			return -EINVAL;
+		}
+		*rem_len = 0;
+	}
+
+	return 0;
+}
+
+static int sock_cptrst_verify(struct ckpt_hdr_socket *h)
+{
+	uint8_t userlocks_mask = SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK |
+		                 SOCK_BINDADDR_LOCK | SOCK_BINDPORT_LOCK;
+
+	if (h->sock.shutdown & ~SHUTDOWN_MASK)
+		return -EINVAL;
+	if (h->sock.userlocks & ~userlocks_mask)
+		return -EINVAL;
+	if (!ckpt_validate_errno(h->sock.err))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int sock_cptrst_opt(int op, struct socket *sock,
+			   int optname, char *opt, int len)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	if (op == CKPT_CPT)
+		ret = sock_getsockopt(sock, SOL_SOCKET, optname, opt, &len);
+	else
+		ret = sock_setsockopt(sock, SOL_SOCKET, optname, opt, len);
+
+	set_fs(fs);
+
+	return ret;
+}
+
+#define CKPT_COPY_SOPT(op, sk, name, opt) \
+	sock_cptrst_opt(op, sk->sk_socket, name, (char *)opt, sizeof(*opt))
+
+static int sock_cptrst_bufopts(int op, struct sock *sk,
+			       struct ckpt_hdr_socket *h)
+{
+	if (CKPT_COPY_SOPT(op, sk, SO_RCVBUF, &h->sock.rcvbuf))
+		if ((op == CKPT_RST) &&
+		    CKPT_COPY_SOPT(op, sk, SO_RCVBUFFORCE, &h->sock.rcvbuf)) {
+			ckpt_debug("Failed to set SO_RCVBUF");
+			return -EINVAL;
+		}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_SNDBUF, &h->sock.sndbuf))
+		if ((op == CKPT_RST) &&
+		    CKPT_COPY_SOPT(op, sk, SO_SNDBUFFORCE, &h->sock.sndbuf)) {
+			ckpt_debug("Failed to set SO_SNDBUF");
+			return -EINVAL;
+		}
+
+	/* It's silly that we have to fight ourselves here, but
+	 * sock_setsockopt() doubles the initial value, so divide here
+	 * to store the user's value and avoid doubling on restart
+	 */
+	if ((op == CKPT_CPT) && (h->sock.rcvbuf != SOCK_MIN_RCVBUF))
+		h->sock.rcvbuf >>= 1;
+
+	if ((op == CKPT_CPT) && (h->sock.sndbuf != SOCK_MIN_SNDBUF))
+		h->sock.sndbuf >>= 1;
+
+	return 0;
+}
+
+struct sock_flag_mapping {
+	int opt;
+	int flag;
+};
+
+struct sock_flag_mapping sk_flag_map[] = {
+	{SO_OOBINLINE, SOCK_URGINLINE},
+	{SO_KEEPALIVE, SOCK_KEEPOPEN},
+	{SO_BROADCAST, SOCK_BROADCAST},
+	{SO_TIMESTAMP, SOCK_RCVTSTAMP},
+	{SO_TIMESTAMPNS, SOCK_RCVTSTAMPNS},
+	{SO_DEBUG, SOCK_DBG},
+	{SO_DONTROUTE, SOCK_LOCALROUTE},
+};
+
+struct sock_flag_mapping sock_flag_map[] = {
+	{SO_PASSCRED, SOCK_PASSCRED},
+};
+
+static int sock_restore_flag(struct socket *sock,
+			     unsigned long *flags,
+			     int flag,
+			     int option)
+{
+	int v = 1;
+	int ret = 0;
+
+	if (test_and_clear_bit(flag, flags))
+		ret = sock_setsockopt(sock, SOL_SOCKET, option,
+				      (char *)&v, sizeof(v));
+
+	return ret;
+}
+
+
+static int sock_restore_flags(struct socket *sock, struct ckpt_hdr_socket *h)
+{
+	unsigned long sk_flags = h->sock.flags;
+	unsigned long sock_flags = h->socket.flags;
+	int ret;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(sk_flag_map); i++) {
+		int opt = sk_flag_map[i].opt;
+		int flag = sk_flag_map[i].flag;
+		ret = sock_restore_flag(sock, &sk_flags, flag, opt);
+		if (ret) {
+			ckpt_debug("Failed to set skopt %i: %i\n", opt, ret);
+			return ret;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(sock_flag_map); i++) {
+		int opt = sock_flag_map[i].opt;
+		int flag = sock_flag_map[i].flag;
+		ret = sock_restore_flag(sock, &sock_flags, flag, opt);
+		if (ret) {
+			ckpt_debug("Failed to set sockopt %i: %i\n", opt, ret);
+			return ret;
+		}
+	}
+
+	/* TODO: Handle SOCK_TIMESTAMPING_* flags */
+	if (test_bit(SOCK_TIMESTAMPING_TX_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_TX_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RX_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RX_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RAW_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_SYS_HARDWARE, &sk_flags)) {
+		ckpt_debug("SOF_TIMESTAMPING_* flags are not supported\n");
+		return -ENOSYS;
+	}
+
+	if (test_and_clear_bit(SOCK_DEAD, &sk_flags))
+		sock_set_flag(sock->sk, SOCK_DEAD);
+
+
+	/* Anything that is still set in the flags that isn't part of
+	 * our protocol's default set, indicates an error
+	 */
+	if (sk_flags & ~sock->sk->sk_flags) {
+		ckpt_debug("Unhandled sock flags: %lx\n", sk_flags);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int sock_copy_timeval(int op, struct sock *sk,
+			     int sockopt, __s64 *saved)
+{
+	struct timeval tv;
+
+	if (op == CKPT_CPT) {
+		if (CKPT_COPY_SOPT(op, sk, sockopt, &tv))
+			return -EINVAL;
+		*saved = timeval_to_ns(&tv);
+	} else {
+		tv = ns_to_timeval(*saved);
+		if (CKPT_COPY_SOPT(op, sk, sockopt, &tv))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int sock_cptrst(struct ckpt_ctx *ctx, struct sock *sk,
+		       struct ckpt_hdr_socket *h, int op)
+{
+	if (sk->sk_socket) {
+		CKPT_COPY(op, h->socket.state, sk->sk_socket->state);
+	}
+
+	CKPT_COPY(op, h->sock_common.bound_dev_if, sk->sk_bound_dev_if);
+	CKPT_COPY(op, h->sock_common.family, sk->sk_family);
+
+	CKPT_COPY(op, h->sock.shutdown, sk->sk_shutdown);
+	CKPT_COPY(op, h->sock.userlocks, sk->sk_userlocks);
+	CKPT_COPY(op, h->sock.no_check, sk->sk_no_check);
+	CKPT_COPY(op, h->sock.protocol, sk->sk_protocol);
+	CKPT_COPY(op, h->sock.err, sk->sk_err);
+	CKPT_COPY(op, h->sock.err_soft, sk->sk_err_soft);
+	CKPT_COPY(op, h->sock.type, sk->sk_type);
+	CKPT_COPY(op, h->sock.state, sk->sk_state);
+	CKPT_COPY(op, h->sock.backlog, sk->sk_max_ack_backlog);
+
+	if (sock_cptrst_bufopts(op, sk, h))
+		return -EINVAL;
+
+	if (CKPT_COPY_SOPT(op, sk, SO_REUSEADDR, &h->sock_common.reuse)) {
+		ckpt_debug("Failed to set SO_REUSEADDR");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_PRIORITY, &h->sock.priority)) {
+		ckpt_debug("Failed to set SO_PRIORITY");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_RCVLOWAT, &h->sock.rcvlowat)) {
+		ckpt_debug("Failed to set SO_RCVLOWAT");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_LINGER, &h->sock.linger)) {
+		ckpt_debug("Failed to set SO_LINGER");
+		return -EINVAL;
+	}
+
+	if (sock_copy_timeval(op, sk, SO_SNDTIMEO, &h->sock.sndtimeo)) {
+		ckpt_debug("Failed to set SO_SNDTIMEO");
+		return -EINVAL;
+	}
+
+	if (sock_copy_timeval(op, sk, SO_RCVTIMEO, &h->sock.rcvtimeo)) {
+		ckpt_debug("Failed to set SO_RCVTIMEO");
+		return -EINVAL;
+	}
+
+	if (op == CKPT_CPT) {
+		h->sock.flags = sk->sk_flags;
+		h->socket.flags = sk->sk_socket->flags;
+	} else {
+		int ret;
+		mm_segment_t old_fs;
+
+		old_fs = get_fs();
+		set_fs(KERNEL_DS);
+		ret = sock_restore_flags(sk->sk_socket, h);
+		set_fs(old_fs);
+		if (ret)
+			return ret;
+	}
+
+	if ((h->socket.state == SS_CONNECTED) &&
+	    (h->sock.state != TCP_ESTABLISHED)) {
+		ckpt_debug("socket/sock in inconsistent state: %i/%i",
+			   h->socket.state, h->sock.state);
+		return -EINVAL;
+	} else if ((h->sock.state < TCP_ESTABLISHED) ||
+		   (h->sock.state >= TCP_MAX_STATES)) {
+		ckpt_debug("sock in invalid state: %i", h->sock.state);
+		return -EINVAL;
+	} else if ((h->socket.state < SS_FREE) ||
+		   (h->socket.state > SS_DISCONNECTING)) {
+		ckpt_debug("socket in invalid state: %i",
+			   h->socket.state);
+		return -EINVAL;
+	}
+
+	if (op == CKPT_RST)
+		return sock_cptrst_verify(h);
+	else
+		return 0;
+}
+
+static int __do_sock_checkpoint(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct socket *sock = sk->sk_socket;
+	struct ckpt_hdr_socket *h;
+	int ret;
+
+	if (!sock->ops->checkpoint) {
+		ckpt_write_err(ctx, "TEVP", "socket: proto_ops",
+			       -ENOSYS, sock->ops, sock);
+		return -ENOSYS;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (!h)
+		return -ENOMEM;
+
+	/* part I: common to all sockets */
+	ret = sock_cptrst(ctx, sk, h, CKPT_CPT);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	/* part II: per socket type state */
+	ret = sock->ops->checkpoint(ctx, sock);
+	if (ret < 0)
+		goto out;
+
+	/* part III: socket buffers */
+	if ((sk->sk_state != TCP_LISTEN) && (!sock_flag(sk, SOCK_DEAD)))
+		ret = sock_defer_write_buffers(ctx, sk);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int do_sock_checkpoint(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct socket *sock;
+	int ret;
+
+	if (sk->sk_socket)
+		return __do_sock_checkpoint(ctx, sk);
+
+	/* Temporarily adopt this orphan socket */
+	ret = sock_create(sk->sk_family, sk->sk_type, 0, &sock);
+	if (ret < 0)
+		return ret;
+	sock_graft(sk, sock);
+
+	ret = __do_sock_checkpoint(ctx, sk);
+
+	sock_orphan(sk);
+	sock->sk = NULL;
+	sock_release(sock);
+
+	return ret;
+}
+
+int checkpoint_sock(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_sock_checkpoint(ctx, (struct sock *)ptr);
+}
+
+int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_socket *h;
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_SOCKET;
+
+	h->sock_objref = checkpoint_obj(ctx, sk, CKPT_OBJ_SOCK);
+	if (h->sock_objref < 0) {
+		ret = h->sock_objref;
+		goto out;
+	}
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int sock_collect_skbs(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
+{
+	struct sk_buff_head tmpq;
+	struct sk_buff *skb;
+	int ret = 0;
+	uint32_t bytes;
+
+	skb_queue_head_init(&tmpq);
+
+	ret = sock_copy_buffers(queue, &tmpq, &bytes);
+	if (ret < 0)
+		return ret;
+
+	skb_queue_walk(&tmpq, skb) {
+		/* Socket buffers do not maintain a ref count on their
+		 * owning sock because they're counted in sock_wmem_alloc.
+		 * So, we only need to collect sockets from the queue that
+		 * won't be collected any other way (i.e. DEAD sockets that
+		 * are hanging around only because they're waiting for us
+		 * to process their skb).
+		 */
+
+		if (!ckpt_obj_lookup(ctx, skb->sk, CKPT_OBJ_SOCK) &&
+		    sock_flag(skb->sk, SOCK_DEAD)) {
+			ret = ckpt_obj_collect(ctx, skb->sk, CKPT_OBJ_SOCK);
+			if (ret < 0)
+				break;
+		}
+	}
+
+	__skb_queue_purge(&tmpq);
+
+	return ret;
+}
+
+int sock_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	int ret;
+
+	ret = sock_collect_skbs(ctx, &sk->sk_write_queue);
+	if (ret < 0)
+		return ret;
+
+	ret = sock_collect_skbs(ctx, &sk->sk_receive_queue);
+	if (ret < 0)
+		return ret;
+
+	ret = ckpt_obj_collect(ctx, sk, CKPT_OBJ_SOCK);
+	if (ret < 0)
+		return ret;
+
+	if (sock->ops->collect)
+		ret = sock->ops->collect(ctx, sock);
+
+	return ret;
+}
+
+static struct file *sock_alloc_attach_fd(struct socket *sock)
+{
+	struct file *file;
+	int err;
+
+	file = get_empty_filp();
+	if (!file)
+		return ERR_PTR(-ENOMEM);
+
+	err = sock_attach_fd(sock, file, 0);
+	if (err < 0) {
+		put_filp(file);
+		return ERR_PTR(err);
+	}
+
+	/* Since objhash assumes the initial reference for a socket,
+	 * we bump it here for this descriptor, unlike other places in
+	 * the socket code which assume the descriptor is the owner.
+	 */
+	sock_hold(sock->sk);
+
+	return file;
+}
+
+struct sock *do_sock_restore(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_socket *h;
+	struct socket *sock;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	/* silently clear flags, e.g. SOCK_NONBLOCK or SOCK_CLOEXEC */
+	h->sock.type &= SOCK_TYPE_MASK;
+
+	ret = sock_create(h->sock_common.family, h->sock.type, 0, &sock);
+	if (ret < 0)
+		goto err;
+
+	if (!sock->ops->restore) {
+		ckpt_debug("proto_ops lacks restore: %pS\n", sock->ops);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	/*
+	 * part II: per socket type state
+	 * (also takes care of part III: socket buffer)
+	 */
+	ret = sock->ops->restore(ctx, sock, h);
+	if (ret < 0)
+		goto err;
+
+	/* part I: common to all sockets */
+	ret = sock_cptrst(ctx, sock->sk, h, CKPT_RST);
+	if (ret < 0)
+		goto err;
+
+	ckpt_hdr_put(ctx, h);
+	return sock->sk;
+ err:
+	ckpt_hdr_put(ctx, h);
+	sock_release(sock);
+	return ERR_PTR(ret);
+}
+
+void *restore_sock(struct ckpt_ctx *ctx)
+{
+	return do_sock_restore(ctx);
+}
+
+struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_socket *h = (struct ckpt_hdr_file_socket *)ptr;
+	struct sock *sk;
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE || ptr->f_type != CKPT_FILE_SOCKET)
+		return ERR_PTR(-EINVAL);
+
+	sk = ckpt_obj_fetch(ctx, h->sock_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk))
+		return ERR_PTR(PTR_ERR(sk));
+
+	file = sock_alloc_attach_fd(sk->sk_socket);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		return ERR_PTR(ret);
+	}
+
+	return file;
+}
diff --git a/net/socket.c b/net/socket.c
index 63c4498..0a4d539 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -140,6 +140,10 @@ static const struct file_operations socket_file_ops = {
 	.sendpage =	sock_sendpage,
 	.splice_write = generic_splice_sendpage,
 	.splice_read =	sock_splice_read,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint =   sock_file_checkpoint,
+	.collect = sock_file_collect,
+#endif
 };
 
 /*
@@ -368,7 +372,7 @@ static int sock_alloc_fd(struct file **filep, int flags)
 	return fd;
 }
 
-static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
+int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
diff --git a/net/unix/Makefile b/net/unix/Makefile
index b852a2b..fbff1e6 100644
--- a/net/unix/Makefile
+++ b/net/unix/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_UNIX)	+= unix.o
 
 unix-y			:= af_unix.o garbage.o
 unix-$(CONFIG_SYSCTL)	+= sysctl_net_unix.o
+unix-$(CONFIG_CHECKPOINT) += checkpoint.o
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index fc3ebb9..b3d4f16 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -523,6 +523,9 @@ static const struct proto_ops unix_stream_ops = {
 	.recvmsg =	unix_stream_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static const struct proto_ops unix_dgram_ops = {
@@ -544,6 +547,9 @@ static const struct proto_ops unix_dgram_ops = {
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static const struct proto_ops unix_seqpacket_ops = {
@@ -565,6 +571,9 @@ static const struct proto_ops unix_seqpacket_ops = {
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =      unix_collect,
 };
 
 static struct proto unix_proto = {
diff --git a/net/unix/checkpoint.c b/net/unix/checkpoint.c
new file mode 100644
index 0000000..8b7cb22
--- /dev/null
+++ b/net/unix/checkpoint.c
@@ -0,0 +1,634 @@
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/fs_struct.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/user.h>
+#include <net/af_unix.h>
+#include <net/tcp_states.h>
+
+struct dq_join {
+	struct ckpt_ctx *ctx;
+	int src_objref;
+	int dst_objref;
+};
+
+struct dq_buffers {
+	struct ckpt_ctx *ctx;
+	int sk_objref; /* objref of the socket these buffers belong to */
+};
+
+#define UNIX_ADDR_EMPTY(a) ((a) <= sizeof(short))
+
+static inline int unix_need_cwd(struct sockaddr_un *addr, unsigned long len)
+{
+	return (!UNIX_ADDR_EMPTY(len)) &&
+		addr->sun_path[0] &&
+		(addr->sun_path[0] != '/');
+}
+
+static int unix_join(struct sock *src, struct sock *dst)
+{
+	if (unix_sk(src)->peer != NULL)
+		return 0; /* We're second */
+
+	sock_hold(dst);
+	unix_sk(src)->peer = dst;
+
+	return 0;
+
+}
+
+static int unix_deferred_join(void *data)
+{
+	struct dq_join *dq = (struct dq_join *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	struct sock *src;
+	struct sock *dst;
+
+	src = ckpt_obj_fetch(ctx, dq->src_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(src)) {
+		ckpt_debug("Missing src sock ref %i\n", dq->src_objref);
+		return -EINVAL;
+	}
+
+	dst = ckpt_obj_fetch(ctx, dq->dst_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(dst)) {
+		ckpt_debug("Missing dst sock ref %i\n", dq->dst_objref);
+		return -EINVAL;
+	}
+
+	return unix_join(src, dst);
+}
+
+static int unix_defer_join(struct ckpt_ctx *ctx,
+			   int src_objref,
+			   int dst_objref)
+{
+	struct dq_join dq;
+
+	dq.ctx = ctx;
+	dq.src_objref = src_objref;
+	dq.dst_objref = dst_objref;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      unix_deferred_join, NULL);
+}
+
+static int unix_write_cwd(struct ckpt_ctx *ctx,
+			  struct sock *sk, const char *sockpath)
+{
+	struct path path;
+	char *buf;
+	char *fqpath;
+	int offset;
+	int len = PATH_MAX;
+	int ret = -ENOENT;
+
+	buf = kmalloc(len, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	path.dentry = unix_sk(sk)->dentry;
+	path.mnt = unix_sk(sk)->mnt;
+
+	fqpath = ckpt_fill_fname(&path, &ctx->fs_mnt, buf, &len);
+	if (IS_ERR(fqpath)) {
+		ret = PTR_ERR(fqpath);
+		goto out;
+	}
+
+	offset = strlen(fqpath) - strlen(sockpath);
+	if (offset <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	fqpath[offset] = '\0';
+
+	ckpt_debug("writing socket directory: %s\n", fqpath);
+	ret = ckpt_write_string(ctx, fqpath, offset + 1);
+ out:
+	kfree(buf);
+	return ret;
+}
+
+int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock)
+{
+	struct unix_sock *sk = unix_sk(sock->sk);
+	struct ckpt_hdr_socket_unix *un;
+	int new;
+	int ret = -ENOMEM;
+
+	if ((sock->sk->sk_state == TCP_LISTEN) &&
+	    !skb_queue_empty(&sock->sk->sk_receive_queue)) {
+		ckpt_write_err(ctx, "TEP", "af_unix: listen with pending peers",
+			       -EBUSY, sock);
+		return -EBUSY;
+	}
+
+	un = ckpt_hdr_get_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (!un)
+		return -ENOMEM;
+
+	ret = ckpt_sock_getnames(ctx, sock,
+				 (struct sockaddr *)&un->laddr, &un->laddr_len,
+				 (struct sockaddr *)&un->raddr, &un->raddr_len);
+	if (ret)
+		goto out;
+
+	if (sk->dentry && (sk->dentry->d_inode->i_nlink > 0))
+		un->flags |= CKPT_UNIX_LINKED;
+
+	un->this = ckpt_obj_lookup_add(ctx, sk, CKPT_OBJ_SOCK, &new);
+	if (un->this < 0)
+		goto out;
+
+	if (sk->peer)
+		un->peer = checkpoint_obj(ctx, sk->peer, CKPT_OBJ_SOCK);
+	else
+		un->peer = 0;
+
+	if (un->peer < 0) {
+		ret = un->peer;
+		goto out;
+	}
+
+	un->peercred_uid = sock->sk->sk_peercred.uid;
+	un->peercred_gid = sock->sk->sk_peercred.gid;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) un);
+	if (ret < 0)
+		goto out;
+
+	if (unix_need_cwd(&un->laddr, un->laddr_len))
+		ret = unix_write_cwd(ctx, sock->sk, un->laddr.sun_path);
+ out:
+	ckpt_hdr_put(ctx, un);
+
+	return ret;
+}
+
+int unix_collect(struct ckpt_ctx *ctx, struct socket *sock)
+{
+	struct unix_sock *sk = unix_sk(sock->sk);
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, sock->sk, CKPT_OBJ_SOCK);
+	if (ret < 0)
+		return ret;
+
+	if (sk->peer)
+		ret = ckpt_obj_collect(ctx, sk->peer, CKPT_OBJ_SOCK);
+
+	return ret < 0 ? ret : 0;
+}
+
+static int sock_read_buffer_sendmsg(struct ckpt_ctx *ctx,
+				    struct sockaddr *addr,
+				    unsigned int addrlen)
+{
+	struct ckpt_hdr_socket_buffer *h;
+	struct sock *sk;
+	struct msghdr msg;
+	struct kvec kvec;
+	uint8_t sock_shutdown;
+	uint8_t peer_shutdown = 0;
+	void *buf = NULL;
+	int sndbuf;
+	int len;
+	int ret = 0;
+
+	memset(&msg, 0, sizeof(msg));
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_BUFFER);
+	if (len < 0) {
+		ret = len;
+		goto out;
+	} else if (len > SKB_MAX_ALLOC) {
+		ckpt_debug("Socket buffer too big (%i > %lu)",
+			   len, SKB_MAX_ALLOC);
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	sk = ckpt_obj_fetch(ctx, h->sk_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk)) {
+		ret = PTR_ERR(sk);
+		goto out;
+	}
+
+	/* If we have neither a destination address nor a peer, the
+	 * buffer header still names the peer this skb was queued for
+	 * (pr_objref), so join with that peer before sending
+	 */
+	if (!addrlen && !unix_sk(sk)->peer) {
+		struct sock *pr;
+		pr = ckpt_obj_fetch(ctx, h->pr_objref, CKPT_OBJ_SOCK);
+		if (IS_ERR(pr)) {
+			ckpt_debug("Failed to get our peer: %li\n", PTR_ERR(pr));
+			ret = PTR_ERR(pr);
+			goto out;
+		}
+		ret = unix_join(sk, pr);
+		if (ret < 0) {
+			ckpt_debug("Failed to join: %i\n", ret);
+			goto out;
+		}
+	}
+
+	kvec.iov_len = len;
+	buf = kmalloc(len, GFP_KERNEL);
+	kvec.iov_base = buf;
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = ckpt_kread(ctx, kvec.iov_base, len);
+	if (ret < 0)
+		goto out;
+
+	msg.msg_name = addr;
+	msg.msg_namelen = addrlen;
+
+	/* If this socket is shut down, temporarily clear that so we can send */
+	sock_shutdown = sk->sk_shutdown;
+	sk->sk_shutdown &= ~SHUTDOWN_MASK;
+
+	/* Unshutdown peer too, if necessary */
+	if (unix_sk(sk)->peer) {
+		peer_shutdown = unix_sk(sk)->peer->sk_shutdown;
+		unix_sk(sk)->peer->sk_shutdown &= ~SHUTDOWN_MASK;
+	}
+
+	/* Make sure there's room in the send buffer */
+	sndbuf = sk->sk_sndbuf;
+	if (((sk->sk_sndbuf - atomic_read(&sk->sk_wmem_alloc)) < len) &&
+	    capable(CAP_NET_ADMIN))
+		sk->sk_sndbuf += len;
+	else
+		sk->sk_sndbuf = sysctl_wmem_max;
+
+	ret = kernel_sendmsg(sk->sk_socket, &msg, &kvec, 1, len);
+	ckpt_debug("kernel_sendmsg(%i,%i): %i\n", h->sk_objref, len, ret);
+	if ((ret > 0) && (ret != len))
+		ret = -ENOMEM;
+
+	sk->sk_sndbuf = sndbuf;
+	sk->sk_shutdown = sock_shutdown;
+	if (peer_shutdown)
+		unix_sk(sk)->peer->sk_shutdown = peer_shutdown;
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(buf);
+	return ret;
+}
+
+static int unix_read_buffers(struct ckpt_ctx *ctx,
+			     struct sockaddr *addr,
+			     unsigned int addrlen)
+{
+	struct ckpt_hdr_socket_queue *h;
+	int ret = 0;
+	int i;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	for (i = 0; i < h->skb_count; i++) {
+		ret = sock_read_buffer_sendmsg(ctx, addr, addrlen);
+		ckpt_debug("read_buffer_sendmsg(%i): %i\n", i, ret);
+		if (ret < 0)
+			goto out;
+
+		if (ret > h->total_bytes) {
+			ckpt_debug("Buffers exceeded claim");
+			ret = -EINVAL;
+			goto out;
+		}
+
+		h->total_bytes -= ret;
+		ret = 0;
+	}
+
+	ret = h->skb_count;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int unix_deferred_restore_buffers(void *data)
+{
+	struct dq_buffers *dq = (struct dq_buffers *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	struct sock *sk;
+	struct sockaddr *addr = NULL;
+	unsigned int addrlen = 0;
+	int ret;
+
+	sk = ckpt_obj_fetch(ctx, dq->sk_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk)) {
+		ckpt_debug("Missing sock ref %i\n", dq->sk_objref);
+		return PTR_ERR(sk);
+	}
+
+	if ((sk->sk_type == SOCK_DGRAM) && (unix_sk(sk)->addr != NULL)) {
+		addr = (struct sockaddr *)&unix_sk(sk)->addr->name;
+		addrlen = unix_sk(sk)->addr->len;
+	}
+
+	ret = unix_read_buffers(ctx, addr, addrlen);
+	ckpt_debug("read recv buffers: %i\n", ret);
+	if (ret < 0)
+		return ret;
+
+	ret = unix_read_buffers(ctx, addr, addrlen);
+	ckpt_debug("read send buffers: %i\n", ret);
+	if (ret > 0)
+		ret = -EINVAL; /* No send buffers for UNIX sockets */
+
+	return ret;
+}
+
+static int unix_defer_restore_buffers(struct ckpt_ctx *ctx, int sk_objref)
+{
+	struct dq_buffers dq;
+
+	dq.ctx = ctx;
+	dq.sk_objref = sk_objref;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      unix_deferred_restore_buffers, NULL);
+}
+
+static struct unix_address *unix_makeaddr(struct sockaddr_un *sun_addr,
+					  unsigned len)
+{
+	struct unix_address *addr;
+
+	if (len > sizeof(struct sockaddr_un))
+		return ERR_PTR(-EINVAL);
+
+	addr = kmalloc(sizeof(*addr) + len, GFP_KERNEL);
+	if (!addr)
+		return ERR_PTR(-ENOMEM);
+
+	memcpy(addr->name, sun_addr, len);
+	addr->len = len;
+	atomic_set(&addr->refcnt, 1);
+
+	return addr;
+}
+
+static int unix_restore_connected(struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_socket *h,
+				  struct ckpt_hdr_socket_unix *un,
+				  struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct sockaddr *addr = NULL;
+	unsigned long flags = h->sock.flags;
+	unsigned int addrlen = 0;
+	int dead = test_bit(SOCK_DEAD, &flags);
+	int ret = 0;
+
+
+	if (un->peer == 0) {
+		/* These get propagated to the msghdr, so only set them
+		 * if we're not connected to a peer, else we'll get an error
+		 * when we sendmsg()
+		 */
+		addr = (struct sockaddr *)&un->laddr;
+		addrlen = un->laddr_len;
+	}
+
+	sk->sk_peercred.pid = task_tgid_vnr(current);
+
+	if (may_setuid(ctx->realcred->user->user_ns, un->peercred_uid) &&
+	    may_setgid(un->peercred_gid)) {
+		sk->sk_peercred.uid = un->peercred_uid;
+		sk->sk_peercred.gid = un->peercred_gid;
+	} else {
+		ckpt_debug("peercred %i:%i would require setuid",
+			   un->peercred_uid, un->peercred_gid);
+		return -EPERM;
+	}
+
+	if (!dead && (un->peer > 0)) {
+		ret = unix_defer_join(ctx, un->this, un->peer);
+		ckpt_debug("unix_defer_join: %i\n", ret);
+	}
+
+	if (!dead && !ret)
+		ret = unix_defer_restore_buffers(ctx, un->this);
+
+	return ret;
+}
+
+static int unix_unlink(const char *name)
+{
+	struct path spath;
+	struct path ppath;
+	int ret;
+
+	ret = kern_path(name, 0, &spath);
+	if (ret)
+		return ret;
+
+	ret = kern_path(name, LOOKUP_PARENT, &ppath);
+	if (ret)
+		goto out_s;
+
+	if (!spath.dentry) {
+		ckpt_debug("No dentry found for %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	if (!ppath.dentry || !ppath.dentry->d_inode) {
+		ckpt_debug("No inode for parent of %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	ret = vfs_unlink(ppath.dentry->d_inode, spath.dentry);
+ out_p:
+	path_put(&ppath);
+ out_s:
+	path_put(&spath);
+
+	return ret;
+}
+
+/* Call bind() for socket, optionally changing (temporarily) to @path first
+ * if non-NULL
+ */
+static int unix_chdir_and_bind(struct socket *sock,
+			       const char *path,
+			       struct sockaddr *addr,
+			       unsigned long addrlen)
+{
+	struct sockaddr_un *un = (struct sockaddr_un *)addr;
+	struct path cur = { .mnt = NULL, .dentry = NULL };
+	struct path dir = { .mnt = NULL, .dentry = NULL };
+	int ret;
+
+	if (path) {
+		ckpt_debug("switching to cwd %s for unix bind", path);
+
+		ret = kern_path(path, 0, &dir);
+		if (ret)
+			return ret;
+
+		ret = inode_permission(dir.dentry->d_inode,
+				       MAY_EXEC | MAY_ACCESS);
+		if (ret)
+			goto out;
+
+		write_lock(&current->fs->lock);
+		cur = current->fs->pwd;
+		current->fs->pwd = dir;
+		write_unlock(&current->fs->lock);
+	}
+
+	ret = unix_unlink(un->sun_path);
+	ckpt_debug("unlink(%s): %i\n", un->sun_path, ret);
+	if ((ret == 0) || (ret == -ENOENT))
+		ret = sock_bind(sock, addr, addrlen);
+
+	if (path) {
+		write_lock(&current->fs->lock);
+		current->fs->pwd = cur;
+		write_unlock(&current->fs->lock);
+	}
+ out:
+	if (path)
+		path_put(&dir);
+
+	return ret;
+}
+
+static int unix_fakebind(struct socket *sock,
+			 struct sockaddr_un *addr, unsigned long len)
+{
+	struct unix_address *uaddr;
+
+	uaddr = unix_makeaddr(addr, len);
+	if (IS_ERR(uaddr))
+		return PTR_ERR(uaddr);
+
+	unix_sk(sock->sk)->addr = uaddr;
+
+	return 0;
+}
+
+static int unix_restore_bind(struct ckpt_hdr_socket *h,
+			     struct ckpt_hdr_socket_unix *un,
+			     struct socket *sock,
+			     const char *path)
+{
+	struct sockaddr *addr = (struct sockaddr *)&un->laddr;
+	unsigned long len = un->laddr_len;
+	unsigned long flags = h->sock.flags;
+	int dead = test_bit(SOCK_DEAD, &flags);
+
+	if (dead)
+		return unix_fakebind(sock, &un->laddr, len);
+	else if (!un->laddr.sun_path[0])
+		return sock_bind(sock, addr, len);
+	else if (!(un->flags & CKPT_UNIX_LINKED))
+		return unix_fakebind(sock, &un->laddr, len);
+	else
+		return unix_chdir_and_bind(sock, path, addr, len);
+}
+
+/* Some easy pre-flight checks before we get underway */
+static int unix_precheck(struct socket *sock, struct ckpt_hdr_socket *h)
+{
+	struct net *net = sock_net(sock->sk);
+	unsigned long sk_flags = h->sock.flags;
+
+	if ((h->socket.state == SS_CONNECTING) ||
+	    (h->socket.state == SS_DISCONNECTING) ||
+	    (h->socket.state == SS_FREE)) {
+		ckpt_debug("AF_UNIX socket can't be SS_(DIS)CONNECTING");
+		return -EINVAL;
+	}
+
+	/* AF_UNIX overloads the backlog setting to define the maximum
+	 * queue length for DGRAM sockets.  Make sure we don't let the
+	 * caller exceed that value on restart.
+	 */
+	if ((h->sock.type == SOCK_DGRAM) &&
+	    (h->sock.backlog > net->unx.sysctl_max_dgram_qlen)) {
+		ckpt_debug("DGRAM backlog of %i exceeds system max of %i\n",
+			   h->sock.backlog, net->unx.sysctl_max_dgram_qlen);
+		return -EINVAL;
+	}
+
+	if (test_bit(SOCK_USE_WRITE_QUEUE, &sk_flags)) {
+		ckpt_debug("AF_UNIX socket has SOCK_USE_WRITE_QUEUE set");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int unix_restore(struct ckpt_ctx *ctx, struct socket *sock,
+		 struct ckpt_hdr_socket *h)
+
+{
+	struct ckpt_hdr_socket_unix *un;
+	int ret = -EINVAL;
+	char *cwd = NULL;
+
+	ret = unix_precheck(sock, h);
+	if (ret)
+		return ret;
+
+	un = ckpt_read_obj_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (IS_ERR(un))
+		return PTR_ERR(un);
+
+	if (un->peer < 0)
+		goto out;
+
+	if (unix_need_cwd(&un->laddr, un->laddr_len)) {
+		cwd = ckpt_read_string(ctx, PATH_MAX);
+		if (IS_ERR(cwd)) {
+			ret = PTR_ERR(cwd);
+			goto out;
+		}
+	}
+
+	if ((h->sock.state != TCP_ESTABLISHED) &&
+	    !UNIX_ADDR_EMPTY(un->laddr_len)) {
+		ret = unix_restore_bind(h, un, sock, cwd);
+		if (ret)
+			goto out;
+	}
+
+	if ((h->sock.state == TCP_ESTABLISHED) || (h->sock.state == TCP_CLOSE))
+		ret = unix_restore_connected(ctx, h, un, sock);
+	else if (h->sock.state == TCP_LISTEN)
+		ret = sock->ops->listen(sock, h->sock.backlog);
+	else
+		ckpt_debug("unsupported UNIX socket state %i\n", h->sock.state);
+ out:
+	ckpt_hdr_put(ctx, un);
+	kfree(cwd);
+	return ret;
+}
-- 
1.6.0.4
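
For readers following unix_write_cwd() in the patch above: it derives the
directory a relative socket path was bound in by cutting the bound name off
the end of the fully qualified path.  Below is a minimal userspace sketch of
that arithmetic.  The helper name split_bind_dir() and its error codes are
hypothetical, and the suffix check is an extra safety step that the patch
itself does not perform.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of the prefix computation in
 * unix_write_cwd(): copy the directory part of @fqpath (everything
 * before the trailing @sockpath) into @dir, keeping the final '/'.
 *
 * Returns the prefix length, or a negative errno value.
 */
int split_bind_dir(const char *fqpath, const char *sockpath,
		   char *dir, size_t dirlen)
{
	size_t flen = strlen(fqpath);
	size_t slen = strlen(sockpath);
	size_t offset;

	/* Same condition as "offset <= 0" in the patch */
	if (slen >= flen)
		return -EINVAL;
	offset = flen - slen;

	/* Extra check (not in the patch): @sockpath must be a suffix */
	if (strcmp(fqpath + offset, sockpath) != 0)
		return -EINVAL;

	/* Need room for the prefix plus the terminating NUL */
	if (offset + 1 > dirlen)
		return -ENAMETOOLONG;

	memcpy(dir, fqpath, offset);
	dir[offset] = '\0';
	return (int)offset;
}
```

With fqpath "/tmp/run/sock" and sockpath "sock" the prefix is "/tmp/run/",
which corresponds to the string the patch writes with length offset + 1
(including the NUL).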


* [PATCH v18 70/80] c/r: Add AF_UNIX support (v12)
@ 2009-09-23 23:51   ` Oren Laadan
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Dan Smith,
	Alexey Dobriyan, netdev, Oren Laadan

From: Dan Smith <danms@us.ibm.com>

This patch adds basic checkpoint/restart support for AF_UNIX sockets.  It
has been tested with single and multiple processes, and with data in flight
at the time of checkpoint.  It supports socketpair()s, path-based, and
abstract sockets.
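
As an illustration of the three socket flavors named above, here is a
minimal userspace sketch of helpers a test program might use to set them up
before a checkpoint.  The helper names are hypothetical and this is not the
test suite that was actually used; it only shows a socketpair() with unread
(in-flight) data, a path-based listener, and an abstract-namespace datagram
socket.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Create a connected pair and leave @msg unread in sv[1]'s queue. */
int make_pair_with_inflight(int sv[2], const char *msg)
{
	size_t n = strlen(msg);

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (write(sv[0], msg, n) != (ssize_t)n) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}
	return 0;
}

/* Path-based socket: bound to a filesystem name, then listening. */
int make_path_listener(const char *path)
{
	struct sockaddr_un addr;
	int fd;

	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;
	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);

	unlink(path);	/* remove a stale name, ignoring ENOENT */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 8) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Abstract socket: sun_path[0] is NUL and the length is explicit. */
int make_abstract_dgram(const char *name)
{
	struct sockaddr_un addr;
	size_t n = strlen(name);
	socklen_t len;
	int fd;

	if (n + 1 > sizeof(addr.sun_path))
		return -1;
	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	memcpy(addr.sun_path + 1, name, n);
	len = offsetof(struct sockaddr_un, sun_path) + 1 + n;

	if (bind(fd, (struct sockaddr *)&addr, len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A checkpoint taken while such descriptors are open would exercise the
receive-queue save path (the unread data), the bind-path handling, and the
abstract-address case, respectively.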

Changes in v12:
  - Collect sockets for leak-detection
  - Adjust socket reference count during leak detection phase

Changes in v11:
  - Create a struct socket for orphan socket during checkpoint
  - Make sockets proper objhash objects and use checkpoint_obj() on them
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - Remove struct timeval from socket header
  - Save and restore UNIX socket peer credentials
  - Set socket flags on restore using sock_setsockopt() where possible
  - Fail on the TIMESTAMPING_* flags for the moment (with a TODO)
  - Remove other explicit flag checks that are no longer copied blindly
  - Changed functions/variables names to follow existing conventions
  - Use proto_ops->{checkpoint,restart} methods for af_unix
  - Cleanup sock_file_restore()/sock_file_checkpoint()
  - Make ckpt_hdr_socket be part of ckpt_hdr_file_socket
  - Fold do_sock_file_checkpoint() into sock_file_checkpoint()
  - Fold do_sock_file_restore() into sock_file_restore()
  - Move sock_file_{checkpoint,restore} to net/checkpoint.c
  - Properly define sock_file_{checkpoint,restore} in header file
  - sock_file_restore() now calls restore_file_common()

Changes in v10:
  - Moved header structure definitions back to checkpoint_hdr.h
  - Moved AF_UNIX checkpoint/restart code to net/unix/checkpoint.c
  - Make sock_unix_*() functions only compile if CONFIG_UNIX=y
  - Add TODO for CONFIG_UNIX=m case

Changes in v9:
  - Fix double-free of skb's in the list and target holding queue in the
    error path of sock_copy_buffers()
  - Adjust use of ckpt_read_string() to match new signature

Changes in v8:
  - Fix stale dev_alloc_skb() from before the conversion to skb_clone()
  - Fix a couple of broken error paths
  - Fix memory leak of kvec.iov_base on successful return from sendmsg()
  - Fix condition for deciding when to run sock_cptrst_verify()
  - Fix buffer queue copy algorithm to hold the lock during walk(s)
  - Log the errno when either getname() or getpeer() fails
  - Add comments about ancillary messages in the UNIX queue
  - Add TODO comments for credential restore and flags via setsockopt()
  - Add TODO comment about strangely-connected dgram sockets and the use
    of sendmsg(peer)

Changes in v7:
  - Fix failure to free iov_base in error path of sock_read_buffer()
  - Change sock_read_buffer() to use _ckpt_read_obj_type() to get the
    header length and then use ckpt_kread() directly to read the payload
  - Change sock_read_buffers() to sock_unix_read_buffers() and break out
    some common functionality to better accommodate the subsequent INET
    patch
  - Generalize sock_unix_getnames() into sock_getnames() so INET can use it
  - Change skb_morph() to skb_clone() which uses the more common path and
    still avoids the copy
  - Add check to validate the socket type before creating socket
    on restore
  - Comment the CAP_NET_ADMIN override in sock_read_buffer_hdr
  - Strengthen the comment about priming the buffer limits
  - Change the objhash functions to deny direct checkpoint of sockets and
    remove the reference counting function
  - Change SOCKET_BUFFERS to SOCKET_QUEUE
  - Change this,peer objrefs to signed integers
  - Remove names from internal socket structures
  - Fix handling of sock_copy_buffers() result
  - Use ckpt_fill_fname() instead of d_path() for writing CWD
  - Use sock_getname() and sock_getpeer() for proper security hookage
  - Return -ENOSYS for unsupported socket families in checkpoint and restart
  - Use sock_setsockopt() and sock_getsockopt() where possible to save and
    restore socket option values
  - Check for SOCK_DESTROY flag in the global verify function because none
    of our supported socket types use it
  - Check for SOCK_USE_WRITE_QUEUE in AF_UNIX restore function because
    that flag should not be used on such a socket
  - Check socket state in UNIX restart path to validate the subset of valid
    values

Changes in v6:
  - Moved the socket addresses to the per-type header
  - Eliminated the HASCWD flag
  - Remove use of ckpt_write_err() in restart paths
  - Change the order in which buffers are read so that we can set the
    socket's limit equal to the size of the image's buffers (if appropriate)
    and then restore the original values afterwards.
  - Use the ckpt_validate_errno() helper
  - Add a check to make sure that we didn't restore a (UNIX) socket with
    any skb's in the send buffer
  - Fix up sock_unix_join() to not leave addr uninitialized for socketpair
  - Remove inclusion of checkpoint_hdr.h in the socket files
  - Make sock_unix_write_cwd() use ckpt_write_string() and use the new
    ckpt_read_string() for reading the cwd
  - Use the restored realcred credentials in sock_unix_join()
  - Fix error path of the chdir_and_bind
  - Change the algorithm for reloading the socket buffers to use sendmsg()
    on the socket's peer for better accounting
  - For DGRAM sockets, check the backlog value against the system max
    to avoid letting a restart bypass the overloaded queue length
  - Use sock_bind() instead of sock->ops->bind() to gain the security hook
  - Change "restart" to "restore" in some of the function names

Changes in v5:
  - Change laddr and raddr buffers in socket header to be long enough
    for INET6 addresses
  - Place socket.c and sock.h function definitions inside #ifdef
    CONFIG_CHECKPOINT
  - Add explicit check in sock_unix_makeaddr() to refuse if the
    checkpoint image specifies an addr length of 0
  - Split sock_unix_restart() into a few pieces to facilitate:
  - Changed behavior of the unix restore code so that unlinked LISTEN
    sockets don't do a bind()...unlink()
  - Save the base path of a bound socket's path so that we can chdir()
    to the base before bind() if it is a relative path
  - Call bind() for any socket that is not established but has a
    non-zero-length local address
  - Enforce the current sysctl limit on socket buffer size during restart
    unless the user holds CAP_NET_ADMIN
  - Unlink a path-based socket before calling bind()
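
The chdir()-then-bind() and unlink-before-bind behavior described in the v5
notes can be sketched in userspace as follows.  This is only an analogue
under stated assumptions: the function name chdir_and_bind() is borrowed for
illustration, and it uses open(".")/fchdir() to restore the working
directory, whereas the kernel code swaps current->fs->pwd directly.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Bind @fd to @relpath, temporarily switching to @dir first when it
 * is non-NULL.  Any stale name is unlinked before bind().
 * Returns 0 on success or a negative errno value.
 */
int chdir_and_bind(int fd, const char *dir, const char *relpath)
{
	struct sockaddr_un addr;
	int saved = -1;
	int ret = 0;

	if (strlen(relpath) >= sizeof(addr.sun_path))
		return -ENAMETOOLONG;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, relpath);

	if (dir) {
		/* Remember where we were so we can switch back */
		saved = open(".", O_RDONLY);
		if (saved < 0)
			return -errno;
		if (chdir(dir) < 0) {
			ret = -errno;
			goto out;
		}
	}

	/* A missing name is fine; any other unlink failure is not */
	if (unlink(relpath) < 0 && errno != ENOENT)
		ret = -errno;
	else if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		ret = -errno;

	/* Always restore the original directory, keeping the first error */
	if (dir && fchdir(saved) < 0 && !ret)
		ret = -errno;
out:
	if (saved >= 0)
		close(saved);
	return ret;
}
```

The working directory is restored on both the success and failure paths, so
the caller's view of relative paths is unchanged after the call.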

Changes in v4:
  - Changed the signedness of rcvlowat, rcvtimeo, sndtimeo, and backlog
    to match their struct sock definitions.  This should avoid issues
    with sign extension.
  - Add a sock_cptrst_verify() function to be run at restore time to
    validate several of the values in the checkpoint image against
    limits, flag masks, etc.
  - Write an error string with ckpt_write_err() in the obscure cases
  - Don't write socket buffers for listen sockets
  - Sanity check address lengths before we agree to allocate memory
  - Check the result of inserting the peer object in the objhash on
    restart
  - Check return value of sock_cptrst() on restart
  - Change logic in remote getname() phase of checkpoint to not fail for
    closed (et al) sockets
  - Eliminate the memory copy while reading socket buffers on restart
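
The "sanity check address lengths before we agree to allocate memory" item
can be illustrated with a small userspace sketch in the spirit of
unix_makeaddr().  The struct and function names here are hypothetical; the
zero-length rejection follows the v5 changelog note rather than the
unix_makeaddr() body shown in this patch, which only checks the upper bound.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical stand-in for the kernel's struct unix_address */
struct addr_copy {
	size_t len;
	unsigned char name[];
};

/*
 * Copy an address of image-supplied length @len.  The length is
 * validated before any allocation so that a corrupt or hostile
 * image cannot drive the allocation size.
 * Returns NULL with errno set on failure.
 */
struct addr_copy *copy_unix_addr(const struct sockaddr_un *src, size_t len)
{
	struct addr_copy *copy;

	if (len == 0 || len > sizeof(struct sockaddr_un)) {
		errno = EINVAL;
		return NULL;
	}

	copy = malloc(sizeof(*copy) + len);
	if (!copy) {
		errno = ENOMEM;
		return NULL;
	}

	memcpy(copy->name, src, len);
	copy->len = len;
	return copy;
}
```

Checking the bound first keeps the allocation size within
sizeof(struct sockaddr_un) regardless of what the checkpoint image claims.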

Changes in v3:
  - Move sock_file_checkpoint() above sock_file_restore()
  - Change __sock_file_*() functions to do_sock_file_*()
  - Adjust some of the struct cr_hdr_socket alignment
  - Improve the sock_copy_buffers() algorithm to avoid locking the source
    queue for the entire operation
  - Fix alignment in the socket header struct(s)
  - Move the per-protocol structure (ckpt_hdr_socket_un) out of the
    common socket header and read/write it separately
  - Fix missing call to sock_cptrst() in restore path
  - Break out the socket joining into another function
  - Fix failure to restore the socket address thus fixing getname()
  - Check the state values on restart
  - Fix case of state being TCP_CLOSE, which allows dgram sockets to be
    properly connected (if appropriate) to their peer and maintain the
    sockaddr for getname() operation
  - Fix restoring a listening socket that has been unlink()'d
  - Fix checkpointing sockets with an in-flight FD-passing SKB.  Fail
    with EBUSY.
  - Fix checkpointing listening sockets with an unaccepted connection.
    Fail with EBUSY.
  - Changed 'un' to 'unix' in function and structure names

Changes in v2:
  - Change GFP_KERNEL to GFP_ATOMIC in sock_copy_buffers() (this seems
    to be rather common in other uses of skb_copy())
  - Move the ckpt_hdr_socket structure definition to linux/socket.h
  - Fix whitespace issue
  - Move sock_file_checkpoint() to net/socket.c for symmetry

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: netdev@vger.kernel.org
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/files.c             |    7 +
 checkpoint/objhash.c           |   69 ++++
 include/linux/checkpoint.h     |    7 +
 include/linux/checkpoint_hdr.h |   87 +++++
 include/linux/net.h            |    2 +
 include/net/af_unix.h          |   14 +
 include/net/sock.h             |   12 +
 net/Makefile                   |    2 +
 net/checkpoint.c               |  752 ++++++++++++++++++++++++++++++++++++++++
 net/socket.c                   |    6 +-
 net/unix/Makefile              |    1 +
 net/unix/af_unix.c             |    9 +
 net/unix/checkpoint.c          |  634 +++++++++++++++++++++++++++++++++
 13 files changed, 1601 insertions(+), 1 deletions(-)
 create mode 100644 net/checkpoint.c
 create mode 100644 net/unix/checkpoint.c

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 1de89d6..058bc0e 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -22,6 +22,7 @@
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <net/sock.h>
 
 
 /**************************************************************************
@@ -591,6 +592,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_FIFO,
 		.restore = fifo_file_restore,
 	},
+	/* socket */
+	{
+		.file_name = "SOCKET",
+		.file_type = CKPT_FILE_SOCKET,
+		.restore = sock_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index bf2f761..0978060 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -20,6 +20,7 @@
 #include <linux/user_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <net/sock.h>
 
 struct ckpt_obj;
 struct ckpt_obj_ops;
@@ -234,6 +235,40 @@ static void obj_groupinfo_drop(void *ptr, int lastref)
 	put_group_info((struct group_info *) ptr);
 }
 
+static int obj_sock_grab(void *ptr)
+{
+	sock_hold((struct sock *) ptr);
+	return 0;
+}
+
+static void obj_sock_drop(void *ptr, int lastref)
+{
+	struct sock *sk = (struct sock *) ptr;
+
+	/*
+	 * Sockets created during restart are graft()ed, i.e. have a
+	 * valid @sk->sk_socket. Because only an fput() results in the
+	 * necessary sock_release(), we may leak the struct socket of
+	 * sockets that were not attached to a file. Therefore, if
+	 * @lastref is set, we hereby invoke sock_release() on sockets
+	 * that we have put into the objhash but were never attached
+	 * to a file.
+	 */
+	if (lastref && sk->sk_socket && !sk->sk_socket->file) {
+		struct socket *sock = sk->sk_socket;
+		sock_orphan(sk);
+		sock->sk = NULL;
+		sock_release(sock);
+	}
+
+	sock_put((struct sock *) ptr);
+}
+
+static int obj_sock_users(void *ptr)
+{
+	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -362,6 +397,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_groupinfo,
 		.restore = restore_groupinfo,
 	},
+	/* sock object */
+	{
+		.obj_name = "SOCKET",
+		.obj_type = CKPT_OBJ_SOCK,
+		.ref_drop = obj_sock_drop,
+		.ref_grab = obj_sock_grab,
+		.ref_users = obj_sock_users,
+		.checkpoint = checkpoint_sock,
+		.restore = restore_sock,
+	},
 };
 
 
@@ -751,6 +796,26 @@ static void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment)
  */
 
 /**
+ * obj_sock_adjust_users - remove implicit reference on DEAD sockets
+ * @obj: CKPT_OBJ_SOCK object to adjust
+ *
+ * Sockets that have been disconnected from their struct file have
+ * a reference count one less than normal sockets.  The objhash's
+ * assumption of such a reference is therefore incorrect, so we correct
+ * it here.
+ */
+static inline void obj_sock_adjust_users(struct ckpt_obj *obj)
+{
+	struct sock *sk = (struct sock *)obj->ptr;
+
+	if (sock_flag(sk, SOCK_DEAD)) {
+		obj->users--;
+		ckpt_debug("Adjusting SOCK %i count to %i\n",
+			   obj->objref, obj->users);
+	}
+}
+
+/**
  * ckpt_obj_contained - test if shared objects are contained in checkpoint
  * @ctx: checkpoint context
  *
@@ -773,6 +838,10 @@ int ckpt_obj_contained(struct ckpt_ctx *ctx)
 	hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
 		if (!obj->ops->ref_users)
 			continue;
+
+		if (obj->ops->obj_type == CKPT_OBJ_SOCK)
+			obj_sock_adjust_users(obj);
+
 		if (obj->ops->ref_users(obj->ptr) != obj->users) {
 			ckpt_debug("usage leak: %s\n", obj->ops->obj_name);
 			ckpt_write_err(ctx, "OP", "leak: usage (%d != %d (%s)",
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ec98a43..92a21b2 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -29,6 +29,7 @@
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/err.h>
+#include <net/sock.h>
 
 /* ckpt_ctx: kflags */
 #define CKPT_CTX_CHECKPOINT_BIT		0
@@ -77,6 +78,12 @@ extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 extern char *ckpt_fill_fname(struct path *path, struct path *root,
 			     char *buf, int *len);
 
+/* socket functions */
+extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
+			      struct socket *socket,
+			      struct sockaddr *loc, unsigned *loc_len,
+			      struct sockaddr *rem, unsigned *rem_len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index e4dfbd7..ac16c59 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -12,6 +12,14 @@
 
 #include <linux/types.h>
 
+#ifdef __KERNEL__
+#include <linux/socket.h>
+#include <linux/un.h>
+#else
+#include <sys/socket.h>
+#include <sys/un.h>
+#endif
+
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
@@ -92,6 +100,11 @@ enum {
 	CKPT_HDR_SIGNAL_TASK,
 	CKPT_HDR_SIGPENDING,
 
+	CKPT_HDR_SOCKET = 701,
+	CKPT_HDR_SOCKET_QUEUE,
+	CKPT_HDR_SOCKET_BUFFER,
+	CKPT_HDR_SOCKET_UNIX,
+
 	CKPT_HDR_TAIL = 9001,
 
 	CKPT_HDR_ERROR = 9999,
@@ -127,6 +140,7 @@ enum obj_type {
 	CKPT_OBJ_CRED,
 	CKPT_OBJ_USER,
 	CKPT_OBJ_GROUPINFO,
+	CKPT_OBJ_SOCK,
 	CKPT_OBJ_MAX
 };
 
@@ -353,6 +367,7 @@ enum file_type {
 	CKPT_FILE_GENERIC,
 	CKPT_FILE_PIPE,
 	CKPT_FILE_FIFO,
+	CKPT_FILE_SOCKET,
 	CKPT_FILE_MAX
 };
 
@@ -376,6 +391,78 @@ struct ckpt_hdr_file_pipe {
 	__s32 pipe_objref;
 } __attribute__((aligned(8)));
 
+/* socket */
+struct ckpt_hdr_socket {
+	struct ckpt_hdr h;
+
+	struct { /* struct socket */
+		__u64 flags;
+		__u8 state;
+	} socket __attribute__ ((aligned(8)));
+
+	struct { /* struct sock_common */
+		__u32 bound_dev_if;
+		__u32 reuse;
+		__u16 family;
+		__u8 state;
+	} sock_common __attribute__ ((aligned(8)));
+
+	struct { /* struct sock */
+		__s64 rcvlowat;
+		__u64 flags;
+
+		__s64 rcvtimeo;
+		__s64 sndtimeo;
+
+		__u32 err;
+		__u32 err_soft;
+		__u32 priority;
+		__s32 rcvbuf;
+		__s32 sndbuf;
+		__u16 type;
+		__s16 backlog;
+
+		__u8 protocol;
+		__u8 state;
+		__u8 shutdown;
+		__u8 userlocks;
+		__u8 no_check;
+
+		struct linger linger;
+	} sock __attribute__ ((aligned(8)));
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket_queue {
+	struct ckpt_hdr h;
+	__u32 skb_count;
+	__u32 total_bytes;
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_socket_buffer {
+	struct ckpt_hdr h;
+	__s32 sk_objref;
+	__s32 pr_objref;
+};
+
+#define CKPT_UNIX_LINKED 1
+struct ckpt_hdr_socket_unix {
+	struct ckpt_hdr h;
+	__s32 this;
+	__s32 peer;
+	__u32 peercred_uid;
+	__u32 peercred_gid;
+	__u32 flags;
+	__u32 laddr_len;
+	__u32 raddr_len;
+	struct sockaddr_un laddr;
+	struct sockaddr_un raddr;
+} __attribute__ ((aligned(8)));
+
+struct ckpt_hdr_file_socket {
+	struct ckpt_hdr_file common;
+	__s32 sock_objref;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/net.h b/include/linux/net.h
index b99f350..d1ce6eb 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -232,6 +232,8 @@ extern int   	     sock_sendmsg(struct socket *sock, struct msghdr *msg,
 				  size_t len);
 extern int	     sock_recvmsg(struct socket *sock, struct msghdr *msg,
 				  size_t size, int flags);
+extern int	     sock_attach_fd(struct socket *sock, struct file *file,
+				    int flags);
 extern int 	     sock_map_fd(struct socket *sock, int flags);
 extern struct socket *sockfd_lookup(int fd, int *err);
 #define		     sockfd_put(sock) fput(sock->file)
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 1614d78..e42a714 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -68,4 +68,18 @@ static inline int unix_sysctl_register(struct net *net) { return 0; }
 static inline void unix_sysctl_unregister(struct net *net) {}
 #endif
 #endif
+
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_socket;
+extern int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
+extern int unix_restore(struct ckpt_ctx *ctx, struct socket *sock,
+			struct ckpt_hdr_socket *h);
+extern int unix_collect(struct ckpt_ctx *ctx, struct socket *sock);
+
+#else
+#define unix_checkpoint NULL
+#define unix_restore NULL
+#endif /* CONFIG_CHECKPOINT */
+
 #endif
diff --git a/include/net/sock.h b/include/net/sock.h
index 12530bf..ec351f9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1646,4 +1646,16 @@ extern int sysctl_optmem_max;
 extern __u32 sysctl_wmem_default;
 extern __u32 sysctl_rmem_default;
 
+#ifdef CONFIG_CHECKPOINT
+/* Checkpoint/Restart Functions */
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern int checkpoint_sock(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_sock(struct ckpt_ctx *ctx);
+extern int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+extern struct file *sock_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *h);
+extern int sock_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#endif
+
 #endif	/* _SOCK_H */
diff --git a/net/Makefile b/net/Makefile
index ba324ae..91d12fe 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -66,3 +66,5 @@ ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_SYSCTL)		+= sysctl_net.o
 endif
 obj-$(CONFIG_WIMAX)		+= wimax/
+
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/net/checkpoint.c b/net/checkpoint.c
new file mode 100644
index 0000000..a11ec7a
--- /dev/null
+++ b/net/checkpoint.c
@@ -0,0 +1,752 @@
+/*
+ *  Copyright 2009 IBM Corporation
+ *
+ *  Author(s): Dan Smith <danms@us.ibm.com>
+ *             Oren Laadan <orenl@cs.columbia.edu>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include <linux/socket.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/syscalls.h>
+#include <linux/sched.h>
+#include <linux/fs_struct.h>
+
+#include <net/af_unix.h>
+#include <net/tcp_states.h>
+
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+struct dq_buffers {
+	struct ckpt_ctx *ctx;
+	struct sock *sk;
+};
+
+static int sock_copy_buffers(struct sk_buff_head *from,
+			     struct sk_buff_head *to,
+			     uint32_t *total_bytes)
+{
+	int count1 = 0;
+	int count2 = 0;
+	int i;
+	struct sk_buff *skb;
+	struct sk_buff **skbs;
+
+	*total_bytes = 0;
+
+	spin_lock(&from->lock);
+	skb_queue_walk(from, skb)
+		count1++;
+	spin_unlock(&from->lock);
+
+	skbs = kcalloc(count1, sizeof(*skbs), GFP_KERNEL);
+	if (!skbs)
+		return -ENOMEM;
+
+	for (i = 0; i < count1; i++) {
+		skbs[i] = dev_alloc_skb(0);
+		if (!skbs[i])
+			goto err;
+	}
+
+	i = 0;
+	spin_lock(&from->lock);
+	skb_queue_walk(from, skb) {
+		if (++count2 > count1)
+			break; /* The queue changed as we read it */
+
+		skb_morph(skbs[i], skb);
+		skbs[i]->sk = skb->sk;
+		skb_queue_tail(to, skbs[i]);
+
+		*total_bytes += skb->len;
+		i++;
+	}
+	spin_unlock(&from->lock);
+
+	if (count1 != count2)
+		goto err;
+
+	kfree(skbs);
+
+	return count1;
+ err:
+	while (skb_dequeue(to))
+		; /* Pull all the buffers out of the queue */
+	for (i = 0; i < count1; i++)
+		kfree_skb(skbs[i]);
+	kfree(skbs);
+
+	return -EAGAIN;
+}
+
+static int __sock_write_buffers(struct ckpt_ctx *ctx,
+				struct sk_buff_head *queue,
+				int dst_objref)
+{
+	struct sk_buff *skb;
+
+	skb_queue_walk(queue, skb) {
+		struct ckpt_hdr_socket_buffer *h;
+		int ret = 0;
+
+		/* FIXME: This could be a false positive for non-unix
+		 *        buffers, so add a type check here in the
+		 *        future
+		 */
+		if (UNIXCB(skb).fp) {
+			ckpt_write_err(ctx, "TE", "af_unix: pass fd", -EBUSY);
+			return -EBUSY;
+		}
+
+		/* The other ancillary messages are always present
+		 * unlike descriptors.  Even though we can't detect
+		 * them and fail the checkpoint, we're not at risk
+		 * because we don't save out (or restore) the control
+		 * information contained in the skb.
+		 */
+		h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+		if (!h)
+			return -ENOMEM;
+
+		BUG_ON(!skb->sk);
+		ret = checkpoint_obj(ctx, skb->sk, CKPT_OBJ_SOCK);
+		if (ret < 0)
+			goto end;
+		h->sk_objref = ret;
+		h->pr_objref = dst_objref;
+
+		ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+		if (ret < 0)
+			goto end;
+
+		ret = ckpt_write_obj_type(ctx, skb->data, skb->len,
+					  CKPT_HDR_BUFFER);
+	end:
+		ckpt_hdr_put(ctx, h);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int sock_write_buffers(struct ckpt_ctx *ctx,
+			      struct sk_buff_head *queue,
+			      int dst_objref)
+{
+	struct ckpt_hdr_socket_queue *h;
+	struct sk_buff_head tmpq;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
+	if (!h)
+		return -ENOMEM;
+
+	skb_queue_head_init(&tmpq);
+
+	ret = sock_copy_buffers(queue, &tmpq, &h->total_bytes);
+	if (ret < 0)
+		goto out;
+
+	h->skb_count = ret;
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (!ret)
+		ret = __sock_write_buffers(ctx, &tmpq, dst_objref);
+
+ out:
+	ckpt_hdr_put(ctx, h);
+	__skb_queue_purge(&tmpq);
+
+	return ret;
+}
+
+int sock_deferred_write_buffers(void *data)
+{
+	struct dq_buffers *dq = (struct dq_buffers *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	int ret;
+	int dst_objref;
+
+	dst_objref = ckpt_obj_lookup(ctx, dq->sk, CKPT_OBJ_SOCK);
+	if (dst_objref < 0) {
+		ckpt_write_err(ctx, "TE", "socket: owner gone?", dst_objref);
+		return dst_objref;
+	}
+
+	ret = sock_write_buffers(ctx, &dq->sk->sk_receive_queue, dst_objref);
+	ckpt_debug("write recv buffers: %i\n", ret);
+	if (ret < 0)
+		return ret;
+
+	ret = sock_write_buffers(ctx, &dq->sk->sk_write_queue, dst_objref);
+	ckpt_debug("write send buffers: %i\n", ret);
+
+	return ret;
+}
+
+int sock_defer_write_buffers(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct dq_buffers dq;
+
+	dq.ctx = ctx;
+	dq.sk = sk;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      sock_deferred_write_buffers, NULL);
+}
+
+int ckpt_sock_getnames(struct ckpt_ctx *ctx, struct socket *sock,
+		       struct sockaddr *loc, unsigned *loc_len,
+		       struct sockaddr *rem, unsigned *rem_len)
+{
+	int ret;
+
+	ret = sock_getname(sock, loc, loc_len);
+	if (ret) {
+		ckpt_write_err(ctx, "TEP", "socket: getname local", ret, sock);
+		return -EINVAL;
+	}
+
+	ret = sock_getpeer(sock, rem, rem_len);
+	if (ret) {
+		if ((sock->sk->sk_type != SOCK_DGRAM) &&
+		    (sock->sk->sk_state == TCP_ESTABLISHED)) {
+			ckpt_write_err(ctx, "TEP", "socket: getname peer",
+				       ret, sock);
+			return -EINVAL;
+		}
+		*rem_len = 0;
+	}
+
+	return 0;
+}
+
+static int sock_cptrst_verify(struct ckpt_hdr_socket *h)
+{
+	uint8_t userlocks_mask = SOCK_SNDBUF_LOCK | SOCK_RCVBUF_LOCK |
+		                 SOCK_BINDADDR_LOCK | SOCK_BINDPORT_LOCK;
+
+	if (h->sock.shutdown & ~SHUTDOWN_MASK)
+		return -EINVAL;
+	if (h->sock.userlocks & ~userlocks_mask)
+		return -EINVAL;
+	if (!ckpt_validate_errno(h->sock.err))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int sock_cptrst_opt(int op, struct socket *sock,
+			   int optname, char *opt, int len)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	if (op == CKPT_CPT)
+		ret = sock_getsockopt(sock, SOL_SOCKET, optname, opt, &len);
+	else
+		ret = sock_setsockopt(sock, SOL_SOCKET, optname, opt, len);
+
+	set_fs(fs);
+
+	return ret;
+}
+
+#define CKPT_COPY_SOPT(op, sk, name, opt) \
+	sock_cptrst_opt(op, sk->sk_socket, name, (char *)opt, sizeof(*opt))
+
+static int sock_cptrst_bufopts(int op, struct sock *sk,
+			       struct ckpt_hdr_socket *h)
+{
+	if (CKPT_COPY_SOPT(op, sk, SO_RCVBUF, &h->sock.rcvbuf))
+		if ((op == CKPT_RST) &&
+		    CKPT_COPY_SOPT(op, sk, SO_RCVBUFFORCE, &h->sock.rcvbuf)) {
+			ckpt_debug("Failed to set SO_RCVBUF");
+			return -EINVAL;
+		}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_SNDBUF, &h->sock.sndbuf))
+		if ((op == CKPT_RST) &&
+		    CKPT_COPY_SOPT(op, sk, SO_SNDBUFFORCE, &h->sock.sndbuf)) {
+			ckpt_debug("Failed to set SO_SNDBUF");
+			return -EINVAL;
+		}
+
+	/* It's silly that we have to fight ourselves here, but
+	 * sock_setsockopt() doubles the initial value, so divide here
+	 * to store the user's value and avoid doubling on restart
+	 */
+	if ((op == CKPT_CPT) && (h->sock.rcvbuf != SOCK_MIN_RCVBUF))
+		h->sock.rcvbuf >>= 1;
+
+	if ((op == CKPT_CPT) && (h->sock.sndbuf != SOCK_MIN_SNDBUF))
+		h->sock.sndbuf >>= 1;
+
+	return 0;
+}
+
+struct sock_flag_mapping {
+	int opt;
+	int flag;
+};
+
+static struct sock_flag_mapping sk_flag_map[] = {
+	{SO_OOBINLINE, SOCK_URGINLINE},
+	{SO_KEEPALIVE, SOCK_KEEPOPEN},
+	{SO_BROADCAST, SOCK_BROADCAST},
+	{SO_TIMESTAMP, SOCK_RCVTSTAMP},
+	{SO_TIMESTAMPNS, SOCK_RCVTSTAMPNS},
+	{SO_DEBUG, SOCK_DBG},
+	{SO_DONTROUTE, SOCK_LOCALROUTE},
+};
+
+static struct sock_flag_mapping sock_flag_map[] = {
+	{SO_PASSCRED, SOCK_PASSCRED},
+};
+
+static int sock_restore_flag(struct socket *sock,
+			     unsigned long *flags,
+			     int flag,
+			     int option)
+{
+	int v = 1;
+	int ret = 0;
+
+	if (test_and_clear_bit(flag, flags))
+		ret = sock_setsockopt(sock, SOL_SOCKET, option,
+				      (char *)&v, sizeof(v));
+
+	return ret;
+}
+
+
+static int sock_restore_flags(struct socket *sock, struct ckpt_hdr_socket *h)
+{
+	unsigned long sk_flags = h->sock.flags;
+	unsigned long sock_flags = h->socket.flags;
+	int ret;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(sk_flag_map); i++) {
+		int opt = sk_flag_map[i].opt;
+		int flag = sk_flag_map[i].flag;
+		ret = sock_restore_flag(sock, &sk_flags, flag, opt);
+		if (ret) {
+			ckpt_debug("Failed to set skopt %i: %i\n", opt, ret);
+			return ret;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(sock_flag_map); i++) {
+		int opt = sock_flag_map[i].opt;
+		int flag = sock_flag_map[i].flag;
+		ret = sock_restore_flag(sock, &sock_flags, flag, opt);
+		if (ret) {
+			ckpt_debug("Failed to set sockopt %i: %i\n", opt, ret);
+			return ret;
+		}
+	}
+
+	/* TODO: Handle SOCK_TIMESTAMPING_* flags */
+	if (test_bit(SOCK_TIMESTAMPING_TX_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_TX_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RX_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RX_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_SOFTWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_RAW_HARDWARE, &sk_flags) ||
+	    test_bit(SOCK_TIMESTAMPING_SYS_HARDWARE, &sk_flags)) {
+		ckpt_debug("SOF_TIMESTAMPING_* flags are not supported\n");
+		return -ENOSYS;
+	}
+
+	if (test_and_clear_bit(SOCK_DEAD, &sk_flags))
+		sock_set_flag(sock->sk, SOCK_DEAD);
+
+
+	/* Anything that is still set in the flags that isn't part of
+	 * our protocol's default set, indicates an error
+	 */
+	if (sk_flags & ~sock->sk->sk_flags) {
+		ckpt_debug("Unhandled sock flags: %lx\n", sk_flags);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int sock_copy_timeval(int op, struct sock *sk,
+			     int sockopt, __s64 *saved)
+{
+	struct timeval tv;
+
+	if (op == CKPT_CPT) {
+		if (CKPT_COPY_SOPT(op, sk, sockopt, &tv))
+			return -EINVAL;
+		*saved = timeval_to_ns(&tv);
+	} else {
+		tv = ns_to_timeval(*saved);
+		if (CKPT_COPY_SOPT(op, sk, sockopt, &tv))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int sock_cptrst(struct ckpt_ctx *ctx, struct sock *sk,
+		       struct ckpt_hdr_socket *h, int op)
+{
+	if (sk->sk_socket) {
+		CKPT_COPY(op, h->socket.state, sk->sk_socket->state);
+	}
+
+	CKPT_COPY(op, h->sock_common.bound_dev_if, sk->sk_bound_dev_if);
+	CKPT_COPY(op, h->sock_common.family, sk->sk_family);
+
+	CKPT_COPY(op, h->sock.shutdown, sk->sk_shutdown);
+	CKPT_COPY(op, h->sock.userlocks, sk->sk_userlocks);
+	CKPT_COPY(op, h->sock.no_check, sk->sk_no_check);
+	CKPT_COPY(op, h->sock.protocol, sk->sk_protocol);
+	CKPT_COPY(op, h->sock.err, sk->sk_err);
+	CKPT_COPY(op, h->sock.err_soft, sk->sk_err_soft);
+	CKPT_COPY(op, h->sock.type, sk->sk_type);
+	CKPT_COPY(op, h->sock.state, sk->sk_state);
+	CKPT_COPY(op, h->sock.backlog, sk->sk_max_ack_backlog);
+
+	if (sock_cptrst_bufopts(op, sk, h))
+		return -EINVAL;
+
+	if (CKPT_COPY_SOPT(op, sk, SO_REUSEADDR, &h->sock_common.reuse)) {
+		ckpt_debug("Failed to set SO_REUSEADDR");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_PRIORITY, &h->sock.priority)) {
+		ckpt_debug("Failed to set SO_PRIORITY");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_RCVLOWAT, &h->sock.rcvlowat)) {
+		ckpt_debug("Failed to set SO_RCVLOWAT");
+		return -EINVAL;
+	}
+
+	if (CKPT_COPY_SOPT(op, sk, SO_LINGER, &h->sock.linger)) {
+		ckpt_debug("Failed to set SO_LINGER");
+		return -EINVAL;
+	}
+
+	if (sock_copy_timeval(op, sk, SO_SNDTIMEO, &h->sock.sndtimeo)) {
+		ckpt_debug("Failed to set SO_SNDTIMEO");
+		return -EINVAL;
+	}
+
+	if (sock_copy_timeval(op, sk, SO_RCVTIMEO, &h->sock.rcvtimeo)) {
+		ckpt_debug("Failed to set SO_RCVTIMEO");
+		return -EINVAL;
+	}
+
+	if (op == CKPT_CPT) {
+		h->sock.flags = sk->sk_flags;
+		h->socket.flags = sk->sk_socket->flags;
+	} else {
+		int ret;
+		mm_segment_t old_fs;
+
+		old_fs = get_fs();
+		set_fs(KERNEL_DS);
+		ret = sock_restore_flags(sk->sk_socket, h);
+		set_fs(old_fs);
+		if (ret)
+			return ret;
+	}
+
+	if ((h->socket.state == SS_CONNECTED) &&
+	    (h->sock.state != TCP_ESTABLISHED)) {
+		ckpt_debug("socket/sock in inconsistent state: %i/%i",
+			   h->socket.state, h->sock.state);
+		return -EINVAL;
+	} else if ((h->sock.state < TCP_ESTABLISHED) ||
+		   (h->sock.state >= TCP_MAX_STATES)) {
+		ckpt_debug("sock in invalid state: %i", h->sock.state);
+		return -EINVAL;
+	} else if ((h->socket.state < SS_FREE) ||
+		   (h->socket.state > SS_DISCONNECTING)) {
+		ckpt_debug("socket in invalid state: %i",
+			   h->socket.state);
+		return -EINVAL;
+	}
+
+	if (op == CKPT_RST)
+		return sock_cptrst_verify(h);
+	else
+		return 0;
+}
+
+static int __do_sock_checkpoint(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct socket *sock = sk->sk_socket;
+	struct ckpt_hdr_socket *h;
+	int ret;
+
+	if (!sock->ops->checkpoint) {
+		ckpt_write_err(ctx, "TEVP", "socket: proto_ops",
+			       -ENOSYS, sock->ops, sock);
+		return -ENOSYS;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (!h)
+		return -ENOMEM;
+
+	/* part I: common to all sockets */
+	ret = sock_cptrst(ctx, sk, h, CKPT_CPT);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	/* part II: per socket type state */
+	ret = sock->ops->checkpoint(ctx, sock);
+	if (ret < 0)
+		goto out;
+
+	/* part III: socket buffers */
+	if ((sk->sk_state != TCP_LISTEN) && (!sock_flag(sk, SOCK_DEAD)))
+		ret = sock_defer_write_buffers(ctx, sk);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int do_sock_checkpoint(struct ckpt_ctx *ctx, struct sock *sk)
+{
+	struct socket *sock;
+	int ret;
+
+	if (sk->sk_socket)
+		return __do_sock_checkpoint(ctx, sk);
+
+	/* Temporarily adopt this orphan socket */
+	ret = sock_create(sk->sk_family, sk->sk_type, 0, &sock);
+	if (ret < 0)
+		return ret;
+	sock_graft(sk, sock);
+
+	ret = __do_sock_checkpoint(ctx, sk);
+
+	sock_orphan(sk);
+	sock->sk = NULL;
+	sock_release(sock);
+
+	return ret;
+}
+
+int checkpoint_sock(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_sock_checkpoint(ctx, (struct sock *)ptr);
+}
+
+int sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_socket *h;
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_SOCKET;
+
+	h->sock_objref = checkpoint_obj(ctx, sk, CKPT_OBJ_SOCK);
+	if (h->sock_objref < 0) {
+		ret = h->sock_objref;
+		goto out;
+	}
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int sock_collect_skbs(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
+{
+	struct sk_buff_head tmpq;
+	struct sk_buff *skb;
+	int ret = 0;
+	uint32_t bytes;
+
+	skb_queue_head_init(&tmpq);
+
+	ret = sock_copy_buffers(queue, &tmpq, &bytes);
+	if (ret < 0)
+		return ret;
+
+	skb_queue_walk(&tmpq, skb) {
+		/* Socket buffers do not maintain a ref count on their
+		 * owning sock because they're counted in sock_wmem_alloc.
+		 * So, we only need to collect sockets from the queue that
+		 * won't be collected any other way (i.e. DEAD sockets that
+		 * are hanging around only because they're waiting for us
+		 * to process their skb).
+		 */
+
+		if (!ckpt_obj_lookup(ctx, skb->sk, CKPT_OBJ_SOCK) &&
+		    sock_flag(skb->sk, SOCK_DEAD)) {
+			ret = ckpt_obj_collect(ctx, skb->sk, CKPT_OBJ_SOCK);
+			if (ret < 0)
+				break;
+		}
+	}
+
+	__skb_queue_purge(&tmpq);
+
+	return ret;
+}
+
+int sock_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	int ret;
+
+	ret = sock_collect_skbs(ctx, &sk->sk_write_queue);
+	if (ret < 0)
+		return ret;
+
+	ret = sock_collect_skbs(ctx, &sk->sk_receive_queue);
+	if (ret < 0)
+		return ret;
+
+	ret = ckpt_obj_collect(ctx, sk, CKPT_OBJ_SOCK);
+	if (ret < 0)
+		return ret;
+
+	if (sock->ops->collect)
+		ret = sock->ops->collect(ctx, sock);
+
+	return ret;
+}
+
+static struct file *sock_alloc_attach_fd(struct socket *sock)
+{
+	struct file *file;
+	int err;
+
+	file = get_empty_filp();
+	if (!file)
+		return ERR_PTR(-ENOMEM);
+
+	err = sock_attach_fd(sock, file, 0);
+	if (err < 0) {
+		put_filp(file);
+		return ERR_PTR(err);
+	}
+
+	/* Since objhash assumes the initial reference for a socket,
+	 * we bump it here for this descriptor, unlike other places in
+	 * the socket code which assume the descriptor is the owner.
+	 */
+	sock_hold(sock->sk);
+
+	return file;
+}
+
+struct sock *do_sock_restore(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_socket *h;
+	struct socket *sock;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET);
+	if (IS_ERR(h))
+		return ERR_PTR(PTR_ERR(h));
+
+	/* silently clear flags, e.g. SOCK_NONBLOCK or SOCK_CLOEXEC */
+	h->sock.type &= SOCK_TYPE_MASK;
+
+	ret = sock_create(h->sock_common.family, h->sock.type, 0, &sock);
+	if (ret < 0)
+		goto err_hdr;
+
+	if (!sock->ops->restore) {
+		ckpt_debug("proto_ops lacks restore: %pS\n", sock->ops);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	/* part II: per socket type state
+	 * (also takes care of part III: socket buffers)
+	 */
+	ret = sock->ops->restore(ctx, sock, h);
+	if (ret < 0)
+		goto err;
+
+	/* part I: common to all sockets */
+	ret = sock_cptrst(ctx, sock->sk, h, CKPT_RST);
+	if (ret < 0)
+		goto err;
+
+	ckpt_hdr_put(ctx, h);
+	return sock->sk;
+ err:
+	sock_release(sock);
+ err_hdr:
+	ckpt_hdr_put(ctx, h);
+	return ERR_PTR(ret);
+}
+
+void *restore_sock(struct ckpt_ctx *ctx)
+{
+	return do_sock_restore(ctx);
+}
+
+struct file *sock_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_socket *h = (struct ckpt_hdr_file_socket *)ptr;
+	struct sock *sk;
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE || ptr->f_type != CKPT_FILE_SOCKET)
+		return ERR_PTR(-EINVAL);
+
+	sk = ckpt_obj_fetch(ctx, h->sock_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk))
+		return ERR_PTR(PTR_ERR(sk));
+
+	file = sock_alloc_attach_fd(sk->sk_socket);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		return ERR_PTR(ret);
+	}
+
+	return file;
+}
diff --git a/net/socket.c b/net/socket.c
index 63c4498..0a4d539 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -140,6 +140,10 @@ static const struct file_operations socket_file_ops = {
 	.sendpage =	sock_sendpage,
 	.splice_write = generic_splice_sendpage,
 	.splice_read =	sock_splice_read,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint =	sock_file_checkpoint,
+	.collect =	sock_file_collect,
+#endif
 };
 
 /*
@@ -368,7 +372,7 @@ static int sock_alloc_fd(struct file **filep, int flags)
 	return fd;
 }
 
-static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
+int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
diff --git a/net/unix/Makefile b/net/unix/Makefile
index b852a2b..fbff1e6 100644
--- a/net/unix/Makefile
+++ b/net/unix/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_UNIX)	+= unix.o
 
 unix-y			:= af_unix.o garbage.o
 unix-$(CONFIG_SYSCTL)	+= sysctl_net_unix.o
+unix-$(CONFIG_CHECKPOINT) += checkpoint.o
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index fc3ebb9..b3d4f16 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -523,6 +523,9 @@ static const struct proto_ops unix_stream_ops = {
 	.recvmsg =	unix_stream_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =	unix_collect,
 };
 
 static const struct proto_ops unix_dgram_ops = {
@@ -544,6 +547,9 @@ static const struct proto_ops unix_dgram_ops = {
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =	unix_collect,
 };
 
 static const struct proto_ops unix_seqpacket_ops = {
@@ -565,6 +571,9 @@ static const struct proto_ops unix_seqpacket_ops = {
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
+	.checkpoint =	unix_checkpoint,
+	.restore =	unix_restore,
+	.collect =	unix_collect,
 };
 
 static struct proto unix_proto = {
diff --git a/net/unix/checkpoint.c b/net/unix/checkpoint.c
new file mode 100644
index 0000000..8b7cb22
--- /dev/null
+++ b/net/unix/checkpoint.c
@@ -0,0 +1,634 @@
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/fs_struct.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/user.h>
+#include <net/af_unix.h>
+#include <net/tcp_states.h>
+
+struct dq_join {
+	struct ckpt_ctx *ctx;
+	int src_objref;
+	int dst_objref;
+};
+
+struct dq_buffers {
+	struct ckpt_ctx *ctx;
+	int sk_objref; /* objref of the socket these buffers belong to */
+};
+
+#define UNIX_ADDR_EMPTY(a) ((a) <= sizeof(short))
+
+static inline int unix_need_cwd(struct sockaddr_un *addr, unsigned long len)
+{
+	return (!UNIX_ADDR_EMPTY(len)) &&
+		addr->sun_path[0] &&
+		(addr->sun_path[0] != '/');
+}
+
+static int unix_join(struct sock *src, struct sock *dst)
+{
+	if (unix_sk(src)->peer != NULL)
+		return 0; /* We're second */
+
+	sock_hold(dst);
+	unix_sk(src)->peer = dst;
+
+	return 0;
+
+}
+
+static int unix_deferred_join(void *data)
+{
+	struct dq_join *dq = (struct dq_join *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	struct sock *src;
+	struct sock *dst;
+
+	src = ckpt_obj_fetch(ctx, dq->src_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(src)) {
+		ckpt_debug("Missing src sock ref %i\n", dq->src_objref);
+		return -EINVAL;
+	}
+
+	dst = ckpt_obj_fetch(ctx, dq->dst_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(dst)) {
+		ckpt_debug("Missing dst sock ref %i\n", dq->dst_objref);
+		return -EINVAL;
+	}
+
+	return unix_join(src, dst);
+}
+
+static int unix_defer_join(struct ckpt_ctx *ctx,
+			   int src_objref,
+			   int dst_objref)
+{
+	struct dq_join dq;
+
+	dq.ctx = ctx;
+	dq.src_objref = src_objref;
+	dq.dst_objref = dst_objref;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      unix_deferred_join, NULL);
+}
+
+static int unix_write_cwd(struct ckpt_ctx *ctx,
+			  struct sock *sk, const char *sockpath)
+{
+	struct path path;
+	char *buf;
+	char *fqpath;
+	int offset;
+	int len = PATH_MAX;
+	int ret = -ENOENT;
+
+	buf = kmalloc(len, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	path.dentry = unix_sk(sk)->dentry;
+	path.mnt = unix_sk(sk)->mnt;
+
+	fqpath = ckpt_fill_fname(&path, &ctx->fs_mnt, buf, &len);
+	if (IS_ERR(fqpath)) {
+		ret = PTR_ERR(fqpath);
+		goto out;
+	}
+
+	offset = strlen(fqpath) - strlen(sockpath);
+	if (offset <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	fqpath[offset] = '\0';
+
+	ckpt_debug("writing socket directory: %s\n", fqpath);
+	ret = ckpt_write_string(ctx, fqpath, offset + 1);
+ out:
+	kfree(buf);
+	return ret;
+}
+
+int unix_checkpoint(struct ckpt_ctx *ctx, struct socket *sock)
+{
+	struct unix_sock *sk = unix_sk(sock->sk);
+	struct ckpt_hdr_socket_unix *un;
+	int new;
+	int ret = -ENOMEM;
+
+	if ((sock->sk->sk_state == TCP_LISTEN) &&
+	    !skb_queue_empty(&sock->sk->sk_receive_queue)) {
+		ckpt_write_err(ctx, "TEP", "af_unix: listen with pending peers",
+			       -EBUSY, sock);
+		return -EBUSY;
+	}
+
+	un = ckpt_hdr_get_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (!un)
+		return -ENOMEM;
+
+	ret = ckpt_sock_getnames(ctx, sock,
+				 (struct sockaddr *)&un->laddr, &un->laddr_len,
+				 (struct sockaddr *)&un->raddr, &un->raddr_len);
+	if (ret)
+		goto out;
+
+	if (sk->dentry && (sk->dentry->d_inode->i_nlink > 0))
+		un->flags |= CKPT_UNIX_LINKED;
+
+	ret = un->this = ckpt_obj_lookup_add(ctx, sk, CKPT_OBJ_SOCK, &new);
+	if (ret < 0)
+		goto out;
+
+	if (sk->peer)
+		un->peer = checkpoint_obj(ctx, sk->peer, CKPT_OBJ_SOCK);
+	else
+		un->peer = 0;
+
+	if (un->peer < 0) {
+		ret = un->peer;
+		goto out;
+	}
+
+	un->peercred_uid = sock->sk->sk_peercred.uid;
+	un->peercred_gid = sock->sk->sk_peercred.gid;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) un);
+	if (ret < 0)
+		goto out;
+
+	if (unix_need_cwd(&un->laddr, un->laddr_len))
+		ret = unix_write_cwd(ctx, sock->sk, un->laddr.sun_path);
+ out:
+	ckpt_hdr_put(ctx, un);
+
+	return ret;
+}
+
+int unix_collect(struct ckpt_ctx *ctx, struct socket *sock)
+{
+	struct unix_sock *sk = unix_sk(sock->sk);
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, sock->sk, CKPT_OBJ_SOCK);
+	if (ret < 0)
+		return ret;
+
+	if (sk->peer)
+		ret = ckpt_obj_collect(ctx, sk->peer, CKPT_OBJ_SOCK);
+
+	return ret < 0 ? ret : 0;
+}
+
+static int sock_read_buffer_sendmsg(struct ckpt_ctx *ctx,
+				    struct sockaddr *addr,
+				    unsigned int addrlen)
+{
+	struct ckpt_hdr_socket_buffer *h;
+	struct sock *sk;
+	struct msghdr msg;
+	struct kvec kvec;
+	uint8_t sock_shutdown;
+	uint8_t peer_shutdown = 0;
+	void *buf = NULL;
+	int sndbuf;
+	int len;
+	int ret = 0;
+
+	memset(&msg, 0, sizeof(msg));
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_BUFFER);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_BUFFER);
+	if (len < 0) {
+		ret = len;
+		goto out;
+	} else if (len > SKB_MAX_ALLOC) {
+		ckpt_debug("Socket buffer too big (%i > %lu)",
+			   len, SKB_MAX_ALLOC);
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	sk = ckpt_obj_fetch(ctx, h->sk_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk)) {
+		ret = PTR_ERR(sk);
+		goto out;
+	}
+
+	/* If we have neither a destination address nor a peer, but
+	 * we know the destination of this skb, then we need to join
+	 * with our peer first
+	 */
+	if (!addrlen && !unix_sk(sk)->peer) {
+		struct sock *pr;
+		pr = ckpt_obj_fetch(ctx, h->pr_objref, CKPT_OBJ_SOCK);
+		if (IS_ERR(pr)) {
+			ckpt_debug("Failed to get our peer: %li\n", PTR_ERR(pr));
+			ret = PTR_ERR(pr);
+			goto out;
+		}
+		ret = unix_join(sk, pr);
+		if (ret < 0) {
+			ckpt_debug("Failed to join: %i\n", ret);
+			goto out;
+		}
+	}
+
+	kvec.iov_len = len;
+	buf = kmalloc(len, GFP_KERNEL);
+	kvec.iov_base = buf;
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = ckpt_kread(ctx, kvec.iov_base, len);
+	if (ret < 0)
+		goto out;
+
+	msg.msg_name = addr;
+	msg.msg_namelen = addrlen;
+
+	/* If this socket is shutdown, unshutdown it while we send */
+	sock_shutdown = sk->sk_shutdown;
+	sk->sk_shutdown &= ~SHUTDOWN_MASK;
+
+	/* Unshutdown peer too, if necessary */
+	if (unix_sk(sk)->peer) {
+		peer_shutdown = unix_sk(sk)->peer->sk_shutdown;
+		unix_sk(sk)->peer->sk_shutdown &= ~SHUTDOWN_MASK;
+	}
+
+	/* Make sure there's room in the send buffer */
+	sndbuf = sk->sk_sndbuf;
+	if (((sk->sk_sndbuf - atomic_read(&sk->sk_wmem_alloc)) < len) &&
+	    capable(CAP_NET_ADMIN))
+		sk->sk_sndbuf += len;
+	else
+		sk->sk_sndbuf = sysctl_wmem_max;
+
+	ret = kernel_sendmsg(sk->sk_socket, &msg, &kvec, 1, len);
+	ckpt_debug("kernel_sendmsg(%i,%i): %i\n", h->sk_objref, len, ret);
+	if ((ret > 0) && (ret != len))
+		ret = -ENOMEM;
+
+	sk->sk_sndbuf = sndbuf;
+	sk->sk_shutdown = sock_shutdown;
+	if (peer_shutdown)
+		unix_sk(sk)->peer->sk_shutdown = peer_shutdown;
+ out:
+	ckpt_hdr_put(ctx, h);
+	kfree(buf);
+	return ret;
+}
+
+static int unix_read_buffers(struct ckpt_ctx *ctx,
+			     struct sockaddr *addr,
+			     unsigned int addrlen)
+{
+	struct ckpt_hdr_socket_queue *h;
+	int ret = 0;
+	int i;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SOCKET_QUEUE);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	for (i = 0; i < h->skb_count; i++) {
+		ret = sock_read_buffer_sendmsg(ctx, addr, addrlen);
+		ckpt_debug("read_buffer_sendmsg(%i): %i\n", i, ret);
+		if (ret < 0)
+			goto out;
+
+		if (ret > h->total_bytes) {
+			ckpt_debug("Buffers exceeded claim");
+			ret = -EINVAL;
+			goto out;
+		}
+
+		h->total_bytes -= ret;
+		ret = 0;
+	}
+
+	ret = h->skb_count;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int unix_deferred_restore_buffers(void *data)
+{
+	struct dq_buffers *dq = (struct dq_buffers *)data;
+	struct ckpt_ctx *ctx = dq->ctx;
+	struct sock *sk;
+	struct sockaddr *addr = NULL;
+	unsigned int addrlen = 0;
+	int ret;
+
+	sk = ckpt_obj_fetch(ctx, dq->sk_objref, CKPT_OBJ_SOCK);
+	if (IS_ERR(sk)) {
+		ckpt_debug("Missing sock ref %i\n", dq->sk_objref);
+		return -EINVAL;
+	}
+
+	if ((sk->sk_type == SOCK_DGRAM) && (unix_sk(sk)->addr != NULL)) {
+		addr = (struct sockaddr *)&unix_sk(sk)->addr->name;
+		addrlen = unix_sk(sk)->addr->len;
+	}
+
+	ret = unix_read_buffers(ctx, addr, addrlen);
+	ckpt_debug("read recv buffers: %i\n", ret);
+	if (ret < 0)
+		return ret;
+
+	ret = unix_read_buffers(ctx, addr, addrlen);
+	ckpt_debug("read send buffers: %i\n", ret);
+	if (ret > 0)
+		ret = -EINVAL; /* No send buffers for UNIX sockets */
+
+	return ret;
+}
+
+static int unix_defer_restore_buffers(struct ckpt_ctx *ctx, int sk_objref)
+{
+	struct dq_buffers dq;
+
+	dq.ctx = ctx;
+	dq.sk_objref = sk_objref;
+
+	/* NB: This is safe to do inside deferqueue_run() since it uses
+	 * list_for_each_safe()
+	 */
+	return deferqueue_add(ctx->files_deferq, &dq, sizeof(dq),
+			      unix_deferred_restore_buffers, NULL);
+}
+
+static struct unix_address *unix_makeaddr(struct sockaddr_un *sun_addr,
+					  unsigned len)
+{
+	struct unix_address *addr;
+
+	if (len > sizeof(struct sockaddr_un))
+		return ERR_PTR(-EINVAL);
+
+	addr = kmalloc(sizeof(*addr) + len, GFP_KERNEL);
+	if (!addr)
+		return ERR_PTR(-ENOMEM);
+
+	memcpy(addr->name, sun_addr, len);
+	addr->len = len;
+	atomic_set(&addr->refcnt, 1);
+
+	return addr;
+}
+
+static int unix_restore_connected(struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_socket *h,
+				  struct ckpt_hdr_socket_unix *un,
+				  struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct sockaddr *addr = NULL;
+	unsigned long flags = h->sock.flags;
+	unsigned int addrlen = 0;
+	int dead = test_bit(SOCK_DEAD, &flags);
+	int ret = 0;
+
+
+	if (un->peer == 0) {
+		/* These get propagated to the msghdr, so only set them
+		 * if we're not connected to a peer, else we'll get an error
+		 * when we sendmsg()
+		 */
+		addr = (struct sockaddr *)&un->laddr;
+		addrlen = un->laddr_len;
+	}
+
+	sk->sk_peercred.pid = task_tgid_vnr(current);
+
+	if (may_setuid(ctx->realcred->user->user_ns, un->peercred_uid) &&
+	    may_setgid(un->peercred_gid)) {
+		sk->sk_peercred.uid = un->peercred_uid;
+		sk->sk_peercred.gid = un->peercred_gid;
+	} else {
+		ckpt_debug("peercred %i:%i would require setuid",
+			   un->peercred_uid, un->peercred_gid);
+		return -EPERM;
+	}
+
+	if (!dead && (un->peer > 0)) {
+		ret = unix_defer_join(ctx, un->this, un->peer);
+		ckpt_debug("unix_defer_join: %i\n", ret);
+	}
+
+	if (!dead && !ret)
+		ret = unix_defer_restore_buffers(ctx, un->this);
+
+	return ret;
+}
+
+static int unix_unlink(const char *name)
+{
+	struct path spath;
+	struct path ppath;
+	int ret;
+
+	ret = kern_path(name, 0, &spath);
+	if (ret)
+		return ret;
+
+	ret = kern_path(name, LOOKUP_PARENT, &ppath);
+	if (ret)
+		goto out_s;
+
+	if (!spath.dentry) {
+		ckpt_debug("No dentry found for %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	if (!ppath.dentry || !ppath.dentry->d_inode) {
+		ckpt_debug("No inode for parent of %s\n", name);
+		ret = -ENOENT;
+		goto out_p;
+	}
+
+	ret = vfs_unlink(ppath.dentry->d_inode, spath.dentry);
+ out_p:
+	path_put(&ppath);
+ out_s:
+	path_put(&spath);
+
+	return ret;
+}
+
+/* Call bind() for socket, optionally changing (temporarily) to @path first
+ * if non-NULL
+ */
+static int unix_chdir_and_bind(struct socket *sock,
+			       const char *path,
+			       struct sockaddr *addr,
+			       unsigned long addrlen)
+{
+	struct sockaddr_un *un = (struct sockaddr_un *)addr;
+	struct path cur = { .mnt = NULL, .dentry = NULL };
+	struct path dir = { .mnt = NULL, .dentry = NULL };
+	int ret;
+
+	if (path) {
+		ckpt_debug("switching to cwd %s for unix bind", path);
+
+		ret = kern_path(path, 0, &dir);
+		if (ret)
+			return ret;
+
+		ret = inode_permission(dir.dentry->d_inode,
+				       MAY_EXEC | MAY_ACCESS);
+		if (ret)
+			goto out;
+
+		write_lock(&current->fs->lock);
+		cur = current->fs->pwd;
+		current->fs->pwd = dir;
+		write_unlock(&current->fs->lock);
+	}
+
+	ret = unix_unlink(un->sun_path);
+	ckpt_debug("unlink(%s): %i\n", un->sun_path, ret);
+	if ((ret == 0) || (ret == -ENOENT))
+		ret = sock_bind(sock, addr, addrlen);
+
+	if (path) {
+		write_lock(&current->fs->lock);
+		current->fs->pwd = cur;
+		write_unlock(&current->fs->lock);
+	}
+ out:
+	if (path)
+		path_put(&dir);
+
+	return ret;
+}
+
+static int unix_fakebind(struct socket *sock,
+			 struct sockaddr_un *addr, unsigned long len)
+{
+	struct unix_address *uaddr;
+
+	uaddr = unix_makeaddr(addr, len);
+	if (IS_ERR(uaddr))
+		return PTR_ERR(uaddr);
+
+	unix_sk(sock->sk)->addr = uaddr;
+
+	return 0;
+}
+
+static int unix_restore_bind(struct ckpt_hdr_socket *h,
+			     struct ckpt_hdr_socket_unix *un,
+			     struct socket *sock,
+			     const char *path)
+{
+	struct sockaddr *addr = (struct sockaddr *)&un->laddr;
+	unsigned long len = un->laddr_len;
+	unsigned long flags = h->sock.flags;
+	int dead = test_bit(SOCK_DEAD, &flags);
+
+	if (dead)
+		return unix_fakebind(sock, &un->laddr, len);
+	else if (!un->laddr.sun_path[0])
+		return sock_bind(sock, addr, len);
+	else if (!(un->flags & CKPT_UNIX_LINKED))
+		return unix_fakebind(sock, &un->laddr, len);
+	else
+		return unix_chdir_and_bind(sock, path, addr, len);
+}
+
+/* Some easy pre-flight checks before we get underway */
+static int unix_precheck(struct socket *sock, struct ckpt_hdr_socket *h)
+{
+	struct net *net = sock_net(sock->sk);
+	unsigned long sk_flags = h->sock.flags;
+
+	if ((h->socket.state == SS_CONNECTING) ||
+	    (h->socket.state == SS_DISCONNECTING) ||
+	    (h->socket.state == SS_FREE)) {
+		ckpt_debug("AF_UNIX socket can't be SS_(DIS)CONNECTING\n");
+		return -EINVAL;
+	}
+
+	/* AF_UNIX overloads the backlog setting to define the maximum
+	 * queue length for DGRAM sockets.  Make sure we don't let the
+	 * caller exceed that value on restart.
+	 */
+	if ((h->sock.type == SOCK_DGRAM) &&
+	    (h->sock.backlog > net->unx.sysctl_max_dgram_qlen)) {
+		ckpt_debug("DGRAM backlog of %i exceeds system max of %i\n",
+			   h->sock.backlog, net->unx.sysctl_max_dgram_qlen);
+		return -EINVAL;
+	}
+
+	if (test_bit(SOCK_USE_WRITE_QUEUE, &sk_flags)) {
+		ckpt_debug("AF_UNIX socket has SOCK_USE_WRITE_QUEUE set\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int unix_restore(struct ckpt_ctx *ctx, struct socket *sock,
+		 struct ckpt_hdr_socket *h)
+
+{
+	struct ckpt_hdr_socket_unix *un;
+	int ret = -EINVAL;
+	char *cwd = NULL;
+
+	ret = unix_precheck(sock, h);
+	if (ret)
+		return ret;
+
+	un = ckpt_read_obj_type(ctx, sizeof(*un), CKPT_HDR_SOCKET_UNIX);
+	if (IS_ERR(un))
+		return PTR_ERR(un);
+
+	if (un->peer < 0) {
+		/* ret is 0 after unix_precheck(); a negative objref is invalid */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (unix_need_cwd(&un->laddr, un->laddr_len)) {
+		cwd = ckpt_read_string(ctx, PATH_MAX);
+		if (IS_ERR(cwd)) {
+			ret = PTR_ERR(cwd);
+			goto out;
+		}
+	}
+
+	if ((h->sock.state != TCP_ESTABLISHED) &&
+	    !UNIX_ADDR_EMPTY(un->laddr_len)) {
+		ret = unix_restore_bind(h, un, sock, cwd);
+		if (ret)
+			goto out;
+	}
+
+	if ((h->sock.state == TCP_ESTABLISHED) || (h->sock.state == TCP_CLOSE))
+		ret = unix_restore_connected(ctx, h, un, sock);
+	else if (h->sock.state == TCP_LISTEN)
+		ret = sock->ops->listen(sock, h->sock.backlog);
+	else
+		ckpt_debug("unsupported UNIX socket state %i\n", h->sock.state);
+ out:
+	ckpt_hdr_put(ctx, un);
+	kfree(cwd);
+	return ret;
+}
-- 
1.6.0.4
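
The choice made by unix_restore_bind() in the patch above can be read
as a small decision table. The following is only a user-space sketch
of that table, not the kernel code: choose_bind() and the bind_action
enum are made-up names, and the three inputs stand for the SOCK_DEAD
bit, the saved sun_path, and the CKPT_UNIX_LINKED flag from the image.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three strategies used by the patch. */
enum bind_action {
	BIND_FAKE,	/* only attach the address (unix_fakebind) */
	BIND_PLAIN,	/* plain sock_bind(), no path handling */
	BIND_CHDIR,	/* unlink old path, bind from the saved cwd */
};

/*
 * sun_path: saved socket path; a leading NUL byte denotes an
 *           abstract-namespace address on Linux
 * dead:     SOCK_DEAD was set in the checkpointed socket flags
 * linked:   CKPT_UNIX_LINKED was set in the checkpoint image
 */
static enum bind_action choose_bind(const char *sun_path, bool dead,
				    bool linked)
{
	assert(sun_path != NULL);

	if (dead)
		return BIND_FAKE;
	if (sun_path[0] == '\0')
		return BIND_PLAIN;
	if (!linked)
		return BIND_FAKE;
	return BIND_CHDIR;
}
```

The order of the tests matters: a dead socket is never bound for real,
even if its path is still marked as linked.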


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 71/80] c/r: [pty 1/2] allow allocation of desired pty slave
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (69 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 70/80] c/r: Add AF_UNIX support (v12) Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 72/80] c/r: [pty 2/2] support for pseudo terminals Oren Laadan
                     ` (10 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

During restart, we need to allocate pty slaves with the same
identifiers as recorded during checkpoint. Modify the allocation code
to allow an in-kernel caller to request a specific slave identifier.

For this, add a new field to task_struct - 'required_id'. It will
hold the desired identifier when restoring a (master) pty.

The code in ptmx_open() skips the actual open for tasks that are
restarting (PF_RESTARTING), leaving a file with no tty attached. The
restart code then calls pty_open_by_index() with the recorded index;
an index of UNSPECIFIED_PTY_INDEX (-1) means "no preference".
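
The allocation rule that the diff below adds to devpts_new_index() can
be modelled in user space roughly as follows. This is a sketch under
stated assumptions, not the kernel code: idpool, idpool_alloc() and
POOL_SIZE are made-up names, a boolean array stands in for the
kernel's ida, and POOL_SIZE plays the role of pty_limit. Rejecting
negative requests with -EINVAL is also an addition of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE	16	/* hypothetical stand-in for pty_limit */
#define UNSPECIFIED_IDX	(-1)	/* mirrors UNSPECIFIED_PTY_INDEX */

struct idpool {
	bool used[POOL_SIZE];
};

/*
 * Returns the allocated index, or a negative errno:
 *   -EBUSY  a specific index was requested but is already taken
 *   -EIO    no index below POOL_SIZE is available
 */
static int idpool_alloc(struct idpool *p, int req_idx)
{
	int start = (req_idx == UNSPECIFIED_IDX) ? 0 : req_idx;
	int i;

	assert(p != NULL);
	if (start < 0)
		return -EINVAL;

	/* first free slot at or above 'start', like ida_get_new_above() */
	for (i = start; i < POOL_SIZE && p->used[i]; i++)
		;

	/* a specific index was requested but a different one came back */
	if (req_idx != UNSPECIFIED_IDX && i != req_idx)
		return -EBUSY;
	if (i >= POOL_SIZE)
		return -EIO;

	p->used[i] = true;
	return i;
}
```

An unspecified request keeps the old "lowest free index" behaviour,
while a specific request either gets exactly that index or fails.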

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 drivers/char/pty.c        |   65 +++++++++++++++++++++++++++++++++++++++++---
 drivers/char/tty_io.c     |    4 +-
 fs/devpts/inode.c         |   13 +++++++--
 include/linux/devpts_fs.h |    6 +++-
 include/linux/sched.h     |    1 +
 include/linux/tty.h       |    2 +
 6 files changed, 80 insertions(+), 11 deletions(-)

diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index b33d668..e2fef99 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -612,9 +612,10 @@ static const struct tty_operations pty_unix98_ops = {
 };
 
 /**
- *	ptmx_open		-	open a unix 98 pty master
+ *	__ptmx_open		-	open a unix 98 pty master
  *	@inode: inode of device file
  *	@filp: file pointer to tty
+ *	@index: desired slave index
  *
  *	Allocate a unix98 pty master device from the ptmx driver.
  *
@@ -623,16 +624,15 @@ static const struct tty_operations pty_unix98_ops = {
  *		allocated_ptys_lock handles the list of free pty numbers
  */
 
-static int __ptmx_open(struct inode *inode, struct file *filp)
+static int __ptmx_open(struct inode *inode, struct file *filp, int index)
 {
 	struct tty_struct *tty;
 	int retval;
-	int index;
 
 	nonseekable_open(inode, filp);
 
 	/* find a device that is not in use. */
-	index = devpts_new_index(inode);
+	index = devpts_new_index(inode, index);
 	if (index < 0)
 		return index;
 
@@ -668,12 +668,66 @@ static int ptmx_open(struct inode *inode, struct file *filp)
 {
 	int ret;
 
+#ifdef CONFIG_CHECKPOINT
+	/*
+	 * If current task is restarting, we skip the actual open.
+	 * Instead, leave it up to the caller (restart code) to invoke
+	 * __ptmx_open() with the desired pty index request.
+	 *
+	 * NOTE: this gives a half-baked file that has ptmx f_op but
+	 * the tty (private_data) is NULL. It is the responsibility of
+	 * the _caller_ to ensure proper initialization before
+	 * allowing it to be used (ptmx_release() tolerates NULL tty).
+	 */
+	if (current->flags & PF_RESTARTING)
+		return 0;
+#endif
+
 	lock_kernel();
-	ret = __ptmx_open(inode, filp);
+	ret = __ptmx_open(inode, filp, UNSPECIFIED_PTY_INDEX);
 	unlock_kernel();
 	return ret;
 }
 
+static int ptmx_release(struct inode *inode, struct file *filp)
+{
+#ifdef CONFIG_CHECKPOINT
+	/*
+	 * It is possible for a restart to create a half-baked
+	 * ptmx file - see ptmx_open(). In that case there is no
+	 * tty (private_data) and nothing to do.
+	 */
+	if (!filp->private_data)
+		return 0;
+#endif
+
+	return tty_release(inode, filp);
+}
+
+struct file *pty_open_by_index(char *ptmxpath, int index)
+{
+	struct file *ptmxfile;
+	int ret;
+
+	/*
+	 * We need to pick a way to specify which devpts mountpoint to
+	 * use. For now, we'll just use whatever /dev/ptmx points to.
+	 */
+	ptmxfile = filp_open(ptmxpath, O_RDWR|O_NOCTTY, 0);
+	if (IS_ERR(ptmxfile))
+		return ptmxfile;
+
+	lock_kernel();
+	ret = __ptmx_open(ptmxfile->f_dentry->d_inode, ptmxfile, index);
+	unlock_kernel();
+	if (ret) {
+		fput(ptmxfile);
+		return ERR_PTR(ret);
+	}
+
+	return ptmxfile;
+}
+
 static struct file_operations ptmx_fops;
 
 static void __init unix98_pty_init(void)
@@ -730,6 +784,7 @@ static void __init unix98_pty_init(void)
 	/* Now create the /dev/ptmx special device */
 	tty_default_fops(&ptmx_fops);
 	ptmx_fops.open = ptmx_open;
+	ptmx_fops.release = ptmx_release;
 
 	cdev_init(&ptmx_cdev, &ptmx_fops);
 	if (cdev_add(&ptmx_cdev, MKDEV(TTYAUX_MAJOR, 2), 1) ||
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index a3afa0c..7853ea2 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -142,7 +142,7 @@ ssize_t redirected_tty_write(struct file *, const char __user *,
 							size_t, loff_t *);
 static unsigned int tty_poll(struct file *, poll_table *);
 static int tty_open(struct inode *, struct file *);
-static int tty_release(struct inode *, struct file *);
+int tty_release(struct inode *, struct file *);
 long tty_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
 #ifdef CONFIG_COMPAT
 static long tty_compat_ioctl(struct file *file, unsigned int cmd,
@@ -1846,7 +1846,7 @@ static int tty_open(struct inode *inode, struct file *filp)
  *		Takes bkl. See tty_release_dev
  */
 
-static int tty_release(struct inode *inode, struct file *filp)
+int tty_release(struct inode *inode, struct file *filp)
 {
 	lock_kernel();
 	tty_release_dev(filp);
diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 75efb02..82a2160 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -433,11 +433,11 @@ static struct file_system_type devpts_fs_type = {
  * to the System V naming convention
  */
 
-int devpts_new_index(struct inode *ptmx_inode)
+int devpts_new_index(struct inode *ptmx_inode, int req_idx)
 {
 	struct super_block *sb = pts_sb_from_inode(ptmx_inode);
 	struct pts_fs_info *fsi = DEVPTS_SB(sb);
-	int index;
+	int index = req_idx;
 	int ida_ret;
 
 retry:
@@ -445,7 +445,9 @@ retry:
 		return -ENOMEM;
 
 	mutex_lock(&allocated_ptys_lock);
-	ida_ret = ida_get_new(&fsi->allocated_ptys, &index);
+	if (index == UNSPECIFIED_PTY_INDEX)
+		index = 0;
+	ida_ret = ida_get_new_above(&fsi->allocated_ptys, index, &index);
 	if (ida_ret < 0) {
 		mutex_unlock(&allocated_ptys_lock);
 		if (ida_ret == -EAGAIN)
@@ -453,6 +455,11 @@ retry:
 		return -EIO;
 	}
 
+	if (req_idx != UNSPECIFIED_PTY_INDEX && index != req_idx) {
+		ida_remove(&fsi->allocated_ptys, index);
+		mutex_unlock(&allocated_ptys_lock);
+		return -EBUSY;
+	}
 	if (index >= pty_limit) {
 		ida_remove(&fsi->allocated_ptys, index);
 		mutex_unlock(&allocated_ptys_lock);
diff --git a/include/linux/devpts_fs.h b/include/linux/devpts_fs.h
index 5ce0e5f..163a70e 100644
--- a/include/linux/devpts_fs.h
+++ b/include/linux/devpts_fs.h
@@ -15,9 +15,13 @@
 
 #include <linux/errno.h>
 
+#define UNSPECIFIED_PTY_INDEX (-1)
+
 #ifdef CONFIG_UNIX98_PTYS
 
-int devpts_new_index(struct inode *ptmx_inode);
+struct file *pty_open_by_index(char *ptmxpath, int index);
+
+int devpts_new_index(struct inode *ptmx_inode, int req_idx);
 void devpts_kill_index(struct inode *ptmx_inode, int idx);
 /* mknod in devpts */
 int devpts_pty_new(struct inode *ptmx_inode, struct tty_struct *tty);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 91b57db..0ab9553 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1482,6 +1482,7 @@ struct task_struct {
 #endif /* CONFIG_TRACING */
 #ifdef CONFIG_CHECKPOINT
 	struct ckpt_ctx *checkpoint_ctx;
+	unsigned long required_id;
 #endif
 };
 
diff --git a/include/linux/tty.h b/include/linux/tty.h
index e8c6c91..fd40561 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -468,6 +468,8 @@ extern void tty_ldisc_begin(void);
 /* This last one is just for the tty layer internals and shouldn't be used elsewhere */
 extern void tty_ldisc_enable(struct tty_struct *tty);
 
+/* This one is for ptmx_release() */
+extern int tty_release(struct inode *inode, struct file *filp);
 
 /* n_tty.c */
 extern struct tty_ldisc_ops tty_ldisc_N_TTY;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread


* [PATCH v18 72/80] c/r: [pty 2/2] support for pseudo terminals
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (70 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 71/80] c/r: [pty 1/2] allow allocation of desired pty slave Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 73/80] c/r: correctly restore pgid Oren Laadan
                     ` (9 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

This patch adds support for checkpoint and restart of pseudo terminals
(PTYs). Since PTYs are shared (pointed to by file, and signal), they
are managed via objhash.

PTYs are master/slave pairs; the code arranges for the master to
always be checkpointed first, followed by the slave. This is important
since during restart both ends are created when restoring the master.

In this patch only UNIX98 style PTYs are supported.

Currently only PTYs that are referenced by open files are handled.
Thus PTY checkpoint starts with a file, in tty_file_checkpoint(). It
first checkpoints the master and slave PTYs via tty_checkpoint(),
and then completes the saving of the file descriptor. This means that
in the image file, the order of objects is: master-tty, slave-tty,
file-desc.
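
The record order described above can be illustrated with a small
user-space model. It is a sketch only: the image, pty_pair and emit()
names are invented here, and the 'saved' flag stands in for what the
objhash lookup does in the real code (a shared tty is written once).

```c
#include <assert.h>
#include <stdbool.h>

enum rec_type { REC_TTY_MASTER, REC_TTY_SLAVE, REC_FILE };

#define MAX_RECS 16	/* hypothetical image capacity */

struct image {
	enum rec_type recs[MAX_RECS];
	int nrecs;
};

/* One master/slave pair; 'saved' plays the role of an objhash hit. */
struct pty_pair {
	bool saved;
};

static void emit(struct image *img, enum rec_type type)
{
	assert(img->nrecs < MAX_RECS);
	img->recs[img->nrecs++] = type;
}

/* Save both ends once, master first, whichever end the file refers to. */
static void pty_pair_checkpoint(struct image *img, struct pty_pair *pair)
{
	if (pair->saved)
		return;
	emit(img, REC_TTY_MASTER);
	emit(img, REC_TTY_SLAVE);
	pair->saved = true;
}

/* A tty-backed file: tty state first, then the file record itself. */
static void tty_file_checkpoint_sketch(struct image *img,
				       struct pty_pair *pair)
{
	pty_pair_checkpoint(img, pair);
	emit(img, REC_FILE);
}
```

Calling tty_file_checkpoint_sketch() twice for the same pair yields
master-tty, slave-tty, file-desc, file-desc: the second file finds the
pair already saved and adds only its own record.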

During restart, to restore the master side, we open the /dev/ptmx
device and get a file handle. But at this point we don't know the
designated objref for this file, because the file is due later on in
the image stream. On the other hand, we can't just fput() the file
because it will close the PTY too.

Instead, when we checkpoint the master PTY, we _reserve_ an objref
for the file (which won't be further used in checkpoint). Then at
restart, we use it to insert the file into the objhash.
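
A toy version of this reserve-then-insert scheme, assuming a flat
table indexed by objref, might look like the following. The names
objtable, obj_reserve(), obj_insert() and obj_lookup() are invented
for the sketch and are not the objhash API of this patch set.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 32	/* hypothetical table size */

struct objtable {
	void *obj[MAX_OBJS];	/* indexed by objref; slot 0 is unused */
	int next_ref;		/* next objref to hand out */
};

/* Checkpoint side: hand out a fresh objref with no object bound to it. */
static int obj_reserve(struct objtable *t)
{
	assert(t != NULL);
	if (t->next_ref == 0)
		t->next_ref = 1;
	if (t->next_ref >= MAX_OBJS)
		return -1;
	return t->next_ref++;
}

/* Restart side: bind an object to an objref read from the image. */
static int obj_insert(struct objtable *t, int ref, void *ptr)
{
	assert(t != NULL);
	if (ref <= 0 || ref >= MAX_OBJS || t->obj[ref] != NULL)
		return -1;
	t->obj[ref] = ptr;
	return 0;
}

/* Restart side: resolve an objref seen later in the image stream. */
static void *obj_lookup(const struct objtable *t, int ref)
{
	assert(t != NULL);
	if (ref <= 0 || ref >= MAX_OBJS)
		return NULL;
	return t->obj[ref];
}
```

In this model the master-tty record would carry the reserved objref,
the restart code would insert the freshly opened ptmx file under it,
and the later file-desc record would resolve to that same file through
obj_lookup() without opening the device a second time.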

TODO:

* Better sanitize input from checkpoint image on restore
* Check the locking when saving/restoring tty_struct state
* Echo position/buffer isn't saved (is it needed?)
* Handle multiple devpts mounts (namespaces)
* Paths of ptmx and slaves are hard-coded (/dev/ptmx, /dev/pts/...)

Changelog[v4]:
  - Fix error path(s) in restore_tty_ldisc()
  - Fix memory leak in restore_tty_ldisc()
Changelog[v3]:
  - [Serge Hallyn] Set tty on error path
Changelog[v2]:
  - Don't save/restore tty->{session,pgrp}
  - Fix leak: drop file reference after ckpt_obj_insert()
  - Move get_file() inside locked clause (fix race)
Changelog[v1]:
  - Adjust include/asm/checkpoint_hdr.h for s390 architecture
  - Add NCC to kernel constants header (ckpt_hdr_const)
  - [Serge Hallyn] fix calculation of canon_datalen

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/include/asm/checkpoint_hdr.h |   11 +
 arch/x86/include/asm/checkpoint_hdr.h  |   11 +
 checkpoint/checkpoint.c                |    3 +
 checkpoint/files.c                     |    6 +
 checkpoint/objhash.c                   |   26 ++
 checkpoint/restart.c                   |    6 +
 drivers/char/pty.c                     |    1 +
 drivers/char/tty_io.c                  |  499 ++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h             |    4 +
 include/linux/checkpoint_hdr.h         |   85 ++++++
 include/linux/tty.h                    |    7 +
 11 files changed, 659 insertions(+), 0 deletions(-)

diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
index 1976355..b6ea8ce 100644
--- a/arch/s390/include/asm/checkpoint_hdr.h
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -83,13 +83,24 @@ struct ckpt_hdr_mm_context {
 };
 
 #define CKPT_ARCH_NSIG  64
+#define CKPT_TTY_NCC  8
+
+/* arch dependent constants */
 #ifdef __KERNEL__
+
 #include <asm/signal.h>
 #if CKPT_ARCH_NSIG != _SIGCONTEXT_NSIG
 #error CKPT_ARCH_NSIG size is wrong (asm/sigcontext.h and asm/checkpoint_hdr.h)
 #endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
 #endif
 
+#endif /* __KERNEL__ */
+
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 };
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 1228d1b..7a24de5 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -48,14 +48,25 @@ enum {
 	CKPT_HDR_MM_CONTEXT_LDT,
 };
 
+/* arch dependent constants */
 #define CKPT_ARCH_NSIG  64
+#define CKPT_TTY_NCC  8
+
 #ifdef __KERNEL__
+
 #include <asm/signal.h>
 #if CKPT_ARCH_NSIG != _NSIG
 #error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
 #endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
 #endif
 
+#endif /* __KERNEL__ */
+
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 	/* FIXME: add HAVE_HWFP */
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index ae79df7..dbe9e10 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -299,6 +299,9 @@ static void fill_kernel_const(struct ckpt_const *h)
 	h->uts_domainname_len = sizeof(uts->domainname);
 	/* rlimit */
 	h->rlimit_nlimits = RLIM_NLIMITS;
+	/* tty */
+	h->n_tty_buf_size = N_TTY_BUF_SIZE;
+	h->tty_termios_ncc = NCC;
 }
 
 /* write the checkpoint header */
diff --git a/checkpoint/files.c b/checkpoint/files.c
index 058bc0e..27e29a0 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -598,6 +598,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_SOCKET,
 		.restore = sock_file_restore,
 	},
+	/* tty */
+	{
+		.file_name = "TTY",
+		.file_type = CKPT_FILE_TTY,
+		.restore = tty_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 0978060..f84388d 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -269,6 +269,22 @@ static int obj_sock_users(void *ptr)
 	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
 }
 
+static int obj_tty_grab(void *ptr)
+{
+	tty_kref_get((struct tty_struct *) ptr);
+	return 0;
+}
+
+static void obj_tty_drop(void *ptr, int lastref)
+{
+	tty_kref_put((struct tty_struct *) ptr);
+}
+
+static int obj_tty_users(void *ptr)
+{
+	return atomic_read(&((struct tty_struct *) ptr)->kref.refcount);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -407,6 +423,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_sock,
 		.restore = restore_sock,
 	},
+	/* struct tty_struct */
+	{
+		.obj_name = "TTY",
+		.obj_type = CKPT_OBJ_TTY,
+		.ref_drop = obj_tty_drop,
+		.ref_grab = obj_tty_grab,
+		.ref_users = obj_tty_users,
+		.checkpoint = checkpoint_tty,
+		.restore = restore_tty,
+	},
 };
 
 
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 340698a..1016278 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -19,6 +19,7 @@
 #include <linux/freezer.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
+#include <linux/termios.h>
 #include <asm/syscall.h>
 #include <linux/elf.h>
 #include <linux/deferqueue.h>
@@ -402,6 +403,11 @@ static int check_kernel_const(struct ckpt_const *h)
 	/* rlimit */
 	if (h->rlimit_nlimits != RLIM_NLIMITS)
 		return -EINVAL;
+	/* tty */
+	if (h->n_tty_buf_size != N_TTY_BUF_SIZE)
+		return -EINVAL;
+	if (h->tty_termios_ncc != NCC)
+		return -EINVAL;
 
 	return 0;
 }
diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index e2fef99..5fb4ec5 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -15,6 +15,7 @@
 
 #include <linux/errno.h>
 #include <linux/interrupt.h>
+#include <linux/file.h>
 #include <linux/tty.h>
 #include <linux/tty_flip.h>
 #include <linux/fcntl.h>
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 7853ea2..72f4432 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -106,6 +106,7 @@
 
 #include <linux/kmod.h>
 #include <linux/nsproxy.h>
+#include <linux/checkpoint.h>
 
 #undef TTY_DEBUG_HANGUP
 
@@ -151,6 +152,13 @@ static long tty_compat_ioctl(struct file *file, unsigned int cmd,
 #define tty_compat_ioctl NULL
 #endif
 static int tty_fasync(int fd, struct file *filp, int on);
+#ifdef CONFIG_CHECKPOINT
+static int tty_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+static int tty_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#else
+#define tty_file_checkpoint NULL
+#define tty_file_collect NULL
+#endif /* CONFIG_CHECKPOINT */
 static void release_tty(struct tty_struct *tty, int idx);
 static void __proc_set_tty(struct task_struct *tsk, struct tty_struct *tty);
 static void proc_set_tty(struct task_struct *tsk, struct tty_struct *tty);
@@ -417,6 +425,8 @@ static const struct file_operations tty_fops = {
 	.open		= tty_open,
 	.release	= tty_release,
 	.fasync		= tty_fasync,
+	.checkpoint	= tty_file_checkpoint,
+	.collect	= tty_file_collect,
 };
 
 static const struct file_operations console_fops = {
@@ -439,6 +449,8 @@ static const struct file_operations hung_up_tty_fops = {
 	.unlocked_ioctl	= hung_up_tty_ioctl,
 	.compat_ioctl	= hung_up_tty_compat_ioctl,
 	.release	= tty_release,
+	.checkpoint	= tty_file_checkpoint,
+	.collect	= tty_file_collect,
 };
 
 static DEFINE_SPINLOCK(redirect_lock);
@@ -2586,6 +2598,493 @@ static long tty_compat_ioctl(struct file *file, unsigned int cmd,
 }
 #endif
 
+#ifdef CONFIG_CHECKPOINT
+static int tty_can_checkpoint(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	/* only support pty driver */
+	if (tty->driver->type != TTY_DRIVER_TYPE_PTY) {
+		ckpt_write_err(ctx, "TSP", "tty: unknown driver type %d",
+			       tty->driver, tty, tty->driver->type);
+		return 0;
+	}
+	/* only support unix98 style */
+	if (tty->driver->major != UNIX98_PTY_MASTER_MAJOR &&
+	    tty->driver->major != UNIX98_PTY_SLAVE_MAJOR) {
+		ckpt_write_err(ctx, "TP", "tty: legacy pty", tty);
+		return 0;
+	}
+	/* only support n_tty ldisc */
+	if (tty->ldisc->ops->num != N_TTY) {
+		ckpt_write_err(ctx, "TSP", "tty: unknown ldisc type %d",
+			       tty->ldisc->ops, tty, tty->ldisc->ops->num);
+		return 0;
+	}
+
+	return 1;
+}
+
+static int tty_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_tty *h;
+	struct tty_struct *tty, *real_tty;
+	struct inode *inode;
+	int master_objref, slave_objref;
+	int ret;
+
+	tty = (struct tty_struct *)file->private_data;
+	inode = file->f_path.dentry->d_inode;
+	if (tty_paranoia_check(tty, inode, "tty_file_checkpoint"))
+		return -EIO;
+
+	if (!tty_can_checkpoint(ctx, tty))
+		return -ENOSYS;
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+
+	real_tty = tty_pair_get_tty(tty);
+	ckpt_debug("tty: %p, real_tty: %p\n", tty, real_tty);
+
+	master_objref = checkpoint_obj(ctx, real_tty->link, CKPT_OBJ_TTY);
+	if (master_objref < 0)
+		return master_objref;
+	slave_objref = checkpoint_obj(ctx, real_tty, CKPT_OBJ_TTY);
+	if (slave_objref < 0)
+		return slave_objref;
+	ckpt_debug("master %p %d, slave %p %d\n",
+		   real_tty->link, master_objref, real_tty, slave_objref);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_TTY;
+	h->tty_objref = (tty == real_tty ? slave_objref : master_objref);
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (!ret)
+		ret = ckpt_write_obj(ctx, &h->common.h);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int tty_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct tty_struct *tty;
+	struct inode *inode;
+	int ret;
+
+	tty = (struct tty_struct *)file->private_data;
+	inode = file->f_path.dentry->d_inode;
+	if (tty_paranoia_check(tty, inode, "tty_file_collect"))
+		return -EIO;
+
+	if (!tty_can_checkpoint(ctx, tty))
+		return -ENOSYS;
+
+	ckpt_debug("collecting tty: %p\n", tty);
+	ret = ckpt_obj_collect(ctx, tty, CKPT_OBJ_TTY);
+	if (ret < 0)
+		return ret;
+
+	if (tty->driver->subtype == PTY_TYPE_MASTER) {
+		if (!tty->link) {
+			ckpt_write_err(ctx, "TP", "tty: missing link\n", tty);
+			return -EIO;
+		}
+		ckpt_debug("collecting slave tty: %p\n", tty->link);
+		ret = ckpt_obj_collect(ctx, tty->link, CKPT_OBJ_TTY);
+	}
+
+	return ret;
+}
+
+#define CKPT_LDISC_BAD   (1 << TTY_LDISC_CHANGING)
+#define CKPT_LDISC_GOOD  ((1 << TTY_LDISC_OPEN) | (1 << TTY_LDISC))
+#define CKPT_LDISC_FLAGS (CKPT_LDISC_GOOD | CKPT_LDISC_BAD)
+
+static int checkpoint_tty_ldisc(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	struct ckpt_hdr_ldisc_n_tty *h;
+	int datalen, read_tail;
+	int n, ret;
+
+	/* shouldn't reach here unless ldisc is n_tty */
+	BUG_ON(tty->ldisc->ops->num != N_TTY);
+
+	if ((tty->flags & CKPT_LDISC_FLAGS) != CKPT_LDISC_GOOD) {
+		ckpt_write_err(ctx, "TP", "tty: bad ldisc flags %#lx\n",
+			       tty, tty->flags);
+		return -EBUSY;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TTY_LDISC);
+	if (!h)
+		return -ENOMEM;
+
+	spin_lock_irq(&tty->read_lock);
+	h->column = tty->column;
+	h->datalen = tty->read_cnt;
+	h->canon_column = tty->canon_column;
+	h->canon_datalen = tty->canon_head;
+	if (tty->canon_head > tty->read_tail)
+		h->canon_datalen -= tty->read_tail;
+	else
+		h->canon_datalen += N_TTY_BUF_SIZE - tty->read_tail;
+	h->canon_data = tty->canon_data;
+
+	datalen = tty->read_cnt;
+	read_tail = tty->read_tail;
+	spin_unlock_irq(&tty->read_lock);
+
+	h->minimum_to_wake = tty->minimum_to_wake;
+
+	h->stopped = tty->stopped;
+	h->hw_stopped = tty->hw_stopped;
+	h->flow_stopped = tty->flow_stopped;
+	h->packet = tty->packet;
+	h->ctrl_status = tty->ctrl_status;
+	h->lnext = tty->lnext;
+	h->erasing = tty->erasing;
+	h->raw = tty->raw;
+	h->real_raw = tty->real_raw;
+	h->icanon = tty->icanon;
+	h->closing = tty->closing;
+
+	BUILD_BUG_ON(sizeof(h->read_flags) != sizeof(tty->read_flags));
+	memcpy(h->read_flags, tty->read_flags, sizeof(tty->read_flags));
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("datalen %d\n", datalen);
+	if (datalen) {
+		ret = ckpt_write_buffer(ctx, NULL, datalen);
+		if (ret < 0)
+			return ret;
+		n = min(datalen, N_TTY_BUF_SIZE - read_tail);
+		ret = ckpt_kwrite(ctx, &tty->read_buf[read_tail], n);
+		if (ret < 0)
+			return ret;
+		n = datalen - n;
+		ret = ckpt_kwrite(ctx, tty->read_buf, n);
+	}
+
+	return ret;
+}
+
+#define CKPT_TTY_BAD   ((1 << TTY_CLOSING) | (1 << TTY_FLUSHING))
+#define CKPT_TTY_GOOD  0
+
+static int do_checkpoint_tty(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	struct ckpt_hdr_tty *h;
+	int link_objref;
+	int master = 0;
+	int ret;
+
+	if ((tty->flags & (CKPT_TTY_BAD | CKPT_TTY_GOOD)) != CKPT_TTY_GOOD) {
+		ckpt_write_err(ctx, "TP", "tty: bad flags %#lx\n",
+			       tty, tty->flags);
+		return -EBUSY;
+	}
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+	link_objref = ckpt_obj_lookup(ctx, tty->link, CKPT_OBJ_TTY);
+
+	if (tty->driver->subtype == PTY_TYPE_MASTER)
+		master = 1;
+
+	/* tty is master if-and-only-if link_objref is zero */
+	BUG_ON(master ^ !link_objref);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TTY);
+	if (!h)
+		return -ENOMEM;
+
+	h->driver_type = tty->driver->type;
+	h->driver_subtype = tty->driver->subtype;
+
+	h->link_objref = link_objref;
+
+	/* if master, reserve an objref (see do_restore_tty) */
+	h->file_objref = (master ? ckpt_obj_reserve(ctx) : 0);
+	ckpt_debug("link %d file %d\n", h->link_objref, h->file_objref);
+
+	h->index = tty->index;
+	h->ldisc = tty->ldisc->ops->num;
+	h->flags = tty->flags;
+
+	mutex_lock(&tty->termios_mutex);
+	h->termios.c_line = tty->termios->c_line;
+	h->termios.c_iflag = tty->termios->c_iflag;
+	h->termios.c_oflag = tty->termios->c_oflag;
+	h->termios.c_cflag = tty->termios->c_cflag;
+	h->termios.c_lflag = tty->termios->c_lflag;
+	memcpy(h->termios.c_cc, tty->termios->c_cc, NCC);
+	h->winsize.ws_row = tty->winsize.ws_row;
+	h->winsize.ws_col = tty->winsize.ws_col;
+	h->winsize.ws_ypixel = tty->winsize.ws_ypixel;
+	h->winsize.ws_xpixel = tty->winsize.ws_xpixel;
+	mutex_unlock(&tty->termios_mutex);
+
+	ret  = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	/* save line discipline data (also writes buffer) */
+	if (!test_bit(TTY_HUPPED, &tty->flags))
+		ret = checkpoint_tty_ldisc(ctx, tty);
+
+	return ret;
+}
+
+int checkpoint_tty(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_tty(ctx, (struct tty_struct *) ptr);
+}
+
+struct file *tty_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_tty *h = (struct ckpt_hdr_file_tty *) ptr;
+	struct tty_struct *tty;
+	struct file *file;
+	char slavepath[16];	/* "/dev/pts/###" */
+	int slavelock;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_TTY)
+		return ERR_PTR(-EINVAL);
+
+	if (h->tty_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	tty = ckpt_obj_fetch(ctx, h->tty_objref, CKPT_OBJ_TTY);
+	ckpt_debug("tty %p objref %d\n", tty, h->tty_objref);
+
+	/* at this point the tty should have been restored already */
+	if (IS_ERR(tty))
+		return (struct file *) tty;
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+
+	/*
+	 * If this tty is master, get the corresponding file from
+	 * tty->tty_files. Otherwise, open the slave device.
+	 */
+	if (tty->driver->subtype == PTY_TYPE_MASTER) {
+		file_list_lock();
+		file = list_first_entry(&tty->tty_files,
+					typeof(*file), f_u.fu_list);
+		get_file(file);
+		file_list_unlock();
+		ckpt_debug("master file %p\n", file);
+	} else {
+		sprintf(slavepath, "/dev/pts/%d", tty->index);
+		slavelock = test_bit(TTY_PTY_LOCK, &tty->link->flags);
+		clear_bit(TTY_PTY_LOCK, &tty->link->flags);
+		file = filp_open(slavepath, O_RDWR | O_NOCTTY, 0);
+		ckpt_debug("slave file %p (index %d)\n", file, tty->index);
+		if (IS_ERR(file))
+			return file;
+		if (slavelock)
+			set_bit(TTY_PTY_LOCK, &tty->link->flags);
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
+
+static int restore_tty_ldisc(struct ckpt_ctx *ctx,
+			     struct tty_struct *tty,
+			     struct ckpt_hdr_tty *hh)
+{
+	struct ckpt_hdr_ldisc_n_tty *h;
+	int ret = -EINVAL;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TTY_LDISC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* this is an unfair shortcut, because we know ldisc is n_tty */
+	if (hh->ldisc != N_TTY)
+		goto out;
+	if ((hh->flags & CKPT_LDISC_FLAGS) != CKPT_LDISC_GOOD)
+		goto out;
+
+	if (h->datalen > N_TTY_BUF_SIZE)
+		goto out;
+	if (h->canon_datalen > N_TTY_BUF_SIZE)
+		goto out;
+
+	ret = 0;
+	if (h->datalen)
+		ret = _ckpt_read_buffer(ctx, tty->read_buf, h->datalen);
+	if (ret < 0)
+		goto out;
+
+	/* TODO: sanitize all these values ? */
+
+	spin_lock_irq(&tty->read_lock);
+	tty->column = h->column;
+	tty->read_cnt = h->datalen;
+	tty->read_head = h->datalen;
+	tty->read_tail = 0;
+	tty->canon_column = h->canon_column;
+	tty->canon_head = h->canon_datalen;
+	tty->canon_data = h->canon_data;
+	spin_unlock_irq(&tty->read_lock);
+
+	tty->minimum_to_wake = h->minimum_to_wake;
+
+	tty->stopped = h->stopped;
+	tty->hw_stopped = h->hw_stopped;
+	tty->flow_stopped = h->flow_stopped;
+	tty->packet = h->packet;
+	tty->ctrl_status = h->ctrl_status;
+	tty->lnext = h->lnext;
+	tty->erasing = h->erasing;
+	tty->raw = h->raw;
+	tty->real_raw = h->real_raw;
+	tty->icanon = h->icanon;
+	tty->closing = h->closing;
+
+	BUILD_BUG_ON(sizeof(h->read_flags) != sizeof(tty->read_flags));
+	memcpy(tty->read_flags, h->read_flags, sizeof(tty->read_flags));
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#define CKPT_PTMX_PATH  "/dev/ptmx"
+
+static struct tty_struct *do_restore_tty(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tty *h;
+	struct tty_struct *tty = ERR_PTR(-EINVAL);
+	struct file *file = NULL;
+	int master, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TTY);
+	if (IS_ERR(h))
+		return (struct tty_struct *) h;
+
+	if (h->driver_type != TTY_DRIVER_TYPE_PTY)
+		goto out;
+	if (h->driver_subtype == PTY_TYPE_MASTER)
+		master = 1;
+	else if (h->driver_subtype == PTY_TYPE_SLAVE)
+		master = 0;
+	else
+		goto out;
+	/* @link_objref is positive if-and-only-if tty is not master */
+	if (h->link_objref < 0 || (master ^ !h->link_objref))
+		goto out;
+	/* @file_objref is positive if-and-only-if tty is master */
+	if (h->file_objref < 0 || (master ^ !!h->file_objref))
+		goto out;
+	if (h->flags & CKPT_TTY_BAD)
+		goto out;
+	/* hung-up tty cannot be master, or have session or pgrp */
+	if (test_bit(TTY_HUPPED, (unsigned long *) &h->flags) && master)
+		goto out;
+
+	ckpt_debug("sanity checks passed, link %d\n", h->link_objref);
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+	if (master) {
+		file = pty_open_by_index(CKPT_PTMX_PATH, h->index);
+		if (IS_ERR(file)) {
+			ckpt_write_err(ctx, "TE", "open ptmx", PTR_ERR(file));
+			tty = ERR_PTR(PTR_ERR(file));
+			goto out;
+		}
+
+		/*
+		 * Add file to objhash to ensure proper cleanup later
+		 * (it isn't referenced elsewhere). Use h->file_objref
+		 * which was explicitly reserved during checkpoint for this.
+		 */
+		ret = ckpt_obj_insert(ctx, file, h->file_objref, CKPT_OBJ_FILE);
+		fput(file);  /* even on success (referenced in objhash) */
+		if (ret < 0) {
+			tty = ERR_PTR(ret);
+			goto out;
+		}
+
+		tty = file->private_data;
+	} else {
+		tty = ckpt_obj_fetch(ctx, h->link_objref, CKPT_OBJ_TTY);
+		if (IS_ERR(tty))
+			goto out;
+		tty = tty->link;
+	}
+
+	ckpt_debug("tty %p (hup %d)\n",
+		   tty, test_bit(TTY_HUPPED, (unsigned long *) &h->flags));
+
+	/* we now have the desired tty: restore its state as per @h */
+
+	mutex_lock(&tty->termios_mutex);
+	tty->termios->c_line = h->termios.c_line;
+	tty->termios->c_iflag = h->termios.c_iflag;
+	tty->termios->c_oflag = h->termios.c_oflag;
+	tty->termios->c_cflag = h->termios.c_cflag;
+	tty->termios->c_lflag = h->termios.c_lflag;
+	memcpy(tty->termios->c_cc, h->termios.c_cc, NCC);
+	tty->winsize.ws_row = h->winsize.ws_row;
+	tty->winsize.ws_col = h->winsize.ws_col;
+	tty->winsize.ws_ypixel = h->winsize.ws_ypixel;
+	tty->winsize.ws_xpixel = h->winsize.ws_xpixel;
+	mutex_unlock(&tty->termios_mutex);
+
+	if (test_bit(TTY_HUPPED, (unsigned long *) &h->flags))
+		tty_vhangup(tty);
+	else {
+		ret = restore_tty_ldisc(ctx, tty, h);
+		if (ret < 0) {
+			tty = ERR_PTR(ret);
+			goto out;
+		}
+	}
+
+	tty_kref_get(tty);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return tty;
+}
+
+void *restore_tty(struct ckpt_ctx *ctx)
+{
+#ifdef CONFIG_UNIX98_PTYS
+	return (void *) do_restore_tty(ctx);
+#else
+	return ERR_PTR(-ENOSYS);
+#endif
+}
+#endif /* CONFIG_CHECKPOINT */
+
 /*
  * This implements the "Secure Attention Key" ---  the idea is to
  * prevent trojan horses by killing all processes associated with this
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 92a21b2..7c117fc 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -266,6 +266,10 @@ extern int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref);
 extern int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task_signal(struct ckpt_ctx *ctx);
 
+/* ttys */
+extern int checkpoint_tty(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_tty(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ac16c59..bf584cb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -83,6 +83,8 @@ enum {
 	CKPT_HDR_FILE_NAME,
 	CKPT_HDR_FILE,
 	CKPT_HDR_PIPE_BUF,
+	CKPT_HDR_TTY,
+	CKPT_HDR_TTY_LDISC,
 
 	CKPT_HDR_MM = 401,
 	CKPT_HDR_VMA,
@@ -141,6 +143,7 @@ enum obj_type {
 	CKPT_OBJ_USER,
 	CKPT_OBJ_GROUPINFO,
 	CKPT_OBJ_SOCK,
+	CKPT_OBJ_TTY,
 	CKPT_OBJ_MAX
 };
 
@@ -161,6 +164,9 @@ struct ckpt_const {
 	__u16 uts_domainname_len;
 	/* rlimit */
 	__u16 rlimit_nlimits;
+	/* tty */
+	__u16 n_tty_buf_size;
+	__u16 tty_termios_ncc;
 } __attribute__((aligned(8)));
 
 /* checkpoint image header */
@@ -368,6 +374,7 @@ enum file_type {
 	CKPT_FILE_PIPE,
 	CKPT_FILE_FIFO,
 	CKPT_FILE_SOCKET,
+	CKPT_FILE_TTY,
 	CKPT_FILE_MAX
 };
 
@@ -660,6 +667,84 @@ struct ckpt_hdr_ipc_sem {
 } __attribute__((aligned(8)));
 
 
+/* devices */
+struct ckpt_hdr_file_tty {
+	struct ckpt_hdr_file common;
+	__s32 tty_objref;
+};
+
+struct ckpt_hdr_tty {
+	struct ckpt_hdr h;
+
+	__u16 driver_type;
+	__u16 driver_subtype;
+
+	__s32 link_objref;
+	__s32 file_objref;
+	__u32 _padding;
+
+	__u32 index;
+	__u32 ldisc;
+	__u64 flags;
+
+	/* termios */
+	struct {
+		__u16 c_iflag;
+		__u16 c_oflag;
+		__u16 c_cflag;
+		__u16 c_lflag;
+		__u8 c_line;
+		__u8 c_cc[CKPT_TTY_NCC];
+	} __attribute__((aligned(8))) termios;
+
+	/* winsize */
+	struct {
+		__u16 ws_row;
+		__u16 ws_col;
+		__u16 ws_xpixel;
+		__u16 ws_ypixel;
+	} __attribute__((aligned(8))) winsize;
+} __attribute__((aligned(8)));
+
+/* cannot include <linux/tty.h> from userspace, so define: */
+#define CKPT_N_TTY_BUF_SIZE  4096
+#ifdef __KERNEL__
+#include <linux/tty.h>
+#if CKPT_N_TTY_BUF_SIZE != N_TTY_BUF_SIZE
+#error CKPT_N_TTY_BUF_SIZE size is wrong per linux/tty.h
+#endif
+#endif
+
+struct ckpt_hdr_ldisc_n_tty {
+	struct ckpt_hdr h;
+
+	__u32 column;
+	__u32 datalen;
+	__u32 canon_column;
+	__u32 canon_datalen;
+	__u32 canon_data;
+
+	__u16 minimum_to_wake;
+
+	__u8 stopped;
+	__u8 hw_stopped;
+	__u8 flow_stopped;
+	__u8 packet;
+	__u8 ctrl_status;
+	__u8 lnext;
+	__u8 erasing;
+	__u8 raw;
+	__u8 real_raw;
+	__u8 icanon;
+	__u8 closing;
+	__u8 padding[3];
+
+	__u8 read_flags[CKPT_N_TTY_BUF_SIZE / 8];
+
+	/* if @datalen > 0, buffer contents follow (next object) */
+} __attribute__((aligned(8)));
+
+
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
 
diff --git a/include/linux/tty.h b/include/linux/tty.h
index fd40561..295447b 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -471,6 +471,13 @@ extern void tty_ldisc_enable(struct tty_struct *tty);
 /* This one is for ptmx_close() */
 extern int tty_release(struct inode *inode, struct file *filp);
 
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern struct file *tty_file_restore(struct ckpt_ctx *ctx,
+				     struct ckpt_hdr_file *ptr);
+#endif
+
 /* n_tty.c */
 extern struct tty_ldisc_ops tty_ldisc_N_TTY;
 
-- 
1.6.0.4


* [PATCH v18 72/80] c/r: [pty 2/2] support for pseudo terminals
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
  2009-09-23 23:50   ` [PATCH v18 01/80] c/r: extend arch_setup_additional_pages() Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 03/80] c/r: break out new_user_ns() Oren Laadan
                     ` (79 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

This patch adds support for checkpoint and restart of pseudo terminals
(PTYs). Since PTYs are shared (pointed to by file, and signal), they
are managed via objhash.

PTYs are master/slave pairs; the code arranges for the master to
always be checkpointed first, followed by the slave. This is important
since during restart both ends are created when restoring the master.

In this patch only UNIX98 style PTYs are supported.

Currently only PTYs that are referenced by open files are handled.
Thus PTY checkpoint starts with a file in tty_file_checkpoint(). It
first checkpoints the master and slave PTYs via checkpoint_tty(),
and then completes the saving of the file descriptor. This means that
in the image file, the order of objects is: master-tty, slave-tty,
file-desc.
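
The ordering above can be sketched in plain userspace C. This is a toy
model under stated assumptions, not the kernel code: struct toy_ctx,
toy_checkpoint_obj() and toy_file_checkpoint() are hypothetical names,
and the real checkpoint_obj() and objhash do considerably more. It only
shows that a shared object is written once, on first encounter, so that
saving master, slave and then the file record yields the stated order.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_OBJS 16

/* Toy stand-in for the objhash: maps a pointer to a small positive objref. */
struct toy_ctx {
	const void *objs[TOY_MAX_OBJS];
	int nobjs;
	char image[128];	/* textual stand-in for the image stream */
};

/* Append one record name to the toy image stream. */
static void toy_emit(struct toy_ctx *ctx, const char *name)
{
	assert(strlen(ctx->image) + strlen(name) + 2 <= sizeof(ctx->image));
	strcat(ctx->image, name);
	strcat(ctx->image, " ");
}

/*
 * Return the objref of @ptr. The object is written to the image only
 * the first time it is seen; later calls just return the same objref.
 */
static int toy_checkpoint_obj(struct toy_ctx *ctx, const void *ptr,
			      const char *name)
{
	int i;

	for (i = 0; i < ctx->nobjs; i++)
		if (ctx->objs[i] == ptr)
			return i + 1;	/* already saved: reuse objref */

	assert(ctx->nobjs < TOY_MAX_OBJS);
	ctx->objs[ctx->nobjs++] = ptr;
	toy_emit(ctx, name);
	return ctx->nobjs;
}

/* Save one open pty file: master tty first, then slave, then the file. */
static int toy_file_checkpoint(struct toy_ctx *ctx, const void *master,
			       const void *slave, int file_is_master)
{
	int master_objref = toy_checkpoint_obj(ctx, master, "master-tty");
	int slave_objref = toy_checkpoint_obj(ctx, slave, "slave-tty");

	toy_emit(ctx, "file-desc");
	return file_is_master ? master_objref : slave_objref;
}
```

In this model, checkpointing a second file that refers to the same pair
adds only another file-desc record, since both ttys are already known.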

During restart, to restore the master side, we open the /dev/ptmx
device and get a file handle. But at this point we don't know the
designated objref for this file, because the file is due later on in
the image stream. On the other hand, we can't just fput() the file
because it will close the PTY too.

Instead, when we checkpoint the master PTY, we _reserve_ an objref
for the file (which won't be further used in checkpoint). Then at
restart, we use it to insert the file into objhash.
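
The reservation idea can be illustrated with a small userspace model.
It is a sketch under assumptions: struct toy_hash, toy_obj_reserve() and
toy_obj_insert() are hypothetical stand-ins for the objhash,
ckpt_obj_reserve() and ckpt_obj_insert(); the real signatures and
semantics are the ones in this patch series, not these.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REFS 32

/* Toy objhash: slot[i] holds the object registered under objref i. */
struct toy_hash {
	void *slot[TOY_MAX_REFS];
	int next_objref;	/* next unused objref; 0 means "none" */
};

static void toy_hash_init(struct toy_hash *h)
{
	int i;

	for (i = 0; i < TOY_MAX_REFS; i++)
		h->slot[i] = NULL;
	h->next_objref = 1;
}

/* Checkpoint side: set aside an objref without attaching an object. */
static int toy_obj_reserve(struct toy_hash *h)
{
	assert(h->next_objref < TOY_MAX_REFS);
	return h->next_objref++;
}

/* Restart side: register @ptr under an objref read from the image. */
static int toy_obj_insert(struct toy_hash *h, void *ptr, int objref)
{
	if (objref <= 0 || objref >= TOY_MAX_REFS || h->slot[objref])
		return -1;	/* invalid, or objref already in use */
	h->slot[objref] = ptr;
	return 0;
}

/* Record written for a master tty: it carries the reserved file objref. */
struct toy_tty_rec {
	int file_objref;
};

/* Checkpoint: no file object is saved here, only its objref is reserved. */
static struct toy_tty_rec toy_checkpoint_master(struct toy_hash *h)
{
	struct toy_tty_rec rec;

	rec.file_objref = toy_obj_reserve(h);
	return rec;
}

/* Restart: the freshly opened master file is parked under that objref. */
static int toy_restore_master(struct toy_hash *h,
			      const struct toy_tty_rec *rec,
			      void *opened_file)
{
	return toy_obj_insert(h, opened_file, rec->file_objref);
}
```

Because the objref is consumed at checkpoint time, no later object in
the image can collide with it, and at restart the hash holds the file
until the matching file-desc record arrives or cleanup runs.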

TODO:

* Better sanitize input from checkpoint image on restore
* Check the locking when saving/restoring tty_struct state
* Echo position/buffer isn't saved (is it needed ?)
* Handle multiple devpts mounts (namespaces)
* Paths of ptmx and slaves are hard coded (/dev/ptmx, /dev/pts/...)

Changelog[v4]:
  - Fix error path(s) in restore_tty_ldisc()
  - Fix memory leak in restore_tty_ldisc()
Changelog[v3]:
  - [Serge Hallyn] Set tty on error path
Changelog[v2]:
  - Don't save/restore tty->{session,pgrp}
  - Fix leak: drop file reference after ckpt_obj_insert()
  - Move get_file() inside locked clause (fix race)
Changelog[v1]:
  - Adjust include/asm/checkpoint_hdr.h for s390 architecture
  - Add NCC to kernel constants header (ckpt_hdr_const)
  - [Serge Hallyn] fix calculation of canon_datalen

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 arch/s390/include/asm/checkpoint_hdr.h |   11 +
 arch/x86/include/asm/checkpoint_hdr.h  |   11 +
 checkpoint/checkpoint.c                |    3 +
 checkpoint/files.c                     |    6 +
 checkpoint/objhash.c                   |   26 ++
 checkpoint/restart.c                   |    6 +
 drivers/char/pty.c                     |    1 +
 drivers/char/tty_io.c                  |  499 ++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h             |    4 +
 include/linux/checkpoint_hdr.h         |   85 ++++++
 include/linux/tty.h                    |    7 +
 11 files changed, 659 insertions(+), 0 deletions(-)

diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
index 1976355..b6ea8ce 100644
--- a/arch/s390/include/asm/checkpoint_hdr.h
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -83,13 +83,24 @@ struct ckpt_hdr_mm_context {
 };
 
 #define CKPT_ARCH_NSIG  64
+#define CKPT_TTY_NCC  8
+
+/* arch dependent constants */
 #ifdef __KERNEL__
+
 #include <asm/signal.h>
 #if CKPT_ARCH_NSIG != _SIGCONTEXT_NSIG
 #error CKPT_ARCH_NSIG size is wrong (asm/sigcontext.h and asm/checkpoint_hdr.h)
 #endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
 #endif
 
+#endif /* __KERNEL__ */
+
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 };
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 1228d1b..7a24de5 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -48,14 +48,25 @@ enum {
 	CKPT_HDR_MM_CONTEXT_LDT,
 };
 
+/* arch dependent constants */
 #define CKPT_ARCH_NSIG  64
+#define CKPT_TTY_NCC  8
+
 #ifdef __KERNEL__
+
 #include <asm/signal.h>
 #if CKPT_ARCH_NSIG != _NSIG
 #error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
 #endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
 #endif
 
+#endif /* __KERNEL__ */
+
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 	/* FIXME: add HAVE_HWFP */
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index ae79df7..dbe9e10 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -299,6 +299,9 @@ static void fill_kernel_const(struct ckpt_const *h)
 	h->uts_domainname_len = sizeof(uts->domainname);
 	/* rlimit */
 	h->rlimit_nlimits = RLIM_NLIMITS;
+	/* tty */
+	h->n_tty_buf_size = N_TTY_BUF_SIZE;
+	h->tty_termios_ncc = NCC;
 }
 
 /* write the checkpoint header */
diff --git a/checkpoint/files.c b/checkpoint/files.c
index 058bc0e..27e29a0 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -598,6 +598,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_SOCKET,
 		.restore = sock_file_restore,
 	},
+	/* tty */
+	{
+		.file_name = "TTY",
+		.file_type = CKPT_FILE_TTY,
+		.restore = tty_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 0978060..f84388d 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -269,6 +269,22 @@ static int obj_sock_users(void *ptr)
 	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
 }
 
+static int obj_tty_grab(void *ptr)
+{
+	tty_kref_get((struct tty_struct *) ptr);
+	return 0;
+}
+
+static void obj_tty_drop(void *ptr, int lastref)
+{
+	tty_kref_put((struct tty_struct *) ptr);
+}
+
+static int obj_tty_users(void *ptr)
+{
+	return atomic_read(&((struct tty_struct *) ptr)->kref.refcount);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -407,6 +423,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_sock,
 		.restore = restore_sock,
 	},
+	/* struct tty_struct */
+	{
+		.obj_name = "TTY",
+		.obj_type = CKPT_OBJ_TTY,
+		.ref_drop = obj_tty_drop,
+		.ref_grab = obj_tty_grab,
+		.ref_users = obj_tty_users,
+		.checkpoint = checkpoint_tty,
+		.restore = restore_tty,
+	},
 };
 
 
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 340698a..1016278 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -19,6 +19,7 @@
 #include <linux/freezer.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
+#include <linux/termios.h>
 #include <asm/syscall.h>
 #include <linux/elf.h>
 #include <linux/deferqueue.h>
@@ -402,6 +403,11 @@ static int check_kernel_const(struct ckpt_const *h)
 	/* rlimit */
 	if (h->rlimit_nlimits != RLIM_NLIMITS)
 		return -EINVAL;
+	/* tty */
+	if (h->n_tty_buf_size != N_TTY_BUF_SIZE)
+		return -EINVAL;
+	if (h->tty_termios_ncc != NCC)
+		return -EINVAL;
 
 	return 0;
 }
diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index e2fef99..5fb4ec5 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -15,6 +15,7 @@
 
 #include <linux/errno.h>
 #include <linux/interrupt.h>
+#include <linux/file.h>
 #include <linux/tty.h>
 #include <linux/tty_flip.h>
 #include <linux/fcntl.h>
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 7853ea2..72f4432 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -106,6 +106,7 @@
 
 #include <linux/kmod.h>
 #include <linux/nsproxy.h>
+#include <linux/checkpoint.h>
 
 #undef TTY_DEBUG_HANGUP
 
@@ -151,6 +152,13 @@ static long tty_compat_ioctl(struct file *file, unsigned int cmd,
 #define tty_compat_ioctl NULL
 #endif
 static int tty_fasync(int fd, struct file *filp, int on);
+#ifdef CONFIG_CHECKPOINT
+static int tty_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+static int tty_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#else
+#define tty_file_checkpoint NULL
+#define tty_file_collect NULL
+#endif /* CONFIG_CHECKPOINT */
 static void release_tty(struct tty_struct *tty, int idx);
 static void __proc_set_tty(struct task_struct *tsk, struct tty_struct *tty);
 static void proc_set_tty(struct task_struct *tsk, struct tty_struct *tty);
@@ -417,6 +425,8 @@ static const struct file_operations tty_fops = {
 	.open		= tty_open,
 	.release	= tty_release,
 	.fasync		= tty_fasync,
+	.checkpoint	= tty_file_checkpoint,
+	.collect	= tty_file_collect,
 };
 
 static const struct file_operations console_fops = {
@@ -439,6 +449,8 @@ static const struct file_operations hung_up_tty_fops = {
 	.unlocked_ioctl	= hung_up_tty_ioctl,
 	.compat_ioctl	= hung_up_tty_compat_ioctl,
 	.release	= tty_release,
+	.checkpoint	= tty_file_checkpoint,
+	.collect	= tty_file_collect,
 };
 
 static DEFINE_SPINLOCK(redirect_lock);
@@ -2586,6 +2598,493 @@ static long tty_compat_ioctl(struct file *file, unsigned int cmd,
 }
 #endif
 
+#ifdef CONFIG_CHECKPOINT
+static int tty_can_checkpoint(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	/* only support pty driver */
+	if (tty->driver->type != TTY_DRIVER_TYPE_PTY) {
+		ckpt_write_err(ctx, "TSP", "tty: unknown driver type %d",
+			       tty->driver, tty, tty->driver->type);
+		return 0;
+	}
+	/* only support unix98 style */
+	if (tty->driver->major != UNIX98_PTY_MASTER_MAJOR &&
+	    tty->driver->major != UNIX98_PTY_SLAVE_MAJOR) {
+		ckpt_write_err(ctx, "TP", "tty: legacy pty", tty);
+		return 0;
+	}
+	/* only support n_tty ldisc */
+	if (tty->ldisc->ops->num != N_TTY) {
+		ckpt_write_err(ctx, "TSP", "tty: unknown ldisc type %d",
+			       tty->ldisc->ops, tty, tty->ldisc->ops->num);
+		return 0;
+	}
+
+	return 1;
+}
+
+static int tty_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_tty *h;
+	struct tty_struct *tty, *real_tty;
+	struct inode *inode;
+	int master_objref, slave_objref;
+	int ret;
+
+	tty = (struct tty_struct *)file->private_data;
+	inode = file->f_path.dentry->d_inode;
+	if (tty_paranoia_check(tty, inode, "tty_file_checkpoint"))
+		return -EIO;
+
+	if (!tty_can_checkpoint(ctx, tty))
+		return -ENOSYS;
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+
+	real_tty = tty_pair_get_tty(tty);
+	ckpt_debug("tty: %p, real_tty: %p\n", tty, real_tty);
+
+	master_objref = checkpoint_obj(ctx, real_tty->link, CKPT_OBJ_TTY);
+	if (master_objref < 0)
+		return master_objref;
+	slave_objref = checkpoint_obj(ctx, real_tty, CKPT_OBJ_TTY);
+	if (slave_objref < 0)
+		return slave_objref;
+	ckpt_debug("master %p %d, slave %p %d\n",
+		   real_tty->link, master_objref, real_tty, slave_objref);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_TTY;
+	h->tty_objref = (tty == real_tty ? slave_objref : master_objref);
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (!ret)
+		ret = ckpt_write_obj(ctx, &h->common.h);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int tty_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct tty_struct *tty;
+	struct inode *inode;
+	int ret;
+
+	tty = (struct tty_struct *)file->private_data;
+	inode = file->f_path.dentry->d_inode;
+	if (tty_paranoia_check(tty, inode, "tty_collect"))
+		return -EIO;
+
+	if (!tty_can_checkpoint(ctx, tty))
+		return -ENOSYS;
+
+	ckpt_debug("collecting tty: %p\n", tty);
+	ret = ckpt_obj_collect(ctx, tty, CKPT_OBJ_TTY);
+	if (ret < 0)
+		return ret;
+
+	if (tty->driver->subtype == PTY_TYPE_MASTER) {
+		if (!tty->link) {
+			ckpt_write_err(ctx, "TP", "tty: missing link\n", tty);
+			return -EIO;
+		}
+		ckpt_debug("collecting slave tty: %p\n", tty->link);
+		ret = ckpt_obj_collect(ctx, tty->link, CKPT_OBJ_TTY);
+	}
+
+	return ret;
+}
+
+#define CKPT_LDISC_BAD   (1 << TTY_LDISC_CHANGING)
+#define CKPT_LDISC_GOOD  ((1 << TTY_LDISC_OPEN) | (1 << TTY_LDISC))
+#define CKPT_LDISC_FLAGS (CKPT_LDISC_GOOD | CKPT_LDISC_BAD)
+
+static int checkpoint_tty_ldisc(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	struct ckpt_hdr_ldisc_n_tty *h;
+	int datalen, read_tail;
+	int n, ret;
+
+	/* shouldn't reach here unless ldisc is n_tty */
+	BUG_ON(tty->ldisc->ops->num != N_TTY);
+
+	if ((tty->flags & CKPT_LDISC_FLAGS) != CKPT_LDISC_GOOD) {
+		ckpt_write_err(ctx, "TP", "tty: bad ldisc flags %#lx\n",
+			       tty, tty->flags);
+		return -EBUSY;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TTY_LDISC);
+	if (!h)
+		return -ENOMEM;
+
+	spin_lock_irq(&tty->read_lock);
+	h->column = tty->column;
+	h->datalen = tty->read_cnt;
+	h->canon_column = tty->canon_column;
+	h->canon_datalen = tty->canon_head;
+	if (tty->canon_head > tty->read_tail)
+		h->canon_datalen -= tty->read_tail;
+	else
+		h->canon_datalen += N_TTY_BUF_SIZE - tty->read_tail;
+	h->canon_data = tty->canon_data;
+
+	datalen = tty->read_cnt;
+	read_tail = tty->read_tail;
+	spin_unlock_irq(&tty->read_lock);
+
+	h->minimum_to_wake = tty->minimum_to_wake;
+
+	h->stopped = tty->stopped;
+	h->hw_stopped = tty->hw_stopped;
+	h->flow_stopped = tty->flow_stopped;
+	h->packet = tty->packet;
+	h->ctrl_status = tty->ctrl_status;
+	h->lnext = tty->lnext;
+	h->erasing = tty->erasing;
+	h->raw = tty->raw;
+	h->real_raw = tty->real_raw;
+	h->icanon = tty->icanon;
+	h->closing = tty->closing;
+
+	BUILD_BUG_ON(sizeof(h->read_flags) != sizeof(tty->read_flags));
+	memcpy(h->read_flags, tty->read_flags, sizeof(tty->read_flags));
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("datalen %d\n", datalen);
+	if (datalen) {
+		ret = ckpt_write_buffer(ctx, NULL, datalen);
+		if (ret < 0)
+			return ret;
+		n = min(datalen, N_TTY_BUF_SIZE - read_tail);
+		ret = ckpt_kwrite(ctx, &tty->read_buf[read_tail], n);
+		if (ret < 0)
+			return ret;
+		n = datalen - n;
+		ret = ckpt_kwrite(ctx, tty->read_buf, n);
+	}
+
+	return ret;
+}
+
+#define CKPT_TTY_BAD   ((1 << TTY_CLOSING) | (1 << TTY_FLUSHING))
+#define CKPT_TTY_GOOD  0
+
+static int do_checkpoint_tty(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	struct ckpt_hdr_tty *h;
+	int link_objref;
+	int master = 0;
+	int ret;
+
+	if ((tty->flags & (CKPT_TTY_BAD | CKPT_TTY_GOOD)) != CKPT_TTY_GOOD) {
+		ckpt_write_err(ctx, "TP", "tty: bad flags %#lx\n",
+			       tty, tty->flags);
+		return -EBUSY;
+	}
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+	link_objref = ckpt_obj_lookup(ctx, tty->link, CKPT_OBJ_TTY);
+
+	if (tty->driver->subtype == PTY_TYPE_MASTER)
+		master = 1;
+
+	/* tty is master if-and-only-if link_objref is zero */
+	BUG_ON(master ^ !link_objref);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TTY);
+	if (!h)
+		return -ENOMEM;
+
+	h->driver_type = tty->driver->type;
+	h->driver_subtype = tty->driver->subtype;
+
+	h->link_objref = link_objref;
+
+	/* if master, reserve an objref (see do_restore_tty) */
+	h->file_objref = (master ? ckpt_obj_reserve(ctx) : 0);
+	ckpt_debug("link %d file %d\n", h->link_objref, h->file_objref);
+
+	h->index = tty->index;
+	h->ldisc = tty->ldisc->ops->num;
+	h->flags = tty->flags;
+
+	mutex_lock(&tty->termios_mutex);
+	h->termios.c_line = tty->termios->c_line;
+	h->termios.c_iflag = tty->termios->c_iflag;
+	h->termios.c_oflag = tty->termios->c_oflag;
+	h->termios.c_cflag = tty->termios->c_cflag;
+	h->termios.c_lflag = tty->termios->c_lflag;
+	memcpy(h->termios.c_cc, tty->termios->c_cc, NCC);
+	h->winsize.ws_row = tty->winsize.ws_row;
+	h->winsize.ws_col = tty->winsize.ws_col;
+	h->winsize.ws_ypixel = tty->winsize.ws_ypixel;
+	h->winsize.ws_xpixel = tty->winsize.ws_xpixel;
+	mutex_unlock(&tty->termios_mutex);
+
+	ret  = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	/* save line discipline data (also writes buffer) */
+	if (!test_bit(TTY_HUPPED, &tty->flags))
+		ret = checkpoint_tty_ldisc(ctx, tty);
+
+	return ret;
+}
+
+int checkpoint_tty(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_tty(ctx, (struct tty_struct *) ptr);
+}
+
+struct file *tty_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_tty *h = (struct ckpt_hdr_file_tty *) ptr;
+	struct tty_struct *tty;
+	struct file *file;
+	char slavepath[16];	/* "/dev/pts/###" */
+	int slavelock;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_TTY)
+		return ERR_PTR(-EINVAL);
+
+	if (h->tty_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	tty = ckpt_obj_fetch(ctx, h->tty_objref, CKPT_OBJ_TTY);
+	ckpt_debug("tty %p objref %d\n", tty, h->tty_objref);
+
+	/* at this point the tty should have been restored already */
+	if (IS_ERR(tty))
+		return (struct file *) tty;
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+
+	/*
+	 * If this tty is master, get the corresponding file from
+	 * tty->tty_files. Otherwise, open the slave device.
+	 */
+	if (tty->driver->subtype == PTY_TYPE_MASTER) {
+		file_list_lock();
+		file = list_first_entry(&tty->tty_files,
+					typeof(*file), f_u.fu_list);
+		get_file(file);
+		file_list_unlock();
+		ckpt_debug("master file %p\n", file);
+	} else {
+		sprintf(slavepath, "/dev/pts/%d", tty->index);
+		slavelock = test_bit(TTY_PTY_LOCK, &tty->link->flags);
+		clear_bit(TTY_PTY_LOCK, &tty->link->flags);
+		file = filp_open(slavepath, O_RDWR | O_NOCTTY, 0);
+		ckpt_debug("slave file %p (index %d)\n", file, tty->index);
+		if (IS_ERR(file))
+			return file;
+		if (slavelock)
+			set_bit(TTY_PTY_LOCK, &tty->link->flags);
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
+
+static int restore_tty_ldisc(struct ckpt_ctx *ctx,
+			     struct tty_struct *tty,
+			     struct ckpt_hdr_tty *hh)
+{
+	struct ckpt_hdr_ldisc_n_tty *h;
+	int ret = -EINVAL;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TTY_LDISC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* this is an unfair shortcut, because we know ldisc is n_tty */
+	if (hh->ldisc != N_TTY)
+		goto out;
+	if ((hh->flags & CKPT_LDISC_FLAGS) != CKPT_LDISC_GOOD)
+		goto out;
+
+	if (h->datalen > N_TTY_BUF_SIZE)
+		goto out;
+	if (h->canon_datalen > N_TTY_BUF_SIZE)
+		goto out;
+
+	if (h->datalen) {
+		ret = _ckpt_read_buffer(ctx, tty->read_buf, h->datalen);
+		if (ret < 0)
+			goto out;
+	}
+
+	/* TODO: sanitize all these values ? */
+
+	spin_lock_irq(&tty->read_lock);
+	tty->column = h->column;
+	tty->read_cnt = h->datalen;
+	tty->read_head = h->datalen;
+	tty->read_tail = 0;
+	tty->canon_column = h->canon_column;
+	tty->canon_head = h->canon_datalen;
+	tty->canon_data = h->canon_data;
+	spin_unlock_irq(&tty->read_lock);
+
+	tty->minimum_to_wake = h->minimum_to_wake;
+
+	tty->stopped = h->stopped;
+	tty->hw_stopped = h->hw_stopped;
+	tty->flow_stopped = h->flow_stopped;
+	tty->packet = h->packet;
+	tty->ctrl_status = h->ctrl_status;
+	tty->lnext = h->lnext;
+	tty->erasing = h->erasing;
+	tty->raw = h->raw;
+	tty->real_raw = h->real_raw;
+	tty->icanon = h->icanon;
+	tty->closing = h->closing;
+
+	BUILD_BUG_ON(sizeof(h->read_flags) != sizeof(tty->read_flags));
+	memcpy(tty->read_flags, h->read_flags, sizeof(tty->read_flags));
+ out:
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+#define CKPT_PTMX_PATH  "/dev/ptmx"
+
+static struct tty_struct *do_restore_tty(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tty *h;
+	struct tty_struct *tty = ERR_PTR(-EINVAL);
+	struct file *file = NULL;
+	int master, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TTY);
+	if (IS_ERR(h))
+		return (struct tty_struct *) h;
+
+	if (h->driver_type != TTY_DRIVER_TYPE_PTY)
+		goto out;
+	if (h->driver_subtype == PTY_TYPE_MASTER)
+		master = 1;
+	else if (h->driver_subtype == PTY_TYPE_SLAVE)
+		master = 0;
+	else
+		goto out;
+	/* @link_objref is positive if-and-only-if tty is not master */
+	if (h->link_objref < 0 || (master ^ !h->link_objref))
+		goto out;
+	/* @file_objref is positive if-and-only-if tty is master */
+	if (h->file_objref < 0 || (master ^ !!h->file_objref))
+		goto out;
+	if (h->flags & CKPT_TTY_BAD)
+		goto out;
+	/* hung-up tty cannot be master, or have session or pgrp */
+	if (test_bit(TTY_HUPPED, (unsigned long *) &h->flags) && master)
+		goto out;
+
+	ckpt_debug("sanity checks passed, link %d\n", h->link_objref);
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+	if (master) {
+		file = pty_open_by_index("/dev/ptmx", h->index);
+		if (IS_ERR(file)) {
+			ckpt_write_err(ctx, "TE", "open ptmx", PTR_ERR(file));
+			tty = ERR_PTR(PTR_ERR(file));
+			goto out;
+		}
+
+		/*
+		 * Add file to objhash to ensure proper cleanup later
+		 * (it isn't referenced elsewhere). Use h->file_objref
+		 * which was explicitly reserved during checkpoint for this.
+		 */
+		ret = ckpt_obj_insert(ctx, file, h->file_objref, CKPT_OBJ_FILE);
+		fput(file);  /* even on success (referenced in objhash) */
+		if (ret < 0) {
+			tty = ERR_PTR(ret);
+			goto out;
+		}
+
+		tty = file->private_data;
+	} else {
+		tty = ckpt_obj_fetch(ctx, h->link_objref, CKPT_OBJ_TTY);
+		if (IS_ERR(tty))
+			goto out;
+		tty = tty->link;
+	}
+
+	ckpt_debug("tty %p (hup %d)\n",
+		   tty, test_bit(TTY_HUPPED, (unsigned long *) &h->flags));
+
+	/* we now have the desired tty: restore its state as per @h */
+
+	mutex_lock(&tty->termios_mutex);
+	tty->termios->c_line = h->termios.c_line;
+	tty->termios->c_iflag = h->termios.c_iflag;
+	tty->termios->c_oflag = h->termios.c_oflag;
+	tty->termios->c_cflag = h->termios.c_cflag;
+	tty->termios->c_lflag = h->termios.c_lflag;
+	memcpy(tty->termios->c_cc, h->termios.c_cc, NCC);
+	tty->winsize.ws_row = h->winsize.ws_row;
+	tty->winsize.ws_col = h->winsize.ws_col;
+	tty->winsize.ws_ypixel = h->winsize.ws_ypixel;
+	tty->winsize.ws_xpixel = h->winsize.ws_xpixel;
+	mutex_unlock(&tty->termios_mutex);
+
+	if (test_bit(TTY_HUPPED, (unsigned long *) &h->flags))
+		tty_vhangup(tty);
+	else {
+		ret = restore_tty_ldisc(ctx, tty, h);
+		if (ret < 0) {
+			tty = ERR_PTR(ret);
+			goto out;
+		}
+	}
+
+	tty_kref_get(tty);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return tty;
+}
+
+void *restore_tty(struct ckpt_ctx *ctx)
+{
+#ifdef CONFIG_UNIX98_PTYS
+	return (void *) do_restore_tty(ctx);
+#else
+	return ERR_PTR(-ENOSYS);
+#endif
+}
+#endif /* CONFIG_CHECKPOINT */
+
 /*
  * This implements the "Secure Attention Key" ---  the idea is to
  * prevent trojan horses by killing all processes associated with this
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 92a21b2..7c117fc 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -266,6 +266,10 @@ extern int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref);
 extern int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task_signal(struct ckpt_ctx *ctx);
 
+/* ttys */
+extern int checkpoint_tty(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_tty(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ac16c59..bf584cb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -83,6 +83,8 @@ enum {
 	CKPT_HDR_FILE_NAME,
 	CKPT_HDR_FILE,
 	CKPT_HDR_PIPE_BUF,
+	CKPT_HDR_TTY,
+	CKPT_HDR_TTY_LDISC,
 
 	CKPT_HDR_MM = 401,
 	CKPT_HDR_VMA,
@@ -141,6 +143,7 @@ enum obj_type {
 	CKPT_OBJ_USER,
 	CKPT_OBJ_GROUPINFO,
 	CKPT_OBJ_SOCK,
+	CKPT_OBJ_TTY,
 	CKPT_OBJ_MAX
 };
 
@@ -161,6 +164,9 @@ struct ckpt_const {
 	__u16 uts_domainname_len;
 	/* rlimit */
 	__u16 rlimit_nlimits;
+	/* tty */
+	__u16 n_tty_buf_size;
+	__u16 tty_termios_ncc;
 } __attribute__((aligned(8)));
 
 /* checkpoint image header */
@@ -368,6 +374,7 @@ enum file_type {
 	CKPT_FILE_PIPE,
 	CKPT_FILE_FIFO,
 	CKPT_FILE_SOCKET,
+	CKPT_FILE_TTY,
 	CKPT_FILE_MAX
 };
 
@@ -660,6 +667,84 @@ struct ckpt_hdr_ipc_sem {
 } __attribute__((aligned(8)));
 
 
+/* devices */
+struct ckpt_hdr_file_tty {
+	struct ckpt_hdr_file common;
+	__s32 tty_objref;
+};
+
+struct ckpt_hdr_tty {
+	struct ckpt_hdr h;
+
+	__u16 driver_type;
+	__u16 driver_subtype;
+
+	__s32 link_objref;
+	__s32 file_objref;
+	__u32 _padding;
+
+	__u32 index;
+	__u32 ldisc;
+	__u64 flags;
+
+	/* termios */
+	struct {
+		__u16 c_iflag;
+		__u16 c_oflag;
+		__u16 c_cflag;
+		__u16 c_lflag;
+		__u8 c_line;
+		__u8 c_cc[CKPT_TTY_NCC];
+	} __attribute__((aligned(8))) termios;
+
+	/* winsize */
+	struct {
+		__u16 ws_row;
+		__u16 ws_col;
+		__u16 ws_xpixel;
+		__u16 ws_ypixel;
+	} __attribute__((aligned(8))) winsize;
+} __attribute__((aligned(8)));
+
+/* cannot include <linux/tty.h> from userspace, so define: */
+#define CKPT_N_TTY_BUF_SIZE  4096
+#ifdef __KERNEL__
+#include <linux/tty.h>
+#if CKPT_N_TTY_BUF_SIZE != N_TTY_BUF_SIZE
+#error CKPT_N_TTY_BUF_SIZE size is wrong per linux/tty.h
+#endif
+#endif
+
+struct ckpt_hdr_ldisc_n_tty {
+	struct ckpt_hdr h;
+
+	__u32 column;
+	__u32 datalen;
+	__u32 canon_column;
+	__u32 canon_datalen;
+	__u32 canon_data;
+
+	__u16 minimum_to_wake;
+
+	__u8 stopped;
+	__u8 hw_stopped;
+	__u8 flow_stopped;
+	__u8 packet;
+	__u8 ctrl_status;
+	__u8 lnext;
+	__u8 erasing;
+	__u8 raw;
+	__u8 real_raw;
+	__u8 icanon;
+	__u8 closing;
+	__u8 padding[3];
+
+	__u8 read_flags[CKPT_N_TTY_BUF_SIZE / 8];
+
+	/* if @datalen > 0, buffer contents follow (next object) */
+} __attribute__((aligned(8)));
+
+
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
 
diff --git a/include/linux/tty.h b/include/linux/tty.h
index fd40561..295447b 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -471,6 +471,13 @@ extern void tty_ldisc_enable(struct tty_struct *tty);
 /* This one is for ptmx_close() */
 extern int tty_release(struct inode *inode, struct file *filp);
 
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern struct file *tty_file_restore(struct ckpt_ctx *ctx,
+				     struct ckpt_hdr_file *ptr);
+#endif
+
 /* n_tty.c */
 extern struct tty_ldisc_ops tty_ldisc_N_TTY;
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 72/80] c/r: [pty 2/2] support for pseudo terminals
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Serge Hallyn, Ingo Molnar,
	Pavel Emelyanov, Oren Laadan, Oren Laadan

This patch adds support for checkpoint and restart of pseudo terminals
(PTYs). Since PTYs are shared (pointed to by file, and signal), they
are managed via objhash.

PTYs are master/slave pairs; the code arranges for the master to
always be checkpointed first, followed by the slave. This is important
since during restart both ends are created when restoring the master.

In this patch only UNIX98 style PTYs are supported.

Currently only PTYs that are referenced by open files are handled.
Thus PTY checkpoint starts from a file, in tty_file_checkpoint(). It
first checkpoints the master and slave PTYs via checkpoint_tty(), and
then completes the saving of the file descriptor. This means that in
the image file, the order of objects is: master-tty, slave-tty,
file-desc.
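
The ordering above can be illustrated with a small userspace model.
This is only a sketch under stated assumptions: the toy_* names, the
record types and the fixed-size record array are hypothetical and not
part of the patch. It mimics the way a shared object is written once,
on first visit, so that the master record always precedes the slave
record, which in turn precedes the file record.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record kinds, in the order they may appear in the image. */
enum toy_rec { TOY_REC_MASTER, TOY_REC_SLAVE, TOY_REC_FILE };

struct toy_tty {
	int is_master;
	struct toy_tty *link;	/* the other end of the pair */
	int objref;		/* 0 means "not written to the image yet" */
};

struct toy_image {
	enum toy_rec recs[16];	/* records in the order they were written */
	size_t nrecs;
	int next_objref;	/* last objref handed out */
};

/* Write a tty once; later calls only return the objref assigned earlier. */
static int toy_checkpoint_obj(struct toy_image *img, struct toy_tty *tty)
{
	if (tty->objref)
		return tty->objref;
	assert(img->nrecs < 16);
	img->recs[img->nrecs++] = tty->is_master ? TOY_REC_MASTER
						 : TOY_REC_SLAVE;
	tty->objref = ++img->next_objref;
	return tty->objref;
}

/* Checkpoint a file that refers to either end: master, slave, then file. */
static int toy_file_checkpoint(struct toy_image *img, struct toy_tty *tty)
{
	struct toy_tty *slave = tty->is_master ? tty->link : tty;
	int master_ref = toy_checkpoint_obj(img, slave->link);
	int slave_ref = toy_checkpoint_obj(img, slave);

	assert(img->nrecs < 16);
	img->recs[img->nrecs++] = TOY_REC_FILE;
	return tty->is_master ? master_ref : slave_ref;
}
```

In this model, checkpointing a file on the slave end first still
produces master, slave, file; a second file on the master end then
adds only one more file record.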

During restart, to restore the master side, we open the /dev/ptmx
device and get a file handle. But at this point we don't know the
designated objref for this file, because the file only appears later
in the image stream. On the other hand, we can't just fput() the file
because that would close the PTY too.

Instead, when we checkpoint the master PTY, we _reserve_ an objref
for the file (which won't be further used in checkpoint). Then, at
restart, we use it to insert the file into objhash.
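
The reserve-then-insert idea can be sketched in userspace as follows.
This is a minimal model, not the kernel objhash: the toy_* names, the
fixed-size table and the -1 error returns are assumptions made for
illustration. At checkpoint an objref is handed out with no object
behind it; at restart the freshly opened file is stored under that
same objref, so the table holds the only long-lived reference.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_OBJS 32

/* Hypothetical stand-in for the objhash: a table indexed by objref. */
struct toy_objhash {
	void *objs[TOY_MAX_OBJS];	/* slot 0 is never used */
	int next_objref;		/* last objref handed out */
};

/* Checkpoint side: consume an objref without attaching any object. */
static int toy_obj_reserve(struct toy_objhash *h)
{
	if (h->next_objref + 1 >= TOY_MAX_OBJS)
		return -1;
	return ++h->next_objref;
}

/* Restart side: store an object under an objref read from the image. */
static int toy_obj_insert(struct toy_objhash *h, void *ptr, int objref)
{
	if (objref <= 0 || objref >= TOY_MAX_OBJS)
		return -1;
	if (h->objs[objref])
		return -1;		/* objref already taken */
	h->objs[objref] = ptr;
	return 0;
}

/* Restart side: look up an object by objref; NULL if absent or invalid. */
static void *toy_obj_fetch(const struct toy_objhash *h, int objref)
{
	if (objref <= 0 || objref >= TOY_MAX_OBJS)
		return NULL;
	return h->objs[objref];
}
```

Because the reserved objref is never attached to an object during
checkpoint, it cannot collide with the objref of any object that is
actually saved in the image.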

TODO:

* Better sanitize input from checkpoint image on restore
* Check the locking when saving/restoring tty_struct state
* Echo position/buffer isn't saved (is it needed ?)
* Handle multiple devpts mounts (namespaces)
* Paths of ptmx and slaves are hard coded (/dev/ptmx, /dev/pts/...)

Changelog[v4]:
  - Fix error path(s) in restore_tty_ldisc()
  - Fix memory leak in restore_tty_ldisc()
Changelog[v3]:
  - [Serge Hallyn] Set tty on error path
Changelog[v2]:
  - Don't save/restore tty->{session,pgrp}
  - Fix leak: drop file reference after ckpt_obj_insert()
  - Move get_file() inside locked clause (fix race)
Changelog[v1]:
  - Adjust include/asm/checkpoint_hdr.h for s390 architecture
  - Add NCC to kernel constants header (ckpt_hdr_const)
  - [Serge Hallyn] fix calculation of canon_datalen

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/include/asm/checkpoint_hdr.h |   11 +
 arch/x86/include/asm/checkpoint_hdr.h  |   11 +
 checkpoint/checkpoint.c                |    3 +
 checkpoint/files.c                     |    6 +
 checkpoint/objhash.c                   |   26 ++
 checkpoint/restart.c                   |    6 +
 drivers/char/pty.c                     |    1 +
 drivers/char/tty_io.c                  |  499 ++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h             |    4 +
 include/linux/checkpoint_hdr.h         |   85 ++++++
 include/linux/tty.h                    |    7 +
 11 files changed, 659 insertions(+), 0 deletions(-)

diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
index 1976355..b6ea8ce 100644
--- a/arch/s390/include/asm/checkpoint_hdr.h
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -83,13 +83,24 @@ struct ckpt_hdr_mm_context {
 };
 
 #define CKPT_ARCH_NSIG  64
+#define CKPT_TTY_NCC  8
+
+/* arch dependent constants */
 #ifdef __KERNEL__
+
 #include <asm/signal.h>
 #if CKPT_ARCH_NSIG != _SIGCONTEXT_NSIG
 #error CKPT_ARCH_NSIG size is wrong (asm/sigcontext.h and asm/checkpoint_hdr.h)
 #endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
 #endif
 
+#endif /* __KERNEL__ */
+
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 };
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 1228d1b..7a24de5 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -48,14 +48,25 @@ enum {
 	CKPT_HDR_MM_CONTEXT_LDT,
 };
 
+/* arch dependent constants */
 #define CKPT_ARCH_NSIG  64
+#define CKPT_TTY_NCC  8
+
 #ifdef __KERNEL__
+
 #include <asm/signal.h>
 #if CKPT_ARCH_NSIG != _NSIG
 #error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
 #endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
 #endif
 
+#endif /* __KERNEL__ */
+
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 	/* FIXME: add HAVE_HWFP */
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index ae79df7..dbe9e10 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -299,6 +299,9 @@ static void fill_kernel_const(struct ckpt_const *h)
 	h->uts_domainname_len = sizeof(uts->domainname);
 	/* rlimit */
 	h->rlimit_nlimits = RLIM_NLIMITS;
+	/* tty */
+	h->n_tty_buf_size = N_TTY_BUF_SIZE;
+	h->tty_termios_ncc = NCC;
 }
 
 /* write the checkpoint header */
diff --git a/checkpoint/files.c b/checkpoint/files.c
index 058bc0e..27e29a0 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -598,6 +598,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_SOCKET,
 		.restore = sock_file_restore,
 	},
+	/* tty */
+	{
+		.file_name = "TTY",
+		.file_type = CKPT_FILE_TTY,
+		.restore = tty_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 0978060..f84388d 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -269,6 +269,22 @@ static int obj_sock_users(void *ptr)
 	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
 }
 
+static int obj_tty_grab(void *ptr)
+{
+	tty_kref_get((struct tty_struct *) ptr);
+	return 0;
+}
+
+static void obj_tty_drop(void *ptr, int lastref)
+{
+	tty_kref_put((struct tty_struct *) ptr);
+}
+
+static int obj_tty_users(void *ptr)
+{
+	return atomic_read(&((struct tty_struct *) ptr)->kref.refcount);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -407,6 +423,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_sock,
 		.restore = restore_sock,
 	},
+	/* struct tty_struct */
+	{
+		.obj_name = "TTY",
+		.obj_type = CKPT_OBJ_TTY,
+		.ref_drop = obj_tty_drop,
+		.ref_grab = obj_tty_grab,
+		.ref_users = obj_tty_users,
+		.checkpoint = checkpoint_tty,
+		.restore = restore_tty,
+	},
 };
 
 
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 340698a..1016278 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -19,6 +19,7 @@
 #include <linux/freezer.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
+#include <linux/termios.h>
 #include <asm/syscall.h>
 #include <linux/elf.h>
 #include <linux/deferqueue.h>
@@ -402,6 +403,11 @@ static int check_kernel_const(struct ckpt_const *h)
 	/* rlimit */
 	if (h->rlimit_nlimits != RLIM_NLIMITS)
 		return -EINVAL;
+	/* tty */
+	if (h->n_tty_buf_size != N_TTY_BUF_SIZE)
+		return -EINVAL;
+	if (h->tty_termios_ncc != NCC)
+		return -EINVAL;
 
 	return 0;
 }
diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index e2fef99..5fb4ec5 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -15,6 +15,7 @@
 
 #include <linux/errno.h>
 #include <linux/interrupt.h>
+#include <linux/file.h>
 #include <linux/tty.h>
 #include <linux/tty_flip.h>
 #include <linux/fcntl.h>
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 7853ea2..72f4432 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -106,6 +106,7 @@
 
 #include <linux/kmod.h>
 #include <linux/nsproxy.h>
+#include <linux/checkpoint.h>
 
 #undef TTY_DEBUG_HANGUP
 
@@ -151,6 +152,13 @@ static long tty_compat_ioctl(struct file *file, unsigned int cmd,
 #define tty_compat_ioctl NULL
 #endif
 static int tty_fasync(int fd, struct file *filp, int on);
+#ifdef CONFIG_CHECKPOINT
+static int tty_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+static int tty_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#else
+#define tty_file_checkpoint NULL
+#define tty_file_collect NULL
+#endif /* CONFIG_CHECKPOINT */
 static void release_tty(struct tty_struct *tty, int idx);
 static void __proc_set_tty(struct task_struct *tsk, struct tty_struct *tty);
 static void proc_set_tty(struct task_struct *tsk, struct tty_struct *tty);
@@ -417,6 +425,8 @@ static const struct file_operations tty_fops = {
 	.open		= tty_open,
 	.release	= tty_release,
 	.fasync		= tty_fasync,
+	.checkpoint	= tty_file_checkpoint,
+	.collect	= tty_file_collect,
 };
 
 static const struct file_operations console_fops = {
@@ -439,6 +449,8 @@ static const struct file_operations hung_up_tty_fops = {
 	.unlocked_ioctl	= hung_up_tty_ioctl,
 	.compat_ioctl	= hung_up_tty_compat_ioctl,
 	.release	= tty_release,
+	.checkpoint	= tty_file_checkpoint,
+	.collect	= tty_file_collect,
 };
 
 static DEFINE_SPINLOCK(redirect_lock);
@@ -2586,6 +2598,493 @@ static long tty_compat_ioctl(struct file *file, unsigned int cmd,
 }
 #endif
 
+#ifdef CONFIG_CHECKPOINT
+static int tty_can_checkpoint(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	/* only support pty driver */
+	if (tty->driver->type != TTY_DRIVER_TYPE_PTY) {
+		ckpt_write_err(ctx, "TSP", "tty: unknown driver type %d",
+			       tty->driver, tty, tty->driver->type);
+		return 0;
+	}
+	/* only support unix98 style */
+	if (tty->driver->major != UNIX98_PTY_MASTER_MAJOR &&
+	    tty->driver->major != UNIX98_PTY_SLAVE_MAJOR) {
+		ckpt_write_err(ctx, "TP", "tty: legacy pty", tty);
+		return 0;
+	}
+	/* only support n_tty ldisc */
+	if (tty->ldisc->ops->num != N_TTY) {
+		ckpt_write_err(ctx, "TSP", "tty: unknown ldisc type %d",
+			       tty->ldisc->ops, tty, tty->ldisc->ops->num);
+		return 0;
+	}
+
+	return 1;
+}
+
+static int tty_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_tty *h;
+	struct tty_struct *tty, *real_tty;
+	struct inode *inode;
+	int master_objref, slave_objref;
+	int ret;
+
+	tty = (struct tty_struct *)file->private_data;
+	inode = file->f_path.dentry->d_inode;
+	if (tty_paranoia_check(tty, inode, "tty_file_checkpoint"))
+		return -EIO;
+
+	if (!tty_can_checkpoint(ctx, tty))
+		return -ENOSYS;
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+
+	real_tty = tty_pair_get_tty(tty);
+	ckpt_debug("tty: %p, real_tty: %p\n", tty, real_tty);
+
+	master_objref = checkpoint_obj(ctx, real_tty->link, CKPT_OBJ_TTY);
+	if (master_objref < 0)
+		return master_objref;
+	slave_objref = checkpoint_obj(ctx, real_tty, CKPT_OBJ_TTY);
+	if (slave_objref < 0)
+		return slave_objref;
+	ckpt_debug("master %p %d, slave %p %d\n",
+		   real_tty->link, master_objref, real_tty, slave_objref);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_TTY;
+	h->tty_objref = (tty == real_tty ? slave_objref : master_objref);
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (!ret)
+		ret = ckpt_write_obj(ctx, &h->common.h);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int tty_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct tty_struct *tty;
+	struct inode *inode;
+	int ret;
+
+	tty = (struct tty_struct *)file->private_data;
+	inode = file->f_path.dentry->d_inode;
+	if (tty_paranoia_check(tty, inode, "tty_collect"))
+		return -EIO;
+
+	if (!tty_can_checkpoint(ctx, tty))
+		return -ENOSYS;
+
+	ckpt_debug("collecting tty: %p\n", tty);
+	ret = ckpt_obj_collect(ctx, tty, CKPT_OBJ_TTY);
+	if (ret < 0)
+		return ret;
+
+	if (tty->driver->subtype == PTY_TYPE_MASTER) {
+		if (!tty->link) {
+			ckpt_write_err(ctx, "TP", "tty: missing link\n", tty);
+			return -EIO;
+		}
+		ckpt_debug("collecting slave tty: %p\n", tty->link);
+		ret = ckpt_obj_collect(ctx, tty->link, CKPT_OBJ_TTY);
+	}
+
+	return ret;
+}
+
+#define CKPT_LDISC_BAD   (1 << TTY_LDISC_CHANGING)
+#define CKPT_LDISC_GOOD  ((1 << TTY_LDISC_OPEN) | (1 << TTY_LDISC))
+#define CKPT_LDISC_FLAGS (CKPT_LDISC_GOOD | CKPT_LDISC_BAD)
+
+static int checkpoint_tty_ldisc(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	struct ckpt_hdr_ldisc_n_tty *h;
+	int datalen, read_tail;
+	int n, ret;
+
+	/* shouldn't reach here unless ldisc is n_tty */
+	BUG_ON(tty->ldisc->ops->num != N_TTY);
+
+	if ((tty->flags & CKPT_LDISC_FLAGS) != CKPT_LDISC_GOOD) {
+		ckpt_write_err(ctx, "TP", "tty: bad ldisc flags %#lx\n",
+			       tty, tty->flags);
+		return -EBUSY;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TTY_LDISC);
+	if (!h)
+		return -ENOMEM;
+
+	spin_lock_irq(&tty->read_lock);
+	h->column = tty->column;
+	h->datalen = tty->read_cnt;
+	h->canon_column = tty->canon_column;
+	h->canon_datalen = tty->canon_head;
+	if (tty->canon_head > tty->read_tail)
+		h->canon_datalen -= tty->read_tail;
+	else
+		h->canon_datalen += N_TTY_BUF_SIZE - tty->read_tail;
+	h->canon_data = tty->canon_data;
+
+	datalen = tty->read_cnt;
+	read_tail = tty->read_tail;
+	spin_unlock_irq(&tty->read_lock);
+
+	h->minimum_to_wake = tty->minimum_to_wake;
+
+	h->stopped = tty->stopped;
+	h->hw_stopped = tty->hw_stopped;
+	h->flow_stopped = tty->flow_stopped;
+	h->packet = tty->packet;
+	h->ctrl_status = tty->ctrl_status;
+	h->lnext = tty->lnext;
+	h->erasing = tty->erasing;
+	h->raw = tty->raw;
+	h->real_raw = tty->real_raw;
+	h->icanon = tty->icanon;
+	h->closing = tty->closing;
+
+	BUILD_BUG_ON(sizeof(h->read_flags) != sizeof(tty->read_flags));
+	memcpy(h->read_flags, tty->read_flags, sizeof(tty->read_flags));
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("datalen %d\n", datalen);
+	if (datalen) {
+		ret = ckpt_write_buffer(ctx, NULL, datalen);
+		if (ret < 0)
+			return ret;
+		n = min(datalen, N_TTY_BUF_SIZE - read_tail);
+		ret = ckpt_kwrite(ctx, &tty->read_buf[read_tail], n);
+		if (ret < 0)
+			return ret;
+		n = datalen - n;
+		ret = ckpt_kwrite(ctx, tty->read_buf, n);
+	}
+
+	return ret;
+}
+
+#define CKPT_TTY_BAD   ((1 << TTY_CLOSING) | (1 << TTY_FLUSHING))
+#define CKPT_TTY_GOOD  0
+
+static int do_checkpoint_tty(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	struct ckpt_hdr_tty *h;
+	int link_objref;
+	int master = 0;
+	int ret;
+
+	if ((tty->flags & (CKPT_TTY_BAD | CKPT_TTY_GOOD)) != CKPT_TTY_GOOD) {
+		ckpt_write_err(ctx, "TP", "tty: bad flags %#lx\n",
+			       tty, tty->flags);
+		return -EBUSY;
+	}
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+	link_objref = ckpt_obj_lookup(ctx, tty->link, CKPT_OBJ_TTY);
+
+	if (tty->driver->subtype == PTY_TYPE_MASTER)
+		master = 1;
+
+	/* tty is master if-and-only-if link_objref is zero */
+	BUG_ON(master ^ !link_objref);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TTY);
+	if (!h)
+		return -ENOMEM;
+
+	h->driver_type = tty->driver->type;
+	h->driver_subtype = tty->driver->subtype;
+
+	h->link_objref = link_objref;
+
+	/* if master, reserve an objref (see do_restore_tty) */
+	h->file_objref = (master ? ckpt_obj_reserve(ctx) : 0);
+	ckpt_debug("link %d file %d\n", h->link_objref, h->file_objref);
+
+	h->index = tty->index;
+	h->ldisc = tty->ldisc->ops->num;
+	h->flags = tty->flags;
+
+	mutex_lock(&tty->termios_mutex);
+	h->termios.c_line = tty->termios->c_line;
+	h->termios.c_iflag = tty->termios->c_iflag;
+	h->termios.c_oflag = tty->termios->c_oflag;
+	h->termios.c_cflag = tty->termios->c_cflag;
+	h->termios.c_lflag = tty->termios->c_lflag;
+	memcpy(h->termios.c_cc, tty->termios->c_cc, NCC);
+	h->winsize.ws_row = tty->winsize.ws_row;
+	h->winsize.ws_col = tty->winsize.ws_col;
+	h->winsize.ws_ypixel = tty->winsize.ws_ypixel;
+	h->winsize.ws_xpixel = tty->winsize.ws_xpixel;
+	mutex_unlock(&tty->termios_mutex);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	/* save line discipline data (also writes buffer) */
+	if (!test_bit(TTY_HUPPED, &tty->flags))
+		ret = checkpoint_tty_ldisc(ctx, tty);
+
+	return ret;
+}
+
+int checkpoint_tty(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_tty(ctx, (struct tty_struct *) ptr);
+}
+
+struct file *tty_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_tty *h = (struct ckpt_hdr_file_tty *) ptr;
+	struct tty_struct *tty;
+	struct file *file;
+	char slavepath[32];	/* "/dev/pts/###" */
+	int slavelock;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_TTY)
+		return ERR_PTR(-EINVAL);
+
+	if (h->tty_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	tty = ckpt_obj_fetch(ctx, h->tty_objref, CKPT_OBJ_TTY);
+	ckpt_debug("tty %p objref %d\n", tty, h->tty_objref);
+
+	/* at this point the tty should have been restored already */
+	if (IS_ERR(tty))
+		return (struct file *) tty;
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+
+	/*
+	 * If this tty is master, get the corresponding file from
+	 * tty->tty_files. Otherwise, open the slave device.
+	 */
+	if (tty->driver->subtype == PTY_TYPE_MASTER) {
+		file_list_lock();
+		file = list_first_entry(&tty->tty_files,
+					typeof(*file), f_u.fu_list);
+		get_file(file);
+		file_list_unlock();
+		ckpt_debug("master file %p\n", file);
+	} else {
+		sprintf(slavepath, "/dev/pts/%d", tty->index);
+		slavelock = test_bit(TTY_PTY_LOCK, &tty->link->flags);
+		clear_bit(TTY_PTY_LOCK, &tty->link->flags);
+		file = filp_open(slavepath, O_RDWR | O_NOCTTY, 0);
+		ckpt_debug("slave file %p (index %d)\n", file, tty->index);
+		if (slavelock)
+			set_bit(TTY_PTY_LOCK, &tty->link->flags);
+		if (IS_ERR(file))
+			return file;
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
+
+static int restore_tty_ldisc(struct ckpt_ctx *ctx,
+			     struct tty_struct *tty,
+			     struct ckpt_hdr_tty *hh)
+{
+	struct ckpt_hdr_ldisc_n_tty *h;
+	int ret = -EINVAL;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TTY_LDISC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* this is an unfair shortcut, because we know ldisc is n_tty */
+	if (hh->ldisc != N_TTY)
+		goto out;
+	if ((hh->flags & CKPT_LDISC_FLAGS) != CKPT_LDISC_GOOD)
+		goto out;
+
+	if (h->datalen > N_TTY_BUF_SIZE)
+		goto out;
+	if (h->canon_datalen > N_TTY_BUF_SIZE)
+		goto out;
+
+	if (h->datalen) {
+		ret = _ckpt_read_buffer(ctx, tty->read_buf, h->datalen);
+		if (ret < 0)
+			goto out;
+	}
+
+	/* TODO: sanitize all these values ? */
+
+	spin_lock_irq(&tty->read_lock);
+	tty->column = h->column;
+	tty->read_cnt = h->datalen;
+	tty->read_head = h->datalen;
+	tty->read_tail = 0;
+	tty->canon_column = h->canon_column;
+	tty->canon_head = h->canon_datalen;
+	tty->canon_data = h->canon_data;
+	spin_unlock_irq(&tty->read_lock);
+
+	tty->minimum_to_wake = h->minimum_to_wake;
+
+	tty->stopped = h->stopped;
+	tty->hw_stopped = h->hw_stopped;
+	tty->flow_stopped = h->flow_stopped;
+	tty->packet = h->packet;
+	tty->ctrl_status = h->ctrl_status;
+	tty->lnext = h->lnext;
+	tty->erasing = h->erasing;
+	tty->raw = h->raw;
+	tty->real_raw = h->real_raw;
+	tty->icanon = h->icanon;
+	tty->closing = h->closing;
+	BUILD_BUG_ON(sizeof(h->read_flags) != sizeof(tty->read_flags));
+	memcpy(tty->read_flags, h->read_flags, sizeof(tty->read_flags));
+	ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#define CKPT_PTMX_PATH  "/dev/ptmx"
+
+static struct tty_struct *do_restore_tty(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tty *h;
+	struct tty_struct *tty = ERR_PTR(-EINVAL);
+	struct file *file = NULL;
+	int master, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TTY);
+	if (IS_ERR(h))
+		return (struct tty_struct *) h;
+
+	if (h->driver_type != TTY_DRIVER_TYPE_PTY)
+		goto out;
+	if (h->driver_subtype == PTY_TYPE_MASTER)
+		master = 1;
+	else if (h->driver_subtype == PTY_TYPE_SLAVE)
+		master = 0;
+	else
+		goto out;
+	/* @link_objref is positive if-and-only-if tty is not master */
+	if (h->link_objref < 0 || (master ^ !h->link_objref))
+		goto out;
+	/* @file_objref is positive if-and-only-if tty is master */
+	if (h->file_objref < 0 || (master ^ !!h->file_objref))
+		goto out;
+	if (h->flags & CKPT_TTY_BAD)
+		goto out;
+	/* hung-up tty cannot be master, or have session or pgrp */
+	if (test_bit(TTY_HUPPED, (unsigned long *) &h->flags) && master)
+		goto out;
+
+	ckpt_debug("sanity checks passed, link %d\n", h->link_objref);
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+	if (master) {
+		file = pty_open_by_index(CKPT_PTMX_PATH, h->index);
+		if (IS_ERR(file)) {
+			ckpt_write_err(ctx, "TE", "open ptmx", PTR_ERR(file));
+			tty = ERR_PTR(PTR_ERR(file));
+			goto out;
+		}
+
+		/*
+		 * Add file to objhash to ensure proper cleanup later
+		 * (it isn't referenced elsewhere). Use h->file_objref
+		 * which was explicitly reserved during checkpoint for this.
+		 */
+		ret = ckpt_obj_insert(ctx, file, h->file_objref, CKPT_OBJ_FILE);
+		fput(file);  /* even on success (referenced in objhash) */
+		if (ret < 0) {
+			tty = ERR_PTR(ret);
+			goto out;
+		}
+
+		tty = file->private_data;
+	} else {
+		tty = ckpt_obj_fetch(ctx, h->link_objref, CKPT_OBJ_TTY);
+		if (IS_ERR(tty))
+			goto out;
+		tty = tty->link;
+	}
+
+	ckpt_debug("tty %p (hup %d)\n",
+		   tty, test_bit(TTY_HUPPED, (unsigned long *) &h->flags));
+
+	/* we now have the desired tty: restore its state as per @h */
+
+	mutex_lock(&tty->termios_mutex);
+	tty->termios->c_line = h->termios.c_line;
+	tty->termios->c_iflag = h->termios.c_iflag;
+	tty->termios->c_oflag = h->termios.c_oflag;
+	tty->termios->c_cflag = h->termios.c_cflag;
+	tty->termios->c_lflag = h->termios.c_lflag;
+	memcpy(tty->termios->c_cc, h->termios.c_cc, NCC);
+	tty->winsize.ws_row = h->winsize.ws_row;
+	tty->winsize.ws_col = h->winsize.ws_col;
+	tty->winsize.ws_ypixel = h->winsize.ws_ypixel;
+	tty->winsize.ws_xpixel = h->winsize.ws_xpixel;
+	mutex_unlock(&tty->termios_mutex);
+
+	if (test_bit(TTY_HUPPED, (unsigned long *) &h->flags))
+		tty_vhangup(tty);
+	else {
+		ret = restore_tty_ldisc(ctx, tty, h);
+		if (ret < 0) {
+			tty = ERR_PTR(ret);
+			goto out;
+		}
+	}
+
+	tty_kref_get(tty);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return tty;
+}
+
+void *restore_tty(struct ckpt_ctx *ctx)
+{
+#ifdef CONFIG_UNIX98_PTYS
+	return (void *) do_restore_tty(ctx);
+#else
+	return ERR_PTR(-ENOSYS);
+#endif
+}
+#endif /* CONFIG_CHECKPOINT */
+
 /*
  * This implements the "Secure Attention Key" ---  the idea is to
  * prevent trojan horses by killing all processes associated with this
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 92a21b2..7c117fc 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -266,6 +266,10 @@ extern int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref);
 extern int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task_signal(struct ckpt_ctx *ctx);
 
+/* ttys */
+extern int checkpoint_tty(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_tty(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ac16c59..bf584cb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -83,6 +83,8 @@ enum {
 	CKPT_HDR_FILE_NAME,
 	CKPT_HDR_FILE,
 	CKPT_HDR_PIPE_BUF,
+	CKPT_HDR_TTY,
+	CKPT_HDR_TTY_LDISC,
 
 	CKPT_HDR_MM = 401,
 	CKPT_HDR_VMA,
@@ -141,6 +143,7 @@ enum obj_type {
 	CKPT_OBJ_USER,
 	CKPT_OBJ_GROUPINFO,
 	CKPT_OBJ_SOCK,
+	CKPT_OBJ_TTY,
 	CKPT_OBJ_MAX
 };
 
@@ -161,6 +164,9 @@ struct ckpt_const {
 	__u16 uts_domainname_len;
 	/* rlimit */
 	__u16 rlimit_nlimits;
+	/* tty */
+	__u16 n_tty_buf_size;
+	__u16 tty_termios_ncc;
 } __attribute__((aligned(8)));
 
 /* checkpoint image header */
@@ -368,6 +374,7 @@ enum file_type {
 	CKPT_FILE_PIPE,
 	CKPT_FILE_FIFO,
 	CKPT_FILE_SOCKET,
+	CKPT_FILE_TTY,
 	CKPT_FILE_MAX
 };
 
@@ -660,6 +667,84 @@ struct ckpt_hdr_ipc_sem {
 } __attribute__((aligned(8)));
 
 
+/* devices */
+struct ckpt_hdr_file_tty {
+	struct ckpt_hdr_file common;
+	__s32 tty_objref;
+};
+
+struct ckpt_hdr_tty {
+	struct ckpt_hdr h;
+
+	__u16 driver_type;
+	__u16 driver_subtype;
+
+	__s32 link_objref;
+	__s32 file_objref;
+	__u32 _padding;
+
+	__u32 index;
+	__u32 ldisc;
+	__u64 flags;
+
+	/* termios */
+	struct {
+		__u16 c_iflag;
+		__u16 c_oflag;
+		__u16 c_cflag;
+		__u16 c_lflag;
+		__u8 c_line;
+		__u8 c_cc[CKPT_TTY_NCC];
+	} __attribute__((aligned(8))) termios;
+
+	/* winsize */
+	struct {
+		__u16 ws_row;
+		__u16 ws_col;
+		__u16 ws_xpixel;
+		__u16 ws_ypixel;
+	} __attribute__((aligned(8))) winsize;
+} __attribute__((aligned(8)));
+
+/* cannot include <linux/tty.h> from userspace, so define: */
+#define CKPT_N_TTY_BUF_SIZE  4096
+#ifdef __KERNEL__
+#include <linux/tty.h>
+#if CKPT_N_TTY_BUF_SIZE != N_TTY_BUF_SIZE
+#error CKPT_N_TTY_BUF_SIZE size is wrong per linux/tty.h
+#endif
+#endif
+
+struct ckpt_hdr_ldisc_n_tty {
+	struct ckpt_hdr h;
+
+	__u32 column;
+	__u32 datalen;
+	__u32 canon_column;
+	__u32 canon_datalen;
+	__u32 canon_data;
+
+	__u16 minimum_to_wake;
+
+	__u8 stopped;
+	__u8 hw_stopped;
+	__u8 flow_stopped;
+	__u8 packet;
+	__u8 ctrl_status;
+	__u8 lnext;
+	__u8 erasing;
+	__u8 raw;
+	__u8 real_raw;
+	__u8 icanon;
+	__u8 closing;
+	__u8 padding[3];
+
+	__u8 read_flags[CKPT_N_TTY_BUF_SIZE / 8];
+
+	/* if @datalen > 0, buffer contents follow (next object) */
+} __attribute__((aligned(8)));
+
+
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
 
diff --git a/include/linux/tty.h b/include/linux/tty.h
index fd40561..295447b 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -471,6 +471,13 @@ extern void tty_ldisc_enable(struct tty_struct *tty);
 /* This one is for ptmx_close() */
 extern int tty_release(struct inode *inode, struct file *filp);
 
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern struct file *tty_file_restore(struct ckpt_ctx *ctx,
+				     struct ckpt_hdr_file *ptr);
+#endif
+
 /* n_tty.c */
 extern struct tty_ldisc_ops tty_ldisc_N_TTY;
 
-- 
1.6.0.4



* [PATCH v18 72/80] c/r: [pty 2/2] support for pseudo terminals
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

This patch adds support for checkpoint and restart of pseudo terminals
(PTYs). Since PTYs are shared (pointed to by both files and signal
structs), they are managed via objhash.

PTYs are master/slave pairs; the code arranges for the master to
always be checkpointed first, followed by the slave. This is important
since during restart both ends are created when restoring the master.

In this patch only UNIX98 style PTYs are supported.

Currently only PTYs that are referenced by open files are handled.
Thus PTY checkpoint starts with a file in tty_file_checkpoint(). It
first checkpoints the master and slave PTYs via checkpoint_tty(),
and then completes the saving of the file descriptor. This means that
in the image file, the order of objects is: master-tty, slave-tty,
file-desc.
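
The ordering can be illustrated with a small userspace sketch. It is
not the patch's code: toy_checkpoint_obj() and toy_file_checkpoint()
are hypothetical stand-ins for checkpoint_obj() and
tty_file_checkpoint(), and the sketch assumes that a shared object is
written to the image only the first time it is referenced.

```c
#include <assert.h>
#include <string.h>

enum toy_kind { TOY_MASTER, TOY_SLAVE };

/* Hypothetical stand-in for a tty_struct tracked by the objhash. */
struct toy_tty {
	enum toy_kind kind;
	int objref;		/* 0 until the object has been written */
};

/* Hypothetical image: a log of record names, in the order written. */
struct toy_image {
	char log[128];
	int next_ref;		/* next objref to hand out, starts at 1 */
};

/* Append one record name to the log, never overrunning the buffer. */
static void toy_emit(struct toy_image *img, const char *name)
{
	size_t used = strlen(img->log);

	assert(used < sizeof(img->log));
	strncat(img->log, name, sizeof(img->log) - used - 1);
}

/* Write a shared object once; later calls only return its objref. */
static int toy_checkpoint_obj(struct toy_image *img, struct toy_tty *tty)
{
	if (!tty->objref) {
		tty->objref = img->next_ref++;
		toy_emit(img, tty->kind == TOY_MASTER ?
			 "master-tty " : "slave-tty ");
	}
	return tty->objref;
}

/* Master first, then slave, then the file record that refers to one. */
static int toy_file_checkpoint(struct toy_image *img, struct toy_tty *master,
			       struct toy_tty *slave, int file_is_master)
{
	int master_ref = toy_checkpoint_obj(img, master);
	int slave_ref = toy_checkpoint_obj(img, slave);

	toy_emit(img, "file-desc ");
	return file_is_master ? master_ref : slave_ref;
}
```

In this sketch, checkpointing a file for the master end and then a
file for the slave end of the same pair writes master-tty, slave-tty,
file-desc, file-desc: the pair is written once, and each file record
carries the objref of its own end.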

During restart, to restore the master side, we open the /dev/ptmx
device and get a file handle. But at this point we don't know the
designated objref for this file, because the file is due later on in
the image stream. On the other hand, we can't just fput() the file
because it will close the PTY too.

Instead, when we checkpoint the master PTY, we _reserve_ an objref
for the file (which won't be further used in checkpoint). Then, at
restart, we use it to insert the file into objhash.
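
A minimal userspace sketch of this reserve-then-insert idea follows.
It is only an illustration under stated assumptions: struct
toy_objhash and the toy_obj_*() helpers are hypothetical, array-backed
stand-ins for the objhash and for ckpt_obj_reserve(),
ckpt_obj_insert() and ckpt_obj_fetch(), and they ignore reference
counting entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_OBJS 16

/* Hypothetical, array-backed stand-in for the objhash. */
struct toy_objhash {
	void *obj[TOY_MAX_OBJS];	/* slot 0 unused: objref 0 means "none" */
	int next_ref;
};

static void toy_init(struct toy_objhash *h)
{
	int i;

	assert(h != NULL);
	for (i = 0; i < TOY_MAX_OBJS; i++)
		h->obj[i] = NULL;
	h->next_ref = 1;
}

/* Checkpoint side: hand out an objref with no object attached to it. */
static int toy_obj_reserve(struct toy_objhash *h)
{
	if (h->next_ref >= TOY_MAX_OBJS)
		return -1;
	return h->next_ref++;
}

/* Restart side: bind an object to the objref recorded in the image. */
static int toy_obj_insert(struct toy_objhash *h, void *ptr, int objref)
{
	if (objref <= 0 || objref >= TOY_MAX_OBJS || h->obj[objref])
		return -1;
	h->obj[objref] = ptr;
	return 0;
}

/* Look up the object bound to an objref, or NULL if there is none. */
static void *toy_obj_fetch(const struct toy_objhash *h, int objref)
{
	if (objref <= 0 || objref >= TOY_MAX_OBJS)
		return NULL;
	return h->obj[objref];
}
```

At checkpoint, the reserved number is simply stored in the master's
record. At restart, the freshly opened master file is inserted under
that number, so the table holds it until the real file record shows up
later in the stream.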

TODO:

* Better sanitize input from checkpoint image on restore
* Check the locking when saving/restoring tty_struct state
* Echo position/buffer isn't saved (is it needed?)
* Handle multiple devpts mounts (namespaces)
* Paths of ptmx and slaves are hard coded (/dev/ptmx, /dev/pts/...)

Changelog[v4]:
  - Fix error path(s) in restore_tty_ldisc()
  - Fix memory leak in restore_tty_ldisc()
Changelog[v3]:
  - [Serge Hallyn] Set tty on error path
Changelog[v2]:
  - Don't save/restore tty->{session,pgrp}
  - Fix leak: drop file reference after ckpt_obj_insert()
  - Move get_file() inside locked clause (fix race)
Changelog[v1]:
  - Adjust include/asm/checkpoint_hdr.h for s390 architecture
  - Add NCC to kernel constants header (ckpt_hdr_const)
  - [Serge Hallyn] fix calculation of canon_datalen

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 arch/s390/include/asm/checkpoint_hdr.h |   11 +
 arch/x86/include/asm/checkpoint_hdr.h  |   11 +
 checkpoint/checkpoint.c                |    3 +
 checkpoint/files.c                     |    6 +
 checkpoint/objhash.c                   |   26 ++
 checkpoint/restart.c                   |    6 +
 drivers/char/pty.c                     |    1 +
 drivers/char/tty_io.c                  |  499 ++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h             |    4 +
 include/linux/checkpoint_hdr.h         |   85 ++++++
 include/linux/tty.h                    |    7 +
 11 files changed, 659 insertions(+), 0 deletions(-)

diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
index 1976355..b6ea8ce 100644
--- a/arch/s390/include/asm/checkpoint_hdr.h
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -83,13 +83,24 @@ struct ckpt_hdr_mm_context {
 };
 
 #define CKPT_ARCH_NSIG  64
+#define CKPT_TTY_NCC  8
+
+/* arch dependent constants */
 #ifdef __KERNEL__
+
 #include <asm/signal.h>
 #if CKPT_ARCH_NSIG != _SIGCONTEXT_NSIG
 #error CKPT_ARCH_NSIG size is wrong (asm/sigcontext.h and asm/checkpoint_hdr.h)
 #endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
 #endif
 
+#endif /* __KERNEL__ */
+
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 };
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 1228d1b..7a24de5 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -48,14 +48,25 @@ enum {
 	CKPT_HDR_MM_CONTEXT_LDT,
 };
 
+/* arch dependent constants */
 #define CKPT_ARCH_NSIG  64
+#define CKPT_TTY_NCC  8
+
 #ifdef __KERNEL__
+
 #include <asm/signal.h>
 #if CKPT_ARCH_NSIG != _NSIG
 #error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
 #endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
 #endif
 
+#endif /* __KERNEL__ */
+
+
 struct ckpt_hdr_header_arch {
 	struct ckpt_hdr h;
 	/* FIXME: add HAVE_HWFP */
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index ae79df7..dbe9e10 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -299,6 +299,9 @@ static void fill_kernel_const(struct ckpt_const *h)
 	h->uts_domainname_len = sizeof(uts->domainname);
 	/* rlimit */
 	h->rlimit_nlimits = RLIM_NLIMITS;
+	/* tty */
+	h->n_tty_buf_size = N_TTY_BUF_SIZE;
+	h->tty_termios_ncc = NCC;
 }
 
 /* write the checkpoint header */
diff --git a/checkpoint/files.c b/checkpoint/files.c
index 058bc0e..27e29a0 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -598,6 +598,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_SOCKET,
 		.restore = sock_file_restore,
 	},
+	/* tty */
+	{
+		.file_name = "TTY",
+		.file_type = CKPT_FILE_TTY,
+		.restore = tty_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 0978060..f84388d 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -269,6 +269,22 @@ static int obj_sock_users(void *ptr)
 	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
 }
 
+static int obj_tty_grab(void *ptr)
+{
+	tty_kref_get((struct tty_struct *) ptr);
+	return 0;
+}
+
+static void obj_tty_drop(void *ptr, int lastref)
+{
+	tty_kref_put((struct tty_struct *) ptr);
+}
+
+static int obj_tty_users(void *ptr)
+{
+	return atomic_read(&((struct tty_struct *) ptr)->kref.refcount);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -407,6 +423,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_sock,
 		.restore = restore_sock,
 	},
+	/* struct tty_struct */
+	{
+		.obj_name = "TTY",
+		.obj_type = CKPT_OBJ_TTY,
+		.ref_drop = obj_tty_drop,
+		.ref_grab = obj_tty_grab,
+		.ref_users = obj_tty_users,
+		.checkpoint = checkpoint_tty,
+		.restore = restore_tty,
+	},
 };
 
 
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 340698a..1016278 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -19,6 +19,7 @@
 #include <linux/freezer.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
+#include <linux/termios.h>
 #include <asm/syscall.h>
 #include <linux/elf.h>
 #include <linux/deferqueue.h>
@@ -402,6 +403,11 @@ static int check_kernel_const(struct ckpt_const *h)
 	/* rlimit */
 	if (h->rlimit_nlimits != RLIM_NLIMITS)
 		return -EINVAL;
+	/* tty */
+	if (h->n_tty_buf_size != N_TTY_BUF_SIZE)
+		return -EINVAL;
+	if (h->tty_termios_ncc != NCC)
+		return -EINVAL;
 
 	return 0;
 }
diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index e2fef99..5fb4ec5 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -15,6 +15,7 @@
 
 #include <linux/errno.h>
 #include <linux/interrupt.h>
+#include <linux/file.h>
 #include <linux/tty.h>
 #include <linux/tty_flip.h>
 #include <linux/fcntl.h>
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 7853ea2..72f4432 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -106,6 +106,7 @@
 
 #include <linux/kmod.h>
 #include <linux/nsproxy.h>
+#include <linux/checkpoint.h>
 
 #undef TTY_DEBUG_HANGUP
 
@@ -151,6 +152,13 @@ static long tty_compat_ioctl(struct file *file, unsigned int cmd,
 #define tty_compat_ioctl NULL
 #endif
 static int tty_fasync(int fd, struct file *filp, int on);
+#ifdef CONFIG_CHECKPOINT
+static int tty_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+static int tty_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#else
+#define tty_file_checkpoint NULL
+#define tty_file_collect NULL
+#endif /* CONFIG_CHECKPOINT */
 static void release_tty(struct tty_struct *tty, int idx);
 static void __proc_set_tty(struct task_struct *tsk, struct tty_struct *tty);
 static void proc_set_tty(struct task_struct *tsk, struct tty_struct *tty);
@@ -417,6 +425,8 @@ static const struct file_operations tty_fops = {
 	.open		= tty_open,
 	.release	= tty_release,
 	.fasync		= tty_fasync,
+	.checkpoint	= tty_file_checkpoint,
+	.collect	= tty_file_collect,
 };
 
 static const struct file_operations console_fops = {
@@ -439,6 +449,8 @@ static const struct file_operations hung_up_tty_fops = {
 	.unlocked_ioctl	= hung_up_tty_ioctl,
 	.compat_ioctl	= hung_up_tty_compat_ioctl,
 	.release	= tty_release,
+	.checkpoint	= tty_file_checkpoint,
+	.collect	= tty_file_collect,
 };
 
 static DEFINE_SPINLOCK(redirect_lock);
@@ -2586,6 +2598,493 @@ static long tty_compat_ioctl(struct file *file, unsigned int cmd,
 }
 #endif
 
+#ifdef CONFIG_CHECKPOINT
+static int tty_can_checkpoint(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	/* only support pty driver */
+	if (tty->driver->type != TTY_DRIVER_TYPE_PTY) {
+		ckpt_write_err(ctx, "TSP", "tty: unknown driver type %d",
+			       tty->driver, tty, tty->driver->type);
+		return 0;
+	}
+	/* only support unix98 style */
+	if (tty->driver->major != UNIX98_PTY_MASTER_MAJOR &&
+	    tty->driver->major != UNIX98_PTY_SLAVE_MAJOR) {
+		ckpt_write_err(ctx, "TP", "tty: legacy pty", tty);
+		return 0;
+	}
+	/* only support n_tty ldisc */
+	if (tty->ldisc->ops->num != N_TTY) {
+		ckpt_write_err(ctx, "TSP", "tty: unknown ldisc type %d",
+			       tty->ldisc->ops, tty, tty->ldisc->ops->num);
+		return 0;
+	}
+
+	return 1;
+}
+
+static int tty_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_tty *h;
+	struct tty_struct *tty, *real_tty;
+	struct inode *inode;
+	int master_objref, slave_objref;
+	int ret;
+
+	tty = (struct tty_struct *)file->private_data;
+	inode = file->f_path.dentry->d_inode;
+	if (tty_paranoia_check(tty, inode, "tty_file_checkpoint"))
+		return -EIO;
+
+	if (!tty_can_checkpoint(ctx, tty))
+		return -ENOSYS;
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+
+	real_tty = tty_pair_get_tty(tty);
+	ckpt_debug("tty: %p, real_tty: %p\n", tty, real_tty);
+
+	master_objref = checkpoint_obj(ctx, real_tty->link, CKPT_OBJ_TTY);
+	if (master_objref < 0)
+		return master_objref;
+	slave_objref = checkpoint_obj(ctx, real_tty, CKPT_OBJ_TTY);
+	if (slave_objref < 0)
+		return slave_objref;
+	ckpt_debug("master %p %d, slave %p %d\n",
+		   real_tty->link, master_objref, real_tty, slave_objref);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_TTY;
+	h->tty_objref = (tty == real_tty ? slave_objref : master_objref);
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (!ret)
+		ret = ckpt_write_obj(ctx, &h->common.h);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int tty_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct tty_struct *tty;
+	struct inode *inode;
+	int ret;
+
+	tty = (struct tty_struct *)file->private_data;
+	inode = file->f_path.dentry->d_inode;
+	if (tty_paranoia_check(tty, inode, "tty_file_collect"))
+		return -EIO;
+
+	if (!tty_can_checkpoint(ctx, tty))
+		return -ENOSYS;
+
+	ckpt_debug("collecting tty: %p\n", tty);
+	ret = ckpt_obj_collect(ctx, tty, CKPT_OBJ_TTY);
+	if (ret < 0)
+		return ret;
+
+	if (tty->driver->subtype == PTY_TYPE_MASTER) {
+		if (!tty->link) {
+			ckpt_write_err(ctx, "TP", "tty: missing link\n", tty);
+			return -EIO;
+		}
+		ckpt_debug("collecting slave tty: %p\n", tty->link);
+		ret = ckpt_obj_collect(ctx, tty->link, CKPT_OBJ_TTY);
+	}
+
+	return ret;
+}
+
+#define CKPT_LDISC_BAD   (1 << TTY_LDISC_CHANGING)
+#define CKPT_LDISC_GOOD  ((1 << TTY_LDISC_OPEN) | (1 << TTY_LDISC))
+#define CKPT_LDISC_FLAGS (CKPT_LDISC_GOOD | CKPT_LDISC_BAD)
+
+static int checkpoint_tty_ldisc(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	struct ckpt_hdr_ldisc_n_tty *h;
+	int datalen, read_tail;
+	int n, ret;
+
+	/* shouldn't reach here unless ldisc is n_tty */
+	BUG_ON(tty->ldisc->ops->num != N_TTY);
+
+	if ((tty->flags & CKPT_LDISC_FLAGS) != CKPT_LDISC_GOOD) {
+		ckpt_write_err(ctx, "TP", "tty: bad ldisc flags %#lx\n",
+			       tty, tty->flags);
+		return -EBUSY;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TTY_LDISC);
+	if (!h)
+		return -ENOMEM;
+
+	spin_lock_irq(&tty->read_lock);
+	h->column = tty->column;
+	h->datalen = tty->read_cnt;
+	h->canon_column = tty->canon_column;
+	h->canon_datalen = tty->canon_head;
+	if (tty->canon_head > tty->read_tail)
+		h->canon_datalen -= tty->read_tail;
+	else
+		h->canon_datalen += N_TTY_BUF_SIZE - tty->read_tail;
+	h->canon_data = tty->canon_data;
+
+	datalen = tty->read_cnt;
+	read_tail = tty->read_tail;
+	spin_unlock_irq(&tty->read_lock);
+
+	h->minimum_to_wake = tty->minimum_to_wake;
+
+	h->stopped = tty->stopped;
+	h->hw_stopped = tty->hw_stopped;
+	h->flow_stopped = tty->flow_stopped;
+	h->packet = tty->packet;
+	h->ctrl_status = tty->ctrl_status;
+	h->lnext = tty->lnext;
+	h->erasing = tty->erasing;
+	h->raw = tty->raw;
+	h->real_raw = tty->real_raw;
+	h->icanon = tty->icanon;
+	h->closing = tty->closing;
+
+	BUILD_BUG_ON(sizeof(h->read_flags) != sizeof(tty->read_flags));
+	memcpy(h->read_flags, tty->read_flags, sizeof(tty->read_flags));
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	ckpt_debug("datalen %d\n", datalen);
+	if (datalen) {
+		ret = ckpt_write_buffer(ctx, NULL, datalen);
+		if (ret < 0)
+			return ret;
+		n = min(datalen, N_TTY_BUF_SIZE - read_tail);
+		ret = ckpt_kwrite(ctx, &tty->read_buf[read_tail], n);
+		if (ret < 0)
+			return ret;
+		n = datalen - n;
+		ret = ckpt_kwrite(ctx, tty->read_buf, n);
+	}
+
+	return ret;
+}
+
+#define CKPT_TTY_BAD   ((1 << TTY_CLOSING) | (1 << TTY_FLUSHING))
+#define CKPT_TTY_GOOD  0
+
+static int do_checkpoint_tty(struct ckpt_ctx *ctx, struct tty_struct *tty)
+{
+	struct ckpt_hdr_tty *h;
+	int link_objref;
+	int master = 0;
+	int ret;
+
+	if ((tty->flags & (CKPT_TTY_BAD | CKPT_TTY_GOOD)) != CKPT_TTY_GOOD) {
+		ckpt_write_err(ctx, "TP", "tty: bad flags %#lx\n",
+			       tty, tty->flags);
+		return -EBUSY;
+	}
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+	link_objref = ckpt_obj_lookup(ctx, tty->link, CKPT_OBJ_TTY);
+
+	if (tty->driver->subtype == PTY_TYPE_MASTER)
+		master = 1;
+
+	/* tty is master if-and-only-if link_objref is zero */
+	BUG_ON(master ^ !link_objref);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TTY);
+	if (!h)
+		return -ENOMEM;
+
+	h->driver_type = tty->driver->type;
+	h->driver_subtype = tty->driver->subtype;
+
+	h->link_objref = link_objref;
+
+	/* if master, reserve an objref (see do_restore_tty) */
+	h->file_objref = (master ? ckpt_obj_reserve(ctx) : 0);
+	ckpt_debug("link %d file %d\n", h->link_objref, h->file_objref);
+
+	h->index = tty->index;
+	h->ldisc = tty->ldisc->ops->num;
+	h->flags = tty->flags;
+
+	mutex_lock(&tty->termios_mutex);
+	h->termios.c_line = tty->termios->c_line;
+	h->termios.c_iflag = tty->termios->c_iflag;
+	h->termios.c_oflag = tty->termios->c_oflag;
+	h->termios.c_cflag = tty->termios->c_cflag;
+	h->termios.c_lflag = tty->termios->c_lflag;
+	memcpy(h->termios.c_cc, tty->termios->c_cc, NCC);
+	h->winsize.ws_row = tty->winsize.ws_row;
+	h->winsize.ws_col = tty->winsize.ws_col;
+	h->winsize.ws_ypixel = tty->winsize.ws_ypixel;
+	h->winsize.ws_xpixel = tty->winsize.ws_xpixel;
+	mutex_unlock(&tty->termios_mutex);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		return ret;
+
+	/* save line discipline data (also writes buffer) */
+	if (!test_bit(TTY_HUPPED, &tty->flags))
+		ret = checkpoint_tty_ldisc(ctx, tty);
+
+	return ret;
+}
+
+int checkpoint_tty(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_tty(ctx, (struct tty_struct *) ptr);
+}
+
+struct file *tty_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_tty *h = (struct ckpt_hdr_file_tty *) ptr;
+	struct tty_struct *tty;
+	struct file *file;
+	char slavepath[32];	/* "/dev/pts/###" */
+	int slavelock;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_TTY)
+		return ERR_PTR(-EINVAL);
+
+	if (h->tty_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	tty = ckpt_obj_fetch(ctx, h->tty_objref, CKPT_OBJ_TTY);
+	ckpt_debug("tty %p objref %d\n", tty, h->tty_objref);
+
+	/* at this point the tty should have been restored already */
+	if (IS_ERR(tty))
+		return (struct file *) tty;
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+
+	/*
+	 * If this tty is master, get the corresponding file from
+	 * tty->tty_files. Otherwise, open the slave device.
+	 */
+	if (tty->driver->subtype == PTY_TYPE_MASTER) {
+		file_list_lock();
+		file = list_first_entry(&tty->tty_files,
+					typeof(*file), f_u.fu_list);
+		get_file(file);
+		file_list_unlock();
+		ckpt_debug("master file %p\n", file);
+	} else {
+		sprintf(slavepath, "/dev/pts/%d", tty->index);
+		slavelock = test_bit(TTY_PTY_LOCK, &tty->link->flags);
+		clear_bit(TTY_PTY_LOCK, &tty->link->flags);
+		file = filp_open(slavepath, O_RDWR | O_NOCTTY, 0);
+		ckpt_debug("slave file %p (index %d)\n", file, tty->index);
+		if (slavelock)
+			set_bit(TTY_PTY_LOCK, &tty->link->flags);
+		if (IS_ERR(file))
+			return file;
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
+
+static int restore_tty_ldisc(struct ckpt_ctx *ctx,
+			     struct tty_struct *tty,
+			     struct ckpt_hdr_tty *hh)
+{
+	struct ckpt_hdr_ldisc_n_tty *h;
+	int ret = -EINVAL;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TTY_LDISC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/* this is an unfair shortcut, because we know ldisc is n_tty */
+	if (hh->ldisc != N_TTY)
+		goto out;
+	if ((hh->flags & CKPT_LDISC_FLAGS) != CKPT_LDISC_GOOD)
+		goto out;
+
+	if (h->datalen > N_TTY_BUF_SIZE)
+		goto out;
+	if (h->canon_datalen > N_TTY_BUF_SIZE)
+		goto out;
+
+	if (h->datalen) {
+		ret = _ckpt_read_buffer(ctx, tty->read_buf, h->datalen);
+		if (ret < 0)
+			goto out;
+	}
+
+	/* TODO: sanitize all these values ? */
+
+	spin_lock_irq(&tty->read_lock);
+	tty->column = h->column;
+	tty->read_cnt = h->datalen;
+	tty->read_head = h->datalen;
+	tty->read_tail = 0;
+	tty->canon_column = h->canon_column;
+	tty->canon_head = h->canon_datalen;
+	tty->canon_data = h->canon_data;
+	spin_unlock_irq(&tty->read_lock);
+
+	tty->minimum_to_wake = h->minimum_to_wake;
+
+	tty->stopped = h->stopped;
+	tty->hw_stopped = h->hw_stopped;
+	tty->flow_stopped = h->flow_stopped;
+	tty->packet = h->packet;
+	tty->ctrl_status = h->ctrl_status;
+	tty->lnext = h->lnext;
+	tty->erasing = h->erasing;
+	tty->raw = h->raw;
+	tty->real_raw = h->real_raw;
+	tty->icanon = h->icanon;
+	tty->closing = h->closing;
+	BUILD_BUG_ON(sizeof(h->read_flags) != sizeof(tty->read_flags));
+	memcpy(tty->read_flags, h->read_flags, sizeof(tty->read_flags));
+	ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+#define CKPT_PTMX_PATH  "/dev/ptmx"
+
+static struct tty_struct *do_restore_tty(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_tty *h;
+	struct tty_struct *tty = ERR_PTR(-EINVAL);
+	struct file *file = NULL;
+	int master, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TTY);
+	if (IS_ERR(h))
+		return (struct tty_struct *) h;
+
+	if (h->driver_type != TTY_DRIVER_TYPE_PTY)
+		goto out;
+	if (h->driver_subtype == PTY_TYPE_MASTER)
+		master = 1;
+	else if (h->driver_subtype == PTY_TYPE_SLAVE)
+		master = 0;
+	else
+		goto out;
+	/* @link_objref is positive if-and-only-if tty is not master */
+	if (h->link_objref < 0 || (master ^ !h->link_objref))
+		goto out;
+	/* @file_objref is positive if-and-only-if tty is master */
+	if (h->file_objref < 0 || (master ^ !!h->file_objref))
+		goto out;
+	if (h->flags & CKPT_TTY_BAD)
+		goto out;
+	/* hung-up tty cannot be master, or have session or pgrp */
+	if (test_bit(TTY_HUPPED, (unsigned long *) &h->flags) && master)
+		goto out;
+
+	ckpt_debug("sanity checks passed, link %d\n", h->link_objref);
+
+	/*
+	 * If we ever support more than PTYs, this would be tty-type
+	 * specific (and probably called via tty_operations).
+	 */
+	if (master) {
+		file = pty_open_by_index(CKPT_PTMX_PATH, h->index);
+		if (IS_ERR(file)) {
+			ckpt_write_err(ctx, "TE", "open ptmx", PTR_ERR(file));
+			tty = ERR_PTR(PTR_ERR(file));
+			goto out;
+		}
+
+		/*
+		 * Add file to objhash to ensure proper cleanup later
+		 * (it isn't referenced elsewhere). Use h->file_objref
+		 * which was explicitly reserved during checkpoint for this.
+		 */
+		ret = ckpt_obj_insert(ctx, file, h->file_objref, CKPT_OBJ_FILE);
+		fput(file);  /* even on success (referenced in objhash) */
+		if (ret < 0) {
+			tty = ERR_PTR(ret);
+			goto out;
+		}
+
+		tty = file->private_data;
+	} else {
+		tty = ckpt_obj_fetch(ctx, h->link_objref, CKPT_OBJ_TTY);
+		if (IS_ERR(tty))
+			goto out;
+		tty = tty->link;
+	}
+
+	ckpt_debug("tty %p (hup %d)\n",
+		   tty, test_bit(TTY_HUPPED, (unsigned long *) &h->flags));
+
+	/* we now have the desired tty: restore its state as per @h */
+
+	mutex_lock(&tty->termios_mutex);
+	tty->termios->c_line = h->termios.c_line;
+	tty->termios->c_iflag = h->termios.c_iflag;
+	tty->termios->c_oflag = h->termios.c_oflag;
+	tty->termios->c_cflag = h->termios.c_cflag;
+	tty->termios->c_lflag = h->termios.c_lflag;
+	memcpy(tty->termios->c_cc, h->termios.c_cc, NCC);
+	tty->winsize.ws_row = h->winsize.ws_row;
+	tty->winsize.ws_col = h->winsize.ws_col;
+	tty->winsize.ws_ypixel = h->winsize.ws_ypixel;
+	tty->winsize.ws_xpixel = h->winsize.ws_xpixel;
+	mutex_unlock(&tty->termios_mutex);
+
+	if (test_bit(TTY_HUPPED, (unsigned long *) &h->flags))
+		tty_vhangup(tty);
+	else {
+		ret = restore_tty_ldisc(ctx, tty, h);
+		if (ret < 0) {
+			tty = ERR_PTR(ret);
+			goto out;
+		}
+	}
+
+	tty_kref_get(tty);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return tty;
+}
+
+void *restore_tty(struct ckpt_ctx *ctx)
+{
+#ifdef CONFIG_UNIX98_PTYS
+	return (void *) do_restore_tty(ctx);
+#else
+	return ERR_PTR(-ENOSYS);
+#endif
+}
+#endif /* CONFIG_CHECKPOINT */
+
 /*
  * This implements the "Secure Attention Key" ---  the idea is to
  * prevent trojan horses by killing all processes associated with this
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 92a21b2..7c117fc 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -266,6 +266,10 @@ extern int restore_obj_signal(struct ckpt_ctx *ctx, int signal_objref);
 extern int checkpoint_task_signal(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int restore_task_signal(struct ckpt_ctx *ctx);
 
+/* ttys */
+extern int checkpoint_tty(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_tty(struct ckpt_ctx *ctx);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ac16c59..bf584cb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -83,6 +83,8 @@ enum {
 	CKPT_HDR_FILE_NAME,
 	CKPT_HDR_FILE,
 	CKPT_HDR_PIPE_BUF,
+	CKPT_HDR_TTY,
+	CKPT_HDR_TTY_LDISC,
 
 	CKPT_HDR_MM = 401,
 	CKPT_HDR_VMA,
@@ -141,6 +143,7 @@ enum obj_type {
 	CKPT_OBJ_USER,
 	CKPT_OBJ_GROUPINFO,
 	CKPT_OBJ_SOCK,
+	CKPT_OBJ_TTY,
 	CKPT_OBJ_MAX
 };
 
@@ -161,6 +164,9 @@ struct ckpt_const {
 	__u16 uts_domainname_len;
 	/* rlimit */
 	__u16 rlimit_nlimits;
+	/* tty */
+	__u16 n_tty_buf_size;
+	__u16 tty_termios_ncc;
 } __attribute__((aligned(8)));
 
 /* checkpoint image header */
@@ -368,6 +374,7 @@ enum file_type {
 	CKPT_FILE_PIPE,
 	CKPT_FILE_FIFO,
 	CKPT_FILE_SOCKET,
+	CKPT_FILE_TTY,
 	CKPT_FILE_MAX
 };
 
@@ -660,6 +667,84 @@ struct ckpt_hdr_ipc_sem {
 } __attribute__((aligned(8)));
 
 
+/* devices */
+struct ckpt_hdr_file_tty {
+	struct ckpt_hdr_file common;
+	__s32 tty_objref;
+};
+
+struct ckpt_hdr_tty {
+	struct ckpt_hdr h;
+
+	__u16 driver_type;
+	__u16 driver_subtype;
+
+	__s32 link_objref;
+	__s32 file_objref;
+	__u32 _padding;
+
+	__u32 index;
+	__u32 ldisc;
+	__u64 flags;
+
+	/* termios */
+	struct {
+		__u16 c_iflag;
+		__u16 c_oflag;
+		__u16 c_cflag;
+		__u16 c_lflag;
+		__u8 c_line;
+		__u8 c_cc[CKPT_TTY_NCC];
+	} __attribute__((aligned(8))) termios;
+
+	/* winsize */
+	struct {
+		__u16 ws_row;
+		__u16 ws_col;
+		__u16 ws_xpixel;
+		__u16 ws_ypixel;
+	} __attribute__((aligned(8))) winsize;
+} __attribute__((aligned(8)));
+
+/* cannot include <linux/tty.h> from userspace, so define: */
+#define CKPT_N_TTY_BUF_SIZE  4096
+#ifdef __KERNEL__
+#include <linux/tty.h>
+#if CKPT_N_TTY_BUF_SIZE != N_TTY_BUF_SIZE
+#error CKPT_N_TTY_BUF_SIZE size is wrong per linux/tty.h
+#endif
+#endif
+
+struct ckpt_hdr_ldisc_n_tty {
+	struct ckpt_hdr h;
+
+	__u32 column;
+	__u32 datalen;
+	__u32 canon_column;
+	__u32 canon_datalen;
+	__u32 canon_data;
+
+	__u16 minimum_to_wake;
+
+	__u8 stopped;
+	__u8 hw_stopped;
+	__u8 flow_stopped;
+	__u8 packet;
+	__u8 ctrl_status;
+	__u8 lnext;
+	__u8 erasing;
+	__u8 raw;
+	__u8 real_raw;
+	__u8 icanon;
+	__u8 closing;
+	__u8 padding[3];
+
+	__u8 read_flags[CKPT_N_TTY_BUF_SIZE / 8];
+
+	/* if @datalen > 0, buffer contents follow (next object) */
+} __attribute__((aligned(8)));
+
+
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
 
diff --git a/include/linux/tty.h b/include/linux/tty.h
index fd40561..295447b 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -471,6 +471,13 @@ extern void tty_ldisc_enable(struct tty_struct *tty);
 /* This one is for ptmx_close() */
 extern int tty_release(struct inode *inode, struct file *filp);
 
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern struct file *tty_file_restore(struct ckpt_ctx *ctx,
+				     struct ckpt_hdr_file *ptr);
+#endif
+
 /* n_tty.c */
 extern struct tty_ldisc_ops tty_ldisc_N_TTY;
 
-- 
1.6.0.4
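
The objref sanity checks in do_restore_tty() above pair each pty role with exactly one kind of reference: a master must carry a file objref and no link objref, and a slave the reverse. A minimal userspace sketch of that rule follows; struct tty_hdr_sketch and tty_hdr_objrefs_valid() are hypothetical names used only for illustration and are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical, trimmed-down view of struct ckpt_hdr_tty */
struct tty_hdr_sketch {
	bool master;          /* derived from driver_subtype */
	int32_t link_objref;  /* peer tty, set only for a slave */
	int32_t file_objref;  /* ptmx file, set only for a master */
};

/* returns true if the objrefs are consistent with the pty role */
bool tty_hdr_objrefs_valid(const struct tty_hdr_sketch *h)
{
	/* negative references are never valid */
	if (h->link_objref < 0 || h->file_objref < 0)
		return false;
	/* a master has no link objref; a slave must have one */
	if (h->master != (h->link_objref == 0))
		return false;
	/* a master must carry a file objref; a slave must not */
	if (h->master != (h->file_objref != 0))
		return false;
	return true;
}
```

Writing the two conditions as comparisons against the role reads the same as the XOR form in the patch, and rejects the same headers.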



* [PATCH v18 73/80] c/r: correctly restore pgid
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (71 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 72/80] c/r: [pty 2/2] support for pseudo terminals Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 74/80] c/r: support for controlling terminal and job control Oren Laadan
                     ` (8 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

The main challenge with restoring the pgid of tasks is that the
original "owner" (the process with that pid) might have exited
already. I call these "ghost" pgids. 'mktree' does create these
processes, but they then exit without participating in the restart.

To solve this, this patch introduces a RESTART_GHOST flag, used for
"ghost" owners that are created only to pass their pgid to other
tasks. ('mktree' now makes them call restart(2) instead of exiting).

When a "ghost" task calls restart(2), it will be placed on a wait
queue until the restart completes and then exit. This guarantees that
the pgid that it owns remains available to all (regular) restarting
tasks when they need it.
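
The bookkeeping behind this can be modelled in a few lines. The sketch below is a single-threaded userspace model, not the kernel code: struct restart_sketch and the sketch_*() helpers are hypothetical stand-ins for the ckpt_ctx fields and for restore_activate_next() and restore_task_done(). It assumes that ghosts are counted in nr_total and report done when they exit, as the comment on nr_total suggests.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* hypothetical stand-in for the fields of struct ckpt_ctx used here */
struct restart_sketch {
	int nr_pids;         /* regular tasks listed in the image */
	int active_pid;      /* index of the task whose turn it is */
	atomic_int nr_total; /* regular tasks plus ghosts */
	bool ghosts_woken;   /* models wake_up_all(&ctx->ghostq) */
	bool completed;      /* models complete(&ctx->complete) */
};

void sketch_init(struct restart_sketch *s, int nr_pids, int nr_ghosts)
{
	s->nr_pids = nr_pids;
	s->active_pid = -1;
	atomic_init(&s->nr_total, nr_pids + nr_ghosts);
	s->ghosts_woken = false;
	s->completed = false;
}

/* models restore_activate_next(): pass the turn on, or release ghosts */
void sketch_activate_next(struct restart_sketch *s)
{
	s->active_pid++;
	assert(s->active_pid <= s->nr_pids);
	if (s->active_pid == s->nr_pids)
		s->ghosts_woken = true;
}

/* models restore_task_done(): the last task to finish signals completion */
void sketch_task_done(struct restart_sketch *s)
{
	int left = atomic_fetch_sub(&s->nr_total, 1) - 1;

	assert(left >= 0);
	if (left == 0)
		s->completed = true;
}
```

With two regular tasks and one ghost, the ghosts are released only after the second regular task passes the turn on, and completion is signalled only after the ghost itself has reported done.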

Regular tasks perform the restart as before, except that they also
now restore their old pgrp, which is guaranteed to exist.
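
One of the restrictions enforced when restoring a pgrp, that a session leader cannot change its own process group, mirrors what setpgid(2) already enforces from userspace. The sketch below is a userspace illustration only, with a hypothetical helper name: it forks a child that becomes a session leader and reports the errno of a subsequent setpgid() call. setpgid(2) documents EPERM for this case, whereas restore_task_pgid() in this patch returns -EINVAL.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that becomes a session leader and then tries to change
 * its own process group. Returns the errno the child observed (0 if the
 * call unexpectedly succeeded), or -1 if the experiment itself failed.
 */
int session_leader_setpgid_errno(void)
{
	int status;
	pid_t child = fork();

	if (child < 0)
		return -1;
	if (child == 0) {
		/* a freshly forked child is not a group leader, so setsid() may succeed */
		if (setsid() < 0)
			_exit(255);
		/* now a session leader: changing our own pgrp should be refused */
		if (setpgid(0, 0) == 0)
			_exit(0);
		_exit(errno & 0xff);
	}
	if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	if (WEXITSTATUS(status) == 255)
		return -1;
	return WEXITSTATUS(status);
}
```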

Changelog [v3]:
  - Fix leak of ckpt_ctx when restoring "ghost" tasks
Changelog [v2]:
  - Call change_pid() only if new pgrp differs from current one
Changelog [v1]:
  - Verify that pgid owner is a thread-group-leader.
  - Handle the case of pgid/sid == 0 using root's parent pid-ns

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/process.c             |   97 +++++++++++++++++++++++++++
 checkpoint/restart.c             |  137 +++++++++++++++++++++++++++-----------
 checkpoint/sys.c                 |    3 +-
 include/linux/checkpoint.h       |   11 +++-
 include/linux/checkpoint_hdr.h   |    3 +
 include/linux/checkpoint_types.h |    6 +-
 6 files changed, 214 insertions(+), 43 deletions(-)

diff --git a/checkpoint/process.c b/checkpoint/process.c
index e596e2a..3c02f8e 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -24,6 +24,57 @@
 #include <linux/syscalls.h>
 
 
+pid_t ckpt_pid_nr(struct ckpt_ctx *ctx, struct pid *pid)
+{
+	return pid ? pid_nr_ns(pid, ctx->root_nsproxy->pid_ns) : CKPT_PID_NULL;
+}
+
+/* must be called with tasklist_lock or rcu_read_lock() held */
+struct pid *_ckpt_find_pgrp(struct ckpt_ctx *ctx, pid_t pgid)
+{
+	struct task_struct *p;
+	struct pid *pgrp;
+
+	if (pgid == 0) {
+		/*
+		 * At checkpoint the pgid owner lived in an ancestor
+		 * pid-ns. The best we can do (sanely and safely) is
+		 * to examine the parent of this restart's root: if in
+		 * a distinct pid-ns, use its pgrp; otherwise fail.
+		 */
+		p = ctx->root_task->real_parent;
+		if (p->nsproxy->pid_ns == current->nsproxy->pid_ns)
+			return NULL;
+		pgrp = task_pgrp(p);
+	} else {
+		/*
+		 * Find the owner process of this pgid (it must exist
+		 * if pgrp exists). It must be a thread group leader.
+		 */
+		pgrp = find_vpid(pgid);
+		p = pid_task(pgrp, PIDTYPE_PID);
+		if (!p || !thread_group_leader(p))
+			return NULL;
+		/*
+		 * The pgrp must "belong" to our restart tree (compare
+		 * p->checkpoint_ctx to ours). This prevents malicious
+		 * input from (guessing and) using unrelated pgrps. If
+		 * the owner is dead, then it doesn't have a context,
+		 * so instead compare against its (real) parent's.
+		 */
+		if (p->exit_state == EXIT_ZOMBIE)
+			p = p->real_parent;
+		if (p->checkpoint_ctx != ctx)
+			return NULL;
+	}
+
+	if (task_session(current) != task_session(p))
+		return NULL;
+
+	return pgrp;
+}
+
+
 #ifdef CONFIG_FUTEX
 static void save_task_robust_futex_list(struct ckpt_hdr_task *h,
 					struct task_struct *t)
@@ -720,6 +771,49 @@ int restore_restart_block(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_pgid(struct ckpt_ctx *ctx)
+{
+	struct task_struct *task = current;
+	struct pid *pgrp;
+	pid_t pgid;
+	int ret;
+
+	/*
+	 * We enforce the following restrictions on restoring pgrp:
+	 *  1) Only thread group leaders restore pgrp
+	 *  2) Session leader cannot change own pgrp
+	 *  3) Owner of pgrp must belong to same restart tree
+	 *  4) Must have same session as other tasks in same pgrp
+	 *  5) Change must pass setpgid security callback
+	 *
+	 * TODO - check if we need additional restrictions ?
+	 */
+
+	if (!thread_group_leader(task))  /* (1) */
+		return 0;
+
+	pgid = ctx->pids_arr[ctx->active_pid].vpgid;
+
+	if (pgid == task_pgrp_vnr(task))  /* nothing to do */
+		return 0;
+
+	if (task->signal->leader)  /* (2) */
+		return -EINVAL;
+
+	ret = -EINVAL;
+
+	write_lock_irq(&tasklist_lock);
+	pgrp = _ckpt_find_pgrp(ctx, pgid);  /* (3) and (4) */
+	if (pgrp && task_pgrp(task) != pgrp) {
+		ret = security_task_setpgid(task, pgid);  /* (5) */
+		if (!ret)
+			change_pid(task, PIDTYPE_PGID, pgrp);
+	}
+	write_unlock_irq(&tasklist_lock);
+
+	return ret;
+}
+
 /* pre_restore_task - prepare the task for restore */
 static int pre_restore_task(struct ckpt_ctx *ctx)
 {
@@ -760,6 +854,9 @@ int restore_task(struct ckpt_ctx *ctx)
 	if (ret)
 		goto out;
 
+	ret = restore_task_pgid(ctx);
+	if (ret < 0)
+		goto out;
 	ret = restore_task_objs(ctx);
 	ckpt_debug("objs %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 1016278..543b380 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -511,6 +511,11 @@ static int restore_read_tree(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static inline int all_tasks_activated(struct ckpt_ctx *ctx)
+{
+	return (ctx->active_pid == ctx->nr_pids);
+}
+
 static inline pid_t get_active_pid(struct ckpt_ctx *ctx)
 {
 	int active = ctx->active_pid;
@@ -535,30 +540,42 @@ do { \
 	_restore_notify_error(ctx, errno); \
 } while(0)
 
+static void restore_task_done(struct ckpt_ctx *ctx)
+{
+	if (atomic_dec_and_test(&ctx->nr_total))
+		complete(&ctx->complete);
+	BUG_ON(atomic_read(&ctx->nr_total) < 0);
+}
+
 static int restore_activate_next(struct ckpt_ctx *ctx)
 {
 	struct task_struct *task;
 	pid_t pid;
 
-	if (++ctx->active_pid >= ctx->nr_pids) {
-		complete(&ctx->complete);
-		return 0;
-	}
+	ctx->active_pid++;
 
-	pid = get_active_pid(ctx);
+	BUG_ON(ctx->active_pid > ctx->nr_pids);
 
-	rcu_read_lock();
-	task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
-	/* target task must have same restart context */
-	if (task && task->checkpoint_ctx == ctx)
-		wake_up_process(task);
-	else
-		task = NULL;
-	rcu_read_unlock();
+	if (!all_tasks_activated(ctx)) {
+		/* wake up next task in line to restore its state */
+		pid = get_active_pid(ctx);
 
-	if (!task) {
-		restore_notify_error(ctx, -ESRCH);
-		return -ESRCH;
+		rcu_read_lock();
+		task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
+		/* target task must have same restart context */
+		if (task && task->checkpoint_ctx == ctx)
+			wake_up_process(task);
+		else
+			task = NULL;
+		rcu_read_unlock();
+
+		if (!task) {
+			restore_notify_error(ctx, -ESRCH);
+			return -ESRCH;
+		}
+	} else {
+		/* wake up ghost tasks so that they can terminate */
+		wake_up_all(&ctx->ghostq);
 	}
 
 	return 0;
@@ -593,7 +610,7 @@ static int wait_task_sync(struct ckpt_ctx *ctx)
 	return 0;
 }
 
-static int do_restore_task(void)
+static struct ckpt_ctx *wait_checkpoint_ctx(void)
 {
 	DECLARE_WAIT_QUEUE_HEAD(waitq);
 	struct ckpt_ctx *ctx, *old_ctx;
@@ -605,11 +622,11 @@ static int do_restore_task(void)
 	 */
 	ret = wait_event_interruptible(waitq, current->checkpoint_ctx);
 	if (ret < 0)
-		return ret;
+		return ERR_PTR(ret);
 
 	ctx = xchg(&current->checkpoint_ctx, NULL);
 	if (!ctx)
-		return -EAGAIN;
+		return ERR_PTR(-EAGAIN);
 	ckpt_ctx_get(ctx);
 
 	/*
@@ -628,9 +645,43 @@ static int do_restore_task(void)
 		/* alert our coordinator that we bail */
 		restore_notify_error(ctx, -EAGAIN);
 		ckpt_ctx_put(ctx);
-		return -EAGAIN;
+
+		ctx = ERR_PTR(-EAGAIN);
 	}
 
+	return ctx;
+}
+
+static int do_ghost_task(void)
+{
+	struct ckpt_ctx *ctx;
+
+	ctx = wait_checkpoint_ctx();
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	current->flags |= PF_RESTARTING;
+
+	wait_event_interruptible(ctx->ghostq,
+				 all_tasks_activated(ctx) ||
+				 ckpt_test_ctx_error(ctx));
+
+	current->exit_signal = -1;
+	ckpt_ctx_put(ctx);
+	do_exit(0);
+
+	/* NOT REACHED */
+}
+
+static int do_restore_task(void)
+{
+	struct ckpt_ctx *ctx, *old_ctx;
+	int zombie, ret;
+
+	ctx = wait_checkpoint_ctx();
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
 	current->flags |= PF_RESTARTING;
 
 	/* wait for our turn, do the restore, and tell next task in line */
@@ -638,24 +689,28 @@ static int do_restore_task(void)
 	if (ret < 0)
 		goto out;
 
-	ret = restore_task(ctx);
+	zombie = restore_task(ctx);
+	if (zombie < 0) {
+		ret = zombie;
+		goto out;
+	}
+
+	ret = restore_activate_next(ctx);
 	if (ret < 0)
 		goto out;
 
 	/*
 	 * zombie: we're done here; do_exit() will notice the @ctx on
-	 * our current->checkpoint_ctx (and our PF_RESTARTING) - it
-	 * will call restore_activate_next() and release the @ctx.
+	 * our current->checkpoint_ctx (and our PF_RESTARTING), will
+	 * call restore_task_done() and release the @ctx. This ensures
+	 * that we only report done after we really become zombie.
 	 */
-	if (ret) {
+	if (zombie) {
 		ckpt_ctx_put(ctx);
 		do_exit(current->exit_code);
 	}
 
-	ret = restore_activate_next(ctx);
-	if (ret < 0)
-		goto out;
-
+	restore_task_done(ctx);
 	ret = wait_task_sync(ctx);
  out:
 	old_ctx = xchg(&current->checkpoint_ctx, NULL);
@@ -666,6 +721,7 @@ static int do_restore_task(void)
 	if (ret < 0 && !ckpt_test_ctx_error(ctx)) {
 		restore_notify_error(ctx, ret);
 		wake_up_all(&ctx->waitq);
+		wake_up_all(&ctx->ghostq);
 	}
 
 	current->flags &= ~PF_RESTARTING;
@@ -687,11 +743,11 @@ static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
 	struct task_struct *parent = NULL;
 	struct task_struct *task = root;
 	struct ckpt_ctx *old_ctx;
-	int nr_pids = ctx->nr_pids;
+	int nr_pids = 0;
 	int ret = 0;
 
 	read_lock(&tasklist_lock);
-	while (nr_pids) {
+	while (1) {
 		ckpt_debug("consider task %d\n", task_pid_vnr(task));
 		if (task_ptrace(task) & PT_PTRACED) {
 			ret = -EBUSY;
@@ -718,7 +774,7 @@ static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
 			}
 			ckpt_debug("prepare task %d\n", task_pid_vnr(task));
 			wake_up_process(task);
-			nr_pids--;
+			nr_pids++;
 		}
 
 		/* if has children - proceed with child */
@@ -748,12 +804,16 @@ static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
 		}
 	}
 	read_unlock(&tasklist_lock);
-	ckpt_debug("left %d ret %d root/task %d\n", nr_pids, ret, task == root);
+	ckpt_debug("nr %d/%d  ret %d\n", ctx->nr_pids, nr_pids, ret);
 
-	/* fail unless number of processes matches */
-	if (!ret && (nr_pids || task != root))
+	/*
+	 * The actual task count may exceed ctx->nr_pids because of 'dead'
+	 * tasks used as place-holders for PGIDs, but must not fall short.
+	 */
+	if (!ret && (nr_pids < ctx->nr_pids))
 		ret = -ESRCH;
 
+	atomic_set(&ctx->nr_total, nr_pids);
 	return ret;
 }
 
@@ -832,7 +892,7 @@ static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
 	if (!nsproxy)
 		return -ESRCH;
 
-	ctx->active_pid = -1;	/* see restore_activate_next, get_active_pid */
+	ctx->active_pid = -1;   /* see restore_activate_next, get_active_pid */
 
 	return 0;
 }
@@ -973,12 +1033,14 @@ static long restore_retval(void)
 	return ret;
 }
 
-long do_restart(struct ckpt_ctx *ctx, pid_t pid)
+long do_restart(struct ckpt_ctx *ctx, pid_t pid, unsigned long flags)
 {
 	long ret;
 
 	if (ctx)
 		ret = do_restore_coord(ctx, pid);
+	else if (flags & RESTART_GHOST)
+		ret = do_ghost_task();
 	else
 		ret = do_restore_task();
 
@@ -1025,8 +1087,7 @@ void exit_checkpoint(struct task_struct *tsk)
 	/* restarting zombies will activate next task in restart */
 	if (tsk->flags & PF_RESTARTING) {
 		BUG_ON(ctx->active_pid == -1);
-		if (restore_activate_next(ctx) < 0)
-			pr_warning("c/r: [%d] failed zombie exit\n", tsk->pid);
+		restore_task_done(ctx);
 	}
 
 	ckpt_ctx_put(ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index d6a1650..76a3fa9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -238,6 +238,7 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	INIT_LIST_HEAD(&ctx->pgarr_list);
 	INIT_LIST_HEAD(&ctx->pgarr_pool);
 	init_waitqueue_head(&ctx->waitq);
+	init_waitqueue_head(&ctx->ghostq);
 
 	err = -EBADF;
 	ctx->file = fget(fd);
@@ -334,7 +335,7 @@ SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
 
-	ret = do_restart(ctx, pid);
+	ret = do_restart(ctx, pid, flags);
 
 	ckpt_ctx_put(ctx);
 	return ret;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 7c117fc..8e1cce7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -18,6 +18,7 @@
 /* restart user flags */
 #define RESTART_TASKSELF	0x1
 #define RESTART_FROZEN		0x2
+#define RESTART_GHOST		0x4
 
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
@@ -44,7 +45,10 @@
 
 /* ckpt_ctx: uflags */
 #define CHECKPOINT_USER_FLAGS		CHECKPOINT_SUBTREE
-#define RESTART_USER_FLAGS		(RESTART_TASKSELF | RESTART_FROZEN)
+#define RESTART_USER_FLAGS  \
+	(RESTART_TASKSELF | \
+	 RESTART_FROZEN | \
+	 RESTART_GHOST)
 
 extern void exit_checkpoint(struct task_struct *tsk);
 
@@ -78,6 +82,9 @@ extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 extern char *ckpt_fill_fname(struct path *path, struct path *root,
 			     char *buf, int *len);
 
+/* pids */
+extern pid_t ckpt_pid_nr(struct ckpt_ctx *ctx, struct pid *pid);
+
 /* socket functions */
 extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
 			      struct socket *socket,
@@ -130,7 +137,7 @@ extern void ckpt_ctx_get(struct ckpt_ctx *ctx);
 extern void ckpt_ctx_put(struct ckpt_ctx *ctx);
 
 extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
-extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
+extern long do_restart(struct ckpt_ctx *ctx, pid_t pid, unsigned long flags);
 
 /* task */
 extern int ckpt_activate_next(struct ckpt_ctx *ctx);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index bf584cb..842177f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -215,6 +215,9 @@ struct ckpt_pids {
 	__s32 vsid;
 } __attribute__((aligned(8)));
 
+/* pids */
+#define CKPT_PID_NULL  -1
+
 /* task data */
 struct ckpt_hdr_task {
 	struct ckpt_hdr h;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index be45666..9b7b4dd 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -68,9 +68,11 @@ struct ckpt_ctx {
 	/* [multi-process restart] */
 	struct ckpt_pids *pids_arr;	/* array of all pids [restart] */
 	int nr_pids;			/* size of pids array */
+	atomic_t nr_total;		/* total tasks count (with ghosts) */
 	int active_pid;			/* (next) position in pids array */
-	struct completion complete;	/* container root and other tasks on */
-	wait_queue_head_t waitq;	/* start, end, and restart ordering */
+	struct completion complete;	/* completion for container root */
+	wait_queue_head_t waitq;	/* waitqueue for restarting tasks */
+	wait_queue_head_t ghostq;	/* waitqueue for ghost tasks */
 	struct cred *realcred, *ecred;	/* tmp storage for cred at restart */
 
 	struct ckpt_stats stats;	/* statistics */
-- 
1.6.0.4


* [PATCH v18 73/80] c/r: correctly restore pgid
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

The main challenge with restoring the pgid of tasks is that the
original "owner" (the process with that pid) might have exited
already. I call these "ghost" pgids. 'mktree' does create these
processes, but they then exit without participating in the restart.

To solve this, this patch introduces a RESTART_GHOST flag, used for
"ghost" owners that are created only to pass their pgid to other
tasks. ('mktree' now makes them call restart(2) instead of exiting).

When a "ghost" task calls restart(2), it will be placed on a wait
queue until the restart completes and then exit. This guarantees that
the pgid that it owns remains available to all (regular) restarting
tasks when they need it.

Regular tasks perform the restart as before, except that they also
now restore their old pgrp, which is guaranteed to exist.

Changelog [v3]:
  - Fix leak of ckpt_ctx when restoring "ghost" tasks
Changelog [v2]:
  - Call change_pid() only if new pgrp differs from current one
Changelog [v1]:
  - Verify that pgid owner is a thread-group-leader.
  - Handle the case of pgid/sid == 0 using root's parent pid-ns

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/process.c             |   97 +++++++++++++++++++++++++++
 checkpoint/restart.c             |  137 +++++++++++++++++++++++++++-----------
 checkpoint/sys.c                 |    3 +-
 include/linux/checkpoint.h       |   11 +++-
 include/linux/checkpoint_hdr.h   |    3 +
 include/linux/checkpoint_types.h |    6 +-
 6 files changed, 214 insertions(+), 43 deletions(-)

diff --git a/checkpoint/process.c b/checkpoint/process.c
index e596e2a..3c02f8e 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -24,6 +24,57 @@
 #include <linux/syscalls.h>
 
 
+pid_t ckpt_pid_nr(struct ckpt_ctx *ctx, struct pid *pid)
+{
+	return pid ? pid_nr_ns(pid, ctx->root_nsproxy->pid_ns) : CKPT_PID_NULL;
+}
+
+/* must be called with tasklist_lock or rcu_read_lock() held */
+struct pid *_ckpt_find_pgrp(struct ckpt_ctx *ctx, pid_t pgid)
+{
+	struct task_struct *p;
+	struct pid *pgrp;
+
+	if (pgid == 0) {
+		/*
+		 * At checkpoint the pgid owner lived in an ancestor
+		 * pid-ns. The best we can do (sanely and safely) is
+		 * to examine the parent of this restart's root: if in
+		 * a distinct pid-ns, use its pgrp; otherwise fail.
+		 */
+		p = ctx->root_task->real_parent;
+		if (p->nsproxy->pid_ns == current->nsproxy->pid_ns)
+			return NULL;
+		pgrp = task_pgrp(p);
+	} else {
+		/*
+		 * Find the owner process of this pgid (it must exist
+		 * if pgrp exists). It must be a thread group leader.
+		 */
+		pgrp = find_vpid(pgid);
+		p = pid_task(pgrp, PIDTYPE_PID);
+		if (!p || !thread_group_leader(p))
+			return NULL;
+		/*
+		 * The pgrp must "belong" to our restart tree (compare
+		 * p->checkpoint_ctx to ours). This prevents malicious
+		 * input from (guessing and) using unrelated pgrps. If
+		 * the owner is dead, then it doesn't have a context,
+		 * so instead compare against its (real) parent's.
+		 */
+		if (p->exit_state == EXIT_ZOMBIE)
+			p = p->real_parent;
+		if (p->checkpoint_ctx != ctx)
+			return NULL;
+	}
+
+	if (task_session(current) != task_session(p))
+		return NULL;
+
+	return pgrp;
+}
+
+
 #ifdef CONFIG_FUTEX
 static void save_task_robust_futex_list(struct ckpt_hdr_task *h,
 					struct task_struct *t)
@@ -720,6 +771,49 @@ int restore_restart_block(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_pgid(struct ckpt_ctx *ctx)
+{
+	struct task_struct *task = current;
+	struct pid *pgrp;
+	pid_t pgid;
+	int ret;
+
+	/*
+	 * We enforce the following restrictions on restoring pgrp:
+	 *  1) Only thread group leaders restore pgrp
+	 *  2) Session leader cannot change own pgrp
+	 *  3) Owner of pgrp must belong to same restart tree
+	 *  4) Must have same session as other tasks in same pgrp
+	 *  5) Change must pass setpgid security callback
+	 *
+	 * TODO - check if we need additional restrictions ?
+	 */
+
+	if (!thread_group_leader(task))  /* (1) */
+		return 0;
+
+	pgid = ctx->pids_arr[ctx->active_pid].vpgid;
+
+	if (pgid == task_pgrp_vnr(task))  /* nothing to do */
+		return 0;
+
+	if (task->signal->leader)  /* (2) */
+		return -EINVAL;
+
+	ret = -EINVAL;
+
+	write_lock_irq(&tasklist_lock);
+	pgrp = _ckpt_find_pgrp(ctx, pgid);  /* (3) and (4) */
+	if (pgrp && task_pgrp(task) != pgrp) {
+		ret = security_task_setpgid(task, pgid);  /* (5) */
+		if (!ret)
+			change_pid(task, PIDTYPE_PGID, pgrp);
+	}
+	write_unlock_irq(&tasklist_lock);
+
+	return ret;
+}
+
 /* pre_restore_task - prepare the task for restore */
 static int pre_restore_task(struct ckpt_ctx *ctx)
 {
@@ -760,6 +854,9 @@ int restore_task(struct ckpt_ctx *ctx)
 	if (ret)
 		goto out;
 
+	ret = restore_task_pgid(ctx);
+	if (ret < 0)
+		goto out;
 	ret = restore_task_objs(ctx);
 	ckpt_debug("objs %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 1016278..543b380 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -511,6 +511,11 @@ static int restore_read_tree(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static inline int all_tasks_activated(struct ckpt_ctx *ctx)
+{
+	return (ctx->active_pid == ctx->nr_pids);
+}
+
 static inline pid_t get_active_pid(struct ckpt_ctx *ctx)
 {
 	int active = ctx->active_pid;
@@ -535,30 +540,42 @@ do { \
 	_restore_notify_error(ctx, errno); \
 } while(0)
 
+static void restore_task_done(struct ckpt_ctx *ctx)
+{
+	if (atomic_dec_and_test(&ctx->nr_total))
+		complete(&ctx->complete);
+	BUG_ON(atomic_read(&ctx->nr_total) < 0);
+}
+
 static int restore_activate_next(struct ckpt_ctx *ctx)
 {
 	struct task_struct *task;
 	pid_t pid;
 
-	if (++ctx->active_pid >= ctx->nr_pids) {
-		complete(&ctx->complete);
-		return 0;
-	}
+	ctx->active_pid++;
 
-	pid = get_active_pid(ctx);
+	BUG_ON(ctx->active_pid > ctx->nr_pids);
 
-	rcu_read_lock();
-	task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
-	/* target task must have same restart context */
-	if (task && task->checkpoint_ctx == ctx)
-		wake_up_process(task);
-	else
-		task = NULL;
-	rcu_read_unlock();
+	if (!all_tasks_activated(ctx)) {
+		/* wake up next task in line to restore its state */
+		pid = get_active_pid(ctx);
 
-	if (!task) {
-		restore_notify_error(ctx, -ESRCH);
-		return -ESRCH;
+		rcu_read_lock();
+		task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
+		/* target task must have same restart context */
+		if (task && task->checkpoint_ctx == ctx)
+			wake_up_process(task);
+		else
+			task = NULL;
+		rcu_read_unlock();
+
+		if (!task) {
+			restore_notify_error(ctx, -ESRCH);
+			return -ESRCH;
+		}
+	} else {
+		/* wake up ghost tasks so that they can terminate */
+		wake_up_all(&ctx->ghostq);
 	}
 
 	return 0;
@@ -593,7 +610,7 @@ static int wait_task_sync(struct ckpt_ctx *ctx)
 	return 0;
 }
 
-static int do_restore_task(void)
+static struct ckpt_ctx *wait_checkpoint_ctx(void)
 {
 	DECLARE_WAIT_QUEUE_HEAD(waitq);
 	struct ckpt_ctx *ctx, *old_ctx;
@@ -605,11 +622,11 @@ static int do_restore_task(void)
 	 */
 	ret = wait_event_interruptible(waitq, current->checkpoint_ctx);
 	if (ret < 0)
-		return ret;
+		return ERR_PTR(ret);
 
 	ctx = xchg(&current->checkpoint_ctx, NULL);
 	if (!ctx)
-		return -EAGAIN;
+		return ERR_PTR(-EAGAIN);
 	ckpt_ctx_get(ctx);
 
 	/*
@@ -628,9 +645,43 @@ static int do_restore_task(void)
 		/* alert our coordinator that we bail */
 		restore_notify_error(ctx, -EAGAIN);
 		ckpt_ctx_put(ctx);
-		return -EAGAIN;
+
+		ctx = ERR_PTR(-EAGAIN);
 	}
 
+	return ctx;
+}
+
+static int do_ghost_task(void)
+{
+	struct ckpt_ctx *ctx;
+
+	ctx = wait_checkpoint_ctx();
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	current->flags |= PF_RESTARTING;
+
+	wait_event_interruptible(ctx->ghostq,
+				 all_tasks_activated(ctx) ||
+				 ckpt_test_ctx_error(ctx));
+
+	current->exit_signal = -1;
+	ckpt_ctx_put(ctx);
+	do_exit(0);
+
+	/* NOT REACHED */
+}
+
+static int do_restore_task(void)
+{
+	struct ckpt_ctx *ctx, *old_ctx;
+	int zombie, ret;
+
+	ctx = wait_checkpoint_ctx();
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
 	current->flags |= PF_RESTARTING;
 
 	/* wait for our turn, do the restore, and tell next task in line */
@@ -638,24 +689,28 @@ static int do_restore_task(void)
 	if (ret < 0)
 		goto out;
 
-	ret = restore_task(ctx);
+	zombie = restore_task(ctx);
+	if (zombie < 0) {
+		ret = zombie;
+		goto out;
+	}
+
+	ret = restore_activate_next(ctx);
 	if (ret < 0)
 		goto out;
 
 	/*
 	 * zombie: we're done here; do_exit() will notice the @ctx on
-	 * our current->checkpoint_ctx (and our PF_RESTARTING) - it
-	 * will call restore_activate_next() and release the @ctx.
+	 * our current->checkpoint_ctx (and our PF_RESTARTING), will
+	 * call restore_task_done() and release the @ctx. This ensures
+	 * that we only report done after we really become zombie.
 	 */
-	if (ret) {
+	if (zombie) {
 		ckpt_ctx_put(ctx);
 		do_exit(current->exit_code);
 	}
 
-	ret = restore_activate_next(ctx);
-	if (ret < 0)
-		goto out;
-
+	restore_task_done(ctx);
 	ret = wait_task_sync(ctx);
  out:
 	old_ctx = xchg(&current->checkpoint_ctx, NULL);
@@ -666,6 +721,7 @@ static int do_restore_task(void)
 	if (ret < 0 && !ckpt_test_ctx_error(ctx)) {
 		restore_notify_error(ctx, ret);
 		wake_up_all(&ctx->waitq);
+		wake_up_all(&ctx->ghostq);
 	}
 
 	current->flags &= ~PF_RESTARTING;
@@ -687,11 +743,11 @@ static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
 	struct task_struct *parent = NULL;
 	struct task_struct *task = root;
 	struct ckpt_ctx *old_ctx;
-	int nr_pids = ctx->nr_pids;
+	int nr_pids = 0;
 	int ret = 0;
 
 	read_lock(&tasklist_lock);
-	while (nr_pids) {
+	while (1) {
 		ckpt_debug("consider task %d\n", task_pid_vnr(task));
 		if (task_ptrace(task) & PT_PTRACED) {
 			ret = -EBUSY;
@@ -718,7 +774,7 @@ static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
 			}
 			ckpt_debug("prepare task %d\n", task_pid_vnr(task));
 			wake_up_process(task);
-			nr_pids--;
+			nr_pids++;
 		}
 
 		/* if has children - proceed with child */
@@ -748,12 +804,16 @@ static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
 		}
 	}
 	read_unlock(&tasklist_lock);
-	ckpt_debug("left %d ret %d root/task %d\n", nr_pids, ret, task == root);
+	ckpt_debug("nr %d/%d  ret %d\n", ctx->nr_pids, nr_pids, ret);
 
-	/* fail unless number of processes matches */
-	if (!ret && (nr_pids || task != root))
+	/*
+	 * Actual task count may exceed ctx->nr_pids due to 'dead'
+	 * tasks used as place-holders for PGIDs, but must not fall short.
+	 */
+	if (!ret && (nr_pids < ctx->nr_pids))
 		ret = -ESRCH;
 
+	atomic_set(&ctx->nr_total, nr_pids);
 	return ret;
 }
 
@@ -832,7 +892,7 @@ static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
 	if (!nsproxy)
 		return -ESRCH;
 
-	ctx->active_pid = -1;	/* see restore_activate_next, get_active_pid */
+	ctx->active_pid = -1;   /* see restore_activate_next, get_active_pid */
 
 	return 0;
 }
@@ -973,12 +1033,14 @@ static long restore_retval(void)
 	return ret;
 }
 
-long do_restart(struct ckpt_ctx *ctx, pid_t pid)
+long do_restart(struct ckpt_ctx *ctx, pid_t pid, unsigned long flags)
 {
 	long ret;
 
 	if (ctx)
 		ret = do_restore_coord(ctx, pid);
+	else if (flags & RESTART_GHOST)
+		ret = do_ghost_task();
 	else
 		ret = do_restore_task();
 
@@ -1025,8 +1087,7 @@ void exit_checkpoint(struct task_struct *tsk)
 	/* restarting zombies will activate next task in restart */
 	if (tsk->flags & PF_RESTARTING) {
 		BUG_ON(ctx->active_pid == -1);
-		if (restore_activate_next(ctx) < 0)
-			pr_warning("c/r: [%d] failed zombie exit\n", tsk->pid);
+		restore_task_done(ctx);
 	}
 
 	ckpt_ctx_put(ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index d6a1650..76a3fa9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -238,6 +238,7 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	INIT_LIST_HEAD(&ctx->pgarr_list);
 	INIT_LIST_HEAD(&ctx->pgarr_pool);
 	init_waitqueue_head(&ctx->waitq);
+	init_waitqueue_head(&ctx->ghostq);
 
 	err = -EBADF;
 	ctx->file = fget(fd);
@@ -334,7 +335,7 @@ SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
 
-	ret = do_restart(ctx, pid);
+	ret = do_restart(ctx, pid, flags);
 
 	ckpt_ctx_put(ctx);
 	return ret;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 7c117fc..8e1cce7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -18,6 +18,7 @@
 /* restart user flags */
 #define RESTART_TASKSELF	0x1
 #define RESTART_FROZEN		0x2
+#define RESTART_GHOST		0x4
 
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
@@ -44,7 +45,10 @@
 
 /* ckpt_ctx: uflags */
 #define CHECKPOINT_USER_FLAGS		CHECKPOINT_SUBTREE
-#define RESTART_USER_FLAGS		(RESTART_TASKSELF | RESTART_FROZEN)
+#define RESTART_USER_FLAGS  \
+	(RESTART_TASKSELF | \
+	 RESTART_FROZEN | \
+	 RESTART_GHOST)
 
 extern void exit_checkpoint(struct task_struct *tsk);
 
@@ -78,6 +82,9 @@ extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 extern char *ckpt_fill_fname(struct path *path, struct path *root,
 			     char *buf, int *len);
 
+/* pids */
+extern pid_t ckpt_pid_nr(struct ckpt_ctx *ctx, struct pid *pid);
+
 /* socket functions */
 extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
 			      struct socket *socket,
@@ -130,7 +137,7 @@ extern void ckpt_ctx_get(struct ckpt_ctx *ctx);
 extern void ckpt_ctx_put(struct ckpt_ctx *ctx);
 
 extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
-extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
+extern long do_restart(struct ckpt_ctx *ctx, pid_t pid, unsigned long flags);
 
 /* task */
 extern int ckpt_activate_next(struct ckpt_ctx *ctx);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index bf584cb..842177f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -215,6 +215,9 @@ struct ckpt_pids {
 	__s32 vsid;
 } __attribute__((aligned(8)));
 
+/* pids */
+#define CKPT_PID_NULL  -1
+
 /* task data */
 struct ckpt_hdr_task {
 	struct ckpt_hdr h;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index be45666..9b7b4dd 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -68,9 +68,11 @@ struct ckpt_ctx {
 	/* [multi-process restart] */
 	struct ckpt_pids *pids_arr;	/* array of all pids [restart] */
 	int nr_pids;			/* size of pids array */
+	atomic_t nr_total;		/* total tasks count (with ghosts) */
 	int active_pid;			/* (next) position in pids array */
-	struct completion complete;	/* container root and other tasks on */
-	wait_queue_head_t waitq;	/* start, end, and restart ordering */
+	struct completion complete;	/* completion for container root */
+	wait_queue_head_t waitq;	/* waitqueue for restarting tasks */
+	wait_queue_head_t ghostq;	/* waitqueue for ghost tasks */
 	struct cred *realcred, *ecred;	/* tmp storage for cred at restart */
 
 	struct ckpt_stats stats;	/* statistics */
-- 
1.6.0.4



* [PATCH v18 73/80] c/r: correctly restore pgid
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

The main challenge with restoring the pgid of tasks is that the
original "owner" (the process with that pid) might have exited
already. I call these "ghost" pgids. 'mktree' does create these
processes, but they then exit without participating in the restart.

To solve this, this patch introduces a RESTART_GHOST flag, used for
"ghost" owners that are created only to pass their pgid to other
tasks. ('mktree' now makes them call restart(2) instead of exiting).
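
As a rough illustration of how the new flag selects a code path, here is a
user-space sketch of the dispatch that do_restart() performs. The flag values
are the ones this patch defines; the enum, the function name
classify_restart_caller() and the rejection of unknown flag bits are
assumptions made for the example, not part of the patch.

```c
#include <stddef.h>

/* Flag values as defined by this patch in include/linux/checkpoint.h */
#define RESTART_TASKSELF	0x1
#define RESTART_FROZEN		0x2
#define RESTART_GHOST		0x4
#define RESTART_USER_FLAGS \
	(RESTART_TASKSELF | RESTART_FROZEN | RESTART_GHOST)

/* Hypothetical labels for the roles a restart(2) caller can take */
enum restart_role {
	ROLE_COORDINATOR,
	ROLE_GHOST,
	ROLE_TASK,
	ROLE_INVALID,
};

/*
 * Mirrors the order of tests in do_restart(): a caller that has a
 * context is the coordinator, otherwise RESTART_GHOST selects the
 * ghost path, otherwise the caller is a regular restarting task.
 * Rejecting bits outside RESTART_USER_FLAGS is an assumption here.
 */
static enum restart_role classify_restart_caller(const void *ctx,
						 unsigned long flags)
{
	if (flags & ~(unsigned long)RESTART_USER_FLAGS)
		return ROLE_INVALID;
	if (ctx)
		return ROLE_COORDINATOR;
	if (flags & RESTART_GHOST)
		return ROLE_GHOST;
	return ROLE_TASK;
}
```

The coordinator test comes first, so RESTART_GHOST only matters for callers
that do not carry a context of their own.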

When a "ghost" task calls restart(2), it will be placed on a wait
queue until the restart completes and then exit. This guarantees that
the pgid that it owns remains available for all (regular) restarting
tasks for when they need it.

Regular tasks perform the restart as before, except that they also
now restore their old pgrp, which is guaranteed to exist.
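
The accounting behind this can be sketched in user space as follows. It is a
simplified model, not the kernel code: struct model_ctx and the model_*()
helpers are hypothetical stand-ins for ckpt_ctx, restore_activate_next() and
restore_task_done(), the two booleans replace the wait queue and the
completion, and it assumes that the coordinator performs the first activation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the ckpt_ctx fields used by the restart */
struct model_ctx {
	int nr_pids;		/* tasks recorded in the checkpoint image */
	int active_pid;		/* index of the task restoring right now */
	atomic_int nr_total;	/* regular tasks plus ghosts not yet done */
	bool ghosts_woken;	/* models wake_up_all(&ctx->ghostq) */
	bool complete;		/* models complete(&ctx->complete) */
};

static void model_init(struct model_ctx *ctx, int nr_pids, int nr_ghosts)
{
	ctx->nr_pids = nr_pids;
	ctx->active_pid = -1;	/* nobody activated yet */
	atomic_init(&ctx->nr_total, nr_pids + nr_ghosts);
	ctx->ghosts_woken = false;
	ctx->complete = false;
}

/* Pass the turn to the next task; release ghosts after the last one */
static void model_activate_next(struct model_ctx *ctx)
{
	ctx->active_pid++;
	assert(ctx->active_pid <= ctx->nr_pids);
	if (ctx->active_pid == ctx->nr_pids)
		ctx->ghosts_woken = true;
}

/* Called once per task (regular or ghost) when it is really done */
static void model_task_done(struct model_ctx *ctx)
{
	/* atomic_fetch_sub() returns the value before the decrement */
	if (atomic_fetch_sub(&ctx->nr_total, 1) == 1)
		ctx->complete = true;
}
```

With two regular tasks and one ghost, the ghost is released only after the
second task passes its turn, and the completion fires only after the ghost
itself has reported done.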

Changelog [v3]:
  - Fix leak of ckpt_ctx when restoring "ghost" tasks
Changelog [v2]:
  - Call change_pid() only if new pgrp differs from current one
Changelog [v1]:
  - Verify that pgid owner is a thread-group-leader.
  - Handle the case of pgid/sid == 0 using root's parent pid-ns

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/process.c             |   97 +++++++++++++++++++++++++++
 checkpoint/restart.c             |  137 +++++++++++++++++++++++++++-----------
 checkpoint/sys.c                 |    3 +-
 include/linux/checkpoint.h       |   11 +++-
 include/linux/checkpoint_hdr.h   |    3 +
 include/linux/checkpoint_types.h |    6 +-
 6 files changed, 214 insertions(+), 43 deletions(-)

diff --git a/checkpoint/process.c b/checkpoint/process.c
index e596e2a..3c02f8e 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -24,6 +24,57 @@
 #include <linux/syscalls.h>
 
 
+pid_t ckpt_pid_nr(struct ckpt_ctx *ctx, struct pid *pid)
+{
+	return pid ? pid_nr_ns(pid, ctx->root_nsproxy->pid_ns) : CKPT_PID_NULL;
+}
+
+/* must be called with tasklist_lock or rcu_read_lock() held */
+struct pid *_ckpt_find_pgrp(struct ckpt_ctx *ctx, pid_t pgid)
+{
+	struct task_struct *p;
+	struct pid *pgrp;
+
+	if (pgid == 0) {
+		/*
+		 * At checkpoint the pgid owner lived in an ancestor
+		 * pid-ns. The best we can do (sanely and safely) is
+		 * to examine the parent of this restart's root: if in
+		 * a distinct pid-ns, use its pgrp; otherwise fail.
+		 */
+		p = ctx->root_task->real_parent;
+		if (p->nsproxy->pid_ns == current->nsproxy->pid_ns)
+			return NULL;
+		pgrp = task_pgrp(p);
+	} else {
+		/*
+		 * Find the owner process of this pgid (it must exist
+		 * if pgrp exists). It must be a thread group leader.
+		 */
+		pgrp = find_vpid(pgid);
+		p = pid_task(pgrp, PIDTYPE_PID);
+		if (!p || !thread_group_leader(p))
+			return NULL;
+		/*
+		 * The pgrp must "belong" to our restart tree (compare
+		 * p->checkpoint_ctx to ours). This prevents malicious
+		 * input from (guessing and) using unrelated pgrps. If
+		 * the owner is dead, then it doesn't have a context,
+		 * so instead compare against its (real) parent's.
+		 */
+		if (p->exit_state == EXIT_ZOMBIE)
+			p = p->real_parent;
+		if (p->checkpoint_ctx != ctx)
+			return NULL;
+	}
+
+	if (task_session(current) != task_session(p))
+		return NULL;
+
+	return pgrp;
+}
+
+
 #ifdef CONFIG_FUTEX
 static void save_task_robust_futex_list(struct ckpt_hdr_task *h,
 					struct task_struct *t)
@@ -720,6 +771,49 @@ int restore_restart_block(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_pgid(struct ckpt_ctx *ctx)
+{
+	struct task_struct *task = current;
+	struct pid *pgrp;
+	pid_t pgid;
+	int ret;
+
+	/*
+	 * We enforce the following restrictions on restoring pgrp:
+	 *  1) Only thread group leaders restore pgrp
+	 *  2) Session leader cannot change own pgrp
+	 *  3) Owner of pgrp must belong to same restart tree
+	 *  4) Must have same session as other tasks in same pgrp
+	 *  5) Change must pass setpgid security callback
+	 *
+	 * TODO - check if we need additional restrictions ?
+	 */
+
+	if (!thread_group_leader(task))  /* (1) */
+		return 0;
+
+	pgid = ctx->pids_arr[ctx->active_pid].vpgid;
+
+	if (pgid == task_pgrp_vnr(task))  /* nothing to do */
+		return 0;
+
+	if (task->signal->leader)  /* (2) */
+		return -EINVAL;
+
+	ret = -EINVAL;
+
+	write_lock_irq(&tasklist_lock);
+	pgrp = _ckpt_find_pgrp(ctx, pgid);  /* (3) and (4) */
+	if (pgrp && task_pgrp(task) != pgrp) {
+		ret = security_task_setpgid(task, pgid);  /* (5) */
+		if (!ret)
+			change_pid(task, PIDTYPE_PGID, pgrp);
+	}
+	write_unlock_irq(&tasklist_lock);
+
+	return ret;
+}
+
 /* pre_restore_task - prepare the task for restore */
 static int pre_restore_task(struct ckpt_ctx *ctx)
 {
@@ -760,6 +854,9 @@ int restore_task(struct ckpt_ctx *ctx)
 	if (ret)
 		goto out;
 
+	ret = restore_task_pgid(ctx);
+	if (ret < 0)
+		goto out;
 	ret = restore_task_objs(ctx);
 	ckpt_debug("objs %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 1016278..543b380 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -511,6 +511,11 @@ static int restore_read_tree(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static inline int all_tasks_activated(struct ckpt_ctx *ctx)
+{
+	return (ctx->active_pid == ctx->nr_pids);
+}
+
 static inline pid_t get_active_pid(struct ckpt_ctx *ctx)
 {
 	int active = ctx->active_pid;
@@ -535,30 +540,42 @@ do { \
 	_restore_notify_error(ctx, errno); \
 } while(0)
 
+static void restore_task_done(struct ckpt_ctx *ctx)
+{
+	if (atomic_dec_and_test(&ctx->nr_total))
+		complete(&ctx->complete);
+	BUG_ON(atomic_read(&ctx->nr_total) < 0);
+}
+
 static int restore_activate_next(struct ckpt_ctx *ctx)
 {
 	struct task_struct *task;
 	pid_t pid;
 
-	if (++ctx->active_pid >= ctx->nr_pids) {
-		complete(&ctx->complete);
-		return 0;
-	}
+	ctx->active_pid++;
 
-	pid = get_active_pid(ctx);
+	BUG_ON(ctx->active_pid > ctx->nr_pids);
 
-	rcu_read_lock();
-	task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
-	/* target task must have same restart context */
-	if (task && task->checkpoint_ctx == ctx)
-		wake_up_process(task);
-	else
-		task = NULL;
-	rcu_read_unlock();
+	if (!all_tasks_activated(ctx)) {
+		/* wake up next task in line to restore its state */
+		pid = get_active_pid(ctx);
 
-	if (!task) {
-		restore_notify_error(ctx, -ESRCH);
-		return -ESRCH;
+		rcu_read_lock();
+		task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
+		/* target task must have same restart context */
+		if (task && task->checkpoint_ctx == ctx)
+			wake_up_process(task);
+		else
+			task = NULL;
+		rcu_read_unlock();
+
+		if (!task) {
+			restore_notify_error(ctx, -ESRCH);
+			return -ESRCH;
+		}
+	} else {
+		/* wake up ghost tasks so that they can terminate */
+		wake_up_all(&ctx->ghostq);
 	}
 
 	return 0;
@@ -593,7 +610,7 @@ static int wait_task_sync(struct ckpt_ctx *ctx)
 	return 0;
 }
 
-static int do_restore_task(void)
+static struct ckpt_ctx *wait_checkpoint_ctx(void)
 {
 	DECLARE_WAIT_QUEUE_HEAD(waitq);
 	struct ckpt_ctx *ctx, *old_ctx;
@@ -605,11 +622,11 @@ static int do_restore_task(void)
 	 */
 	ret = wait_event_interruptible(waitq, current->checkpoint_ctx);
 	if (ret < 0)
-		return ret;
+		return ERR_PTR(ret);
 
 	ctx = xchg(&current->checkpoint_ctx, NULL);
 	if (!ctx)
-		return -EAGAIN;
+		return ERR_PTR(-EAGAIN);
 	ckpt_ctx_get(ctx);
 
 	/*
@@ -628,9 +645,43 @@ static int do_restore_task(void)
 		/* alert our coordinator that we bail */
 		restore_notify_error(ctx, -EAGAIN);
 		ckpt_ctx_put(ctx);
-		return -EAGAIN;
+
+		ctx = ERR_PTR(-EAGAIN);
 	}
 
+	return ctx;
+}
+
+static int do_ghost_task(void)
+{
+	struct ckpt_ctx *ctx;
+
+	ctx = wait_checkpoint_ctx();
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	current->flags |= PF_RESTARTING;
+
+	wait_event_interruptible(ctx->ghostq,
+				 all_tasks_activated(ctx) ||
+				 ckpt_test_ctx_error(ctx));
+
+	current->exit_signal = -1;
+	ckpt_ctx_put(ctx);
+	do_exit(0);
+
+	/* NOT REACHED */
+}
+
+static int do_restore_task(void)
+{
+	struct ckpt_ctx *ctx, *old_ctx;
+	int zombie, ret;
+
+	ctx = wait_checkpoint_ctx();
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
 	current->flags |= PF_RESTARTING;
 
 	/* wait for our turn, do the restore, and tell next task in line */
@@ -638,24 +689,28 @@ static int do_restore_task(void)
 	if (ret < 0)
 		goto out;
 
-	ret = restore_task(ctx);
+	zombie = restore_task(ctx);
+	if (zombie < 0) {
+		ret = zombie;
+		goto out;
+	}
+
+	ret = restore_activate_next(ctx);
 	if (ret < 0)
 		goto out;
 
 	/*
 	 * zombie: we're done here; do_exit() will notice the @ctx on
-	 * our current->checkpoint_ctx (and our PF_RESTARTING) - it
-	 * will call restore_activate_next() and release the @ctx.
+	 * our current->checkpoint_ctx (and our PF_RESTARTING), will
+	 * call restore_task_done() and release the @ctx. This ensures
+	 * that we only report done after we really become zombie.
 	 */
-	if (ret) {
+	if (zombie) {
 		ckpt_ctx_put(ctx);
 		do_exit(current->exit_code);
 	}
 
-	ret = restore_activate_next(ctx);
-	if (ret < 0)
-		goto out;
-
+	restore_task_done(ctx);
 	ret = wait_task_sync(ctx);
  out:
 	old_ctx = xchg(&current->checkpoint_ctx, NULL);
@@ -666,6 +721,7 @@ static int do_restore_task(void)
 	if (ret < 0 && !ckpt_test_ctx_error(ctx)) {
 		restore_notify_error(ctx, ret);
 		wake_up_all(&ctx->waitq);
+		wake_up_all(&ctx->ghostq);
 	}
 
 	current->flags &= ~PF_RESTARTING;
@@ -687,11 +743,11 @@ static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
 	struct task_struct *parent = NULL;
 	struct task_struct *task = root;
 	struct ckpt_ctx *old_ctx;
-	int nr_pids = ctx->nr_pids;
+	int nr_pids = 0;
 	int ret = 0;
 
 	read_lock(&tasklist_lock);
-	while (nr_pids) {
+	while (1) {
 		ckpt_debug("consider task %d\n", task_pid_vnr(task));
 		if (task_ptrace(task) & PT_PTRACED) {
 			ret = -EBUSY;
@@ -718,7 +774,7 @@ static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
 			}
 			ckpt_debug("prepare task %d\n", task_pid_vnr(task));
 			wake_up_process(task);
-			nr_pids--;
+			nr_pids++;
 		}
 
 		/* if has children - proceed with child */
@@ -748,12 +804,16 @@ static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
 		}
 	}
 	read_unlock(&tasklist_lock);
-	ckpt_debug("left %d ret %d root/task %d\n", nr_pids, ret, task == root);
+	ckpt_debug("nr %d/%d  ret %d\n", ctx->nr_pids, nr_pids, ret);
 
-	/* fail unless number of processes matches */
-	if (!ret && (nr_pids || task != root))
+	/*
+	 * Actual task count may exceed ctx->nr_pids due to 'dead'
+	 * tasks used as place-holders for PGIDs, but must not fall short.
+	 */
+	if (!ret && (nr_pids < ctx->nr_pids))
 		ret = -ESRCH;
 
+	atomic_set(&ctx->nr_total, nr_pids);
 	return ret;
 }
 
@@ -832,7 +892,7 @@ static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
 	if (!nsproxy)
 		return -ESRCH;
 
-	ctx->active_pid = -1;	/* see restore_activate_next, get_active_pid */
+	ctx->active_pid = -1;   /* see restore_activate_next, get_active_pid */
 
 	return 0;
 }
@@ -973,12 +1033,14 @@ static long restore_retval(void)
 	return ret;
 }
 
-long do_restart(struct ckpt_ctx *ctx, pid_t pid)
+long do_restart(struct ckpt_ctx *ctx, pid_t pid, unsigned long flags)
 {
 	long ret;
 
 	if (ctx)
 		ret = do_restore_coord(ctx, pid);
+	else if (flags & RESTART_GHOST)
+		ret = do_ghost_task();
 	else
 		ret = do_restore_task();
 
@@ -1025,8 +1087,7 @@ void exit_checkpoint(struct task_struct *tsk)
 	/* restarting zombies will activate next task in restart */
 	if (tsk->flags & PF_RESTARTING) {
 		BUG_ON(ctx->active_pid == -1);
-		if (restore_activate_next(ctx) < 0)
-			pr_warning("c/r: [%d] failed zombie exit\n", tsk->pid);
+		restore_task_done(ctx);
 	}
 
 	ckpt_ctx_put(ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index d6a1650..76a3fa9 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -238,6 +238,7 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	INIT_LIST_HEAD(&ctx->pgarr_list);
 	INIT_LIST_HEAD(&ctx->pgarr_pool);
 	init_waitqueue_head(&ctx->waitq);
+	init_waitqueue_head(&ctx->ghostq);
 
 	err = -EBADF;
 	ctx->file = fget(fd);
@@ -334,7 +335,7 @@ SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
 
-	ret = do_restart(ctx, pid);
+	ret = do_restart(ctx, pid, flags);
 
 	ckpt_ctx_put(ctx);
 	return ret;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 7c117fc..8e1cce7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -18,6 +18,7 @@
 /* restart user flags */
 #define RESTART_TASKSELF	0x1
 #define RESTART_FROZEN		0x2
+#define RESTART_GHOST		0x4
 
 #ifdef __KERNEL__
 #ifdef CONFIG_CHECKPOINT
@@ -44,7 +45,10 @@
 
 /* ckpt_ctx: uflags */
 #define CHECKPOINT_USER_FLAGS		CHECKPOINT_SUBTREE
-#define RESTART_USER_FLAGS		(RESTART_TASKSELF | RESTART_FROZEN)
+#define RESTART_USER_FLAGS  \
+	(RESTART_TASKSELF | \
+	 RESTART_FROZEN | \
+	 RESTART_GHOST)
 
 extern void exit_checkpoint(struct task_struct *tsk);
 
@@ -78,6 +82,9 @@ extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 extern char *ckpt_fill_fname(struct path *path, struct path *root,
 			     char *buf, int *len);
 
+/* pids */
+extern pid_t ckpt_pid_nr(struct ckpt_ctx *ctx, struct pid *pid);
+
 /* socket functions */
 extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
 			      struct socket *socket,
@@ -130,7 +137,7 @@ extern void ckpt_ctx_get(struct ckpt_ctx *ctx);
 extern void ckpt_ctx_put(struct ckpt_ctx *ctx);
 
 extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
-extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
+extern long do_restart(struct ckpt_ctx *ctx, pid_t pid, unsigned long flags);
 
 /* task */
 extern int ckpt_activate_next(struct ckpt_ctx *ctx);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index bf584cb..842177f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -215,6 +215,9 @@ struct ckpt_pids {
 	__s32 vsid;
 } __attribute__((aligned(8)));
 
+/* pids */
+#define CKPT_PID_NULL  -1
+
 /* task data */
 struct ckpt_hdr_task {
 	struct ckpt_hdr h;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index be45666..9b7b4dd 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -68,9 +68,11 @@ struct ckpt_ctx {
 	/* [multi-process restart] */
 	struct ckpt_pids *pids_arr;	/* array of all pids [restart] */
 	int nr_pids;			/* size of pids array */
+	atomic_t nr_total;		/* total tasks count (with ghosts) */
 	int active_pid;			/* (next) position in pids array */
-	struct completion complete;	/* container root and other tasks on */
-	wait_queue_head_t waitq;	/* start, end, and restart ordering */
+	struct completion complete;	/* completion for container root */
+	wait_queue_head_t waitq;	/* waitqueue for restarting tasks */
+	wait_queue_head_t ghostq;	/* waitqueue for ghost tasks */
 	struct cred *realcred, *ecred;	/* tmp storage for cred at restart */
 
 	struct ckpt_stats stats;	/* statistics */
-- 
1.6.0.4



* [PATCH v18 74/80] c/r: support for controlling terminal and job control
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (72 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 73/80] c/r: correctly restore pgid Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 75/80] powerpc: reserve checkpoint arch identifiers Oren Laadan
                     ` (7 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Ingo Molnar, Pavel Emelyanov

Add checkpoint/restart of controlling terminal: current->signal->tty.
This is only done for session leaders.

If the session leader belongs to the ancestor pid-ns, then checkpoint
skips this tty; On restart, it will not be restored, and whatever tty
is in place from parent pid-ns (at restart) will be inherited.
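
The checkpoint-side decision can be modelled in user space roughly as below.
This is only a sketch: struct model_tty and model_tty_is_saved() are
hypothetical names, and session_nr stands for what ckpt_pid_nr() would return
for the tty's session, which is assumed to be 0 when the session leader is not
visible from the checkpoint's root pid-ns.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of a tty from the checkpoint's root pid-ns */
struct model_tty {
	int session_nr;	/* session id in the root pid-ns, 0 if not visible */
};

/*
 * Returns true if the controlling terminal would be written to the
 * image (tty_objref set), false if it is skipped and later inherited
 * from the parent pid-ns at restart.
 */
static bool model_tty_is_saved(bool session_leader,
			       const struct model_tty *tty)
{
	/* only session leaders record a controlling terminal */
	if (!session_leader || tty == NULL)
		return false;
	/* a session owned by an ancestor pid-ns is not saved */
	return tty->session_nr > 0;
}
```

For a tty that is skipped, tty_objref stays 0 in the image, which is what
makes restart keep whatever controlling terminal the task already has.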

Changelog [v1]:
  - Don't restore tty_old_pgrp if pgid is CKPT_PID_NULL
  - Initialize pgrp to NULL in restore_signal

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/signal.c            |   79 +++++++++++++++++++++++++++++++++++++++-
 drivers/char/tty_io.c          |   33 +++++++++++++----
 include/linux/checkpoint.h     |    1 +
 include/linux/checkpoint_hdr.h |    6 +++
 include/linux/tty.h            |    5 +++
 5 files changed, 115 insertions(+), 9 deletions(-)

diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 5ff0734..cd3956d 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -316,11 +316,12 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 	struct ckpt_hdr_signal *h;
 	struct signal_struct *signal;
 	struct sigpending shared_pending;
+	struct tty_struct *tty = NULL;
 	struct rlimit *rlim;
 	struct timeval tval;
 	cputime_t cputime;
 	unsigned long flags;
-	int i, ret;
+	int i, ret = 0;
 
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
 	if (!h)
@@ -398,9 +399,34 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 	cputime_to_timeval(signal->it_prof_incr, &tval);
 	h->it_prof_incr = timeval_to_ns(&tval);
 
+	/* tty */
+	if (signal->leader) {
+		h->tty_old_pgrp = ckpt_pid_nr(ctx, signal->tty_old_pgrp);
+		tty = tty_kref_get(signal->tty);
+		if (tty) {
+			/* irq is already disabled */
+			spin_lock(&tty->ctrl_lock);
+			h->tty_pgrp = ckpt_pid_nr(ctx, tty->pgrp);
+			spin_unlock(&tty->ctrl_lock);
+			tty_kref_put(tty);
+		}
+	}
+
 	unlock_task_sighand(t, &flags);
 
-	ret = ckpt_write_obj(ctx, &h->h);
+	/*
+	 * If the session is in an ancestor namespace, skip this tty
+	 * and set tty_objref = 0. It will not be explicitly restored,
+	 * but rather inherited from parent pid-ns at restart time.
+	 */
+	if (tty && ckpt_pid_nr(ctx, tty->session) > 0) {
+		h->tty_objref = checkpoint_obj(ctx, tty, CKPT_OBJ_TTY);
+		if (h->tty_objref < 0)
+			ret = h->tty_objref;
+	}
+
+	if (!ret)
+		ret = ckpt_write_obj(ctx, &h->h);
 	if (!ret)
 		ret = checkpoint_sigpending(ctx, &shared_pending);
 
@@ -471,8 +497,10 @@ static int restore_signal(struct ckpt_ctx *ctx)
 	struct ckpt_hdr_signal *h;
 	struct sigpending new_pending;
 	struct sigpending *pending;
+	struct tty_struct *tty = NULL;
 	struct itimerval itimer;
 	struct rlimit rlim;
+	struct pid *pgrp = NULL;
 	int i, ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
@@ -492,6 +520,40 @@ static int restore_signal(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 
+	/* tty - session */
+	if (h->tty_objref) {
+		tty = ckpt_obj_fetch(ctx, h->tty_objref, CKPT_OBJ_TTY);
+		if (IS_ERR(tty)) {
+			ret = PTR_ERR(tty);
+			goto out;
+		}
+		/* this will fail unless we're the session leader */
+		ret = tiocsctty(tty, 0);
+		if (ret < 0)
+			goto out;
+		/* now restore the foreground group (job control) */
+		if (h->tty_pgrp) {
+			/*
+			 * If tty_pgrp == CKPT_PID_NULL, below will
+			 * fail, so no need for explicit test
+			 */
+			ret = do_tiocspgrp(tty, tty_pair_get_tty(tty),
+					   h->tty_pgrp);
+			if (ret < 0)
+				goto out;
+		}
+	} else {
+		/*
+		 * If tty_objref isn't set, we _keep_ whatever tty we
+		 * already have as a ctty. Why does this make sense ?
+		 * - If our session is "within" the restart context,
+		 * then that session has no controlling terminal.
+		 * - If our session is "outside" the restart context,
+		 * then we'd like to keep whatever we inherit from
+		 * the parent pid-ns.
+		 */
+	}
+
 	/*
 	 * Reset real/virt/prof itimer (in case they were set), to
 	 * prevent unwanted signals after flushing current signals
@@ -503,7 +565,20 @@ static int restore_signal(struct ckpt_ctx *ctx)
 	do_setitimer(ITIMER_VIRTUAL, &itimer, NULL);
 	do_setitimer(ITIMER_PROF, &itimer, NULL);
 
+	/* tty - tty_old_pgrp */
+	if (current->signal->leader && h->tty_old_pgrp != CKPT_PID_NULL) {
+		rcu_read_lock();
+		pgrp = get_pid(_ckpt_find_pgrp(ctx, h->tty_old_pgrp));
+		rcu_read_unlock();
+		if (!pgrp)
+			goto out;
+	}
+
 	spin_lock_irq(&current->sighand->siglock);
+	/* tty - tty_old_pgrp */
+	put_pid(current->signal->tty_old_pgrp);
+	current->signal->tty_old_pgrp = pgrp;
+	/* pending signals */
 	pending = &current->signal->shared_pending;
 	flush_sigqueue(pending);
 	pending->signal = new_pending.signal;
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 72f4432..1b220c1 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -2130,7 +2130,7 @@ static int fionbio(struct file *file, int __user *p)
  *		Takes ->siglock() when updating signal->tty
  */
 
-static int tiocsctty(struct tty_struct *tty, int arg)
+int tiocsctty(struct tty_struct *tty, int arg)
 {
 	int ret = 0;
 	if (current->signal->leader && (task_session(current) == tty->session))
@@ -2219,10 +2219,10 @@ static int tiocgpgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t
 }
 
 /**
- *	tiocspgrp		-	attempt to set process group
+ *	do_tiocspgrp		-	attempt to set process group
  *	@tty: tty passed by user
  *	@real_tty: tty side device matching tty passed by user
- *	@p: pid pointer
+ *	@pgrp_nr: process group number to set
  *
  *	Set the process group of the tty to the session passed. Only
  *	permitted where the tty session is our session.
@@ -2230,10 +2230,10 @@ static int tiocgpgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t
  *	Locking: RCU, ctrl lock
  */
 
-static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
+int do_tiocspgrp(struct tty_struct *tty,
+		 struct tty_struct *real_tty, pid_t pgrp_nr)
 {
 	struct pid *pgrp;
-	pid_t pgrp_nr;
 	int retval = tty_check_change(real_tty);
 	unsigned long flags;
 
@@ -2245,8 +2245,6 @@ static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t
 	    (current->signal->tty != real_tty) ||
 	    (real_tty->session != task_session(current)))
 		return -ENOTTY;
-	if (get_user(pgrp_nr, p))
-		return -EFAULT;
 	if (pgrp_nr < 0)
 		return -EINVAL;
 	rcu_read_lock();
@@ -2268,6 +2266,27 @@ out_unlock:
 }
 
 /**
+ *	tiocspgrp		-	attempt to set process group
+ *	@tty: tty passed by user
+ *	@real_tty: tty side device matching tty passed by user
+ *	@p: pid pointer
+ *
+ *	Set the process group of the tty to the session passed. Only
+ *	permitted where the tty session is our session.
+ *
+ *	Locking: RCU, ctrl lock
+ */
+
+static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
+{
+	pid_t pgrp_nr;
+
+	if (get_user(pgrp_nr, p))
+		return -EFAULT;
+	return do_tiocspgrp(tty, real_tty, pgrp_nr);
+}
+
+/**
  *	tiocgsid		-	get session id
  *	@tty: tty passed by user
  *	@real_tty: tty side of the tty pased by the user if a pty else the tty
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 8e1cce7..e00dd70 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -84,6 +84,7 @@ extern char *ckpt_fill_fname(struct path *path, struct path *root,
 
 /* pids */
 extern pid_t ckpt_pid_nr(struct ckpt_ctx *ctx, struct pid *pid);
+extern struct pid *_ckpt_find_pgrp(struct ckpt_ctx *ctx, pid_t pgid);
 
 /* socket functions */
 extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 842177f..9ae35a0 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -578,13 +578,19 @@ struct ckpt_rlimit {
 
 struct ckpt_hdr_signal {
 	struct ckpt_hdr h;
+	/* rlimit */
 	struct ckpt_rlimit rlim[CKPT_RLIM_NLIMITS];
+	/* itimer */
 	__u64 it_real_value;
 	__u64 it_real_incr;
 	__u64 it_virt_value;
 	__u64 it_virt_incr;
 	__u64 it_prof_value;
 	__u64 it_prof_incr;
+	/* tty */
+	__s32 tty_objref;
+	__s32 tty_pgrp;
+	__s32 tty_old_pgrp;
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_signal_task {
diff --git a/include/linux/tty.h b/include/linux/tty.h
index 295447b..9447251 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -471,6 +471,11 @@ extern void tty_ldisc_enable(struct tty_struct *tty);
 /* This one is for ptmx_close() */
 extern int tty_release(struct inode *inode, struct file *filp);
 
+/* These are for checkpoint/restart */
+extern int tiocsctty(struct tty_struct *tty, int arg);
+extern int do_tiocspgrp(struct tty_struct *tty,
+			struct tty_struct *real_tty, pid_t pgrp_nr);
+
 #ifdef CONFIG_CHECKPOINT
 struct ckpt_ctx;
 struct ckpt_hdr_file;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 74/80] c/r: support for controlling terminal and job control
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Oren Laadan,
	Oren Laadan

Add checkpoint/restart of the controlling terminal
(current->signal->tty). This is done only for session leaders.

If the session leader belongs to an ancestor pid-ns, checkpoint
skips this tty. On restart the tty is not restored; the task instead
inherits whatever tty is in place from the parent pid-ns at restart.

Changelog [v1]:
  - Don't restore tty_old_pgrp if pgid is CKPT_PID_NULL
  - Initialize pgrp to NULL in restore_signal

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/signal.c            |   79 +++++++++++++++++++++++++++++++++++++++-
 drivers/char/tty_io.c          |   33 +++++++++++++----
 include/linux/checkpoint.h     |    1 +
 include/linux/checkpoint_hdr.h |    6 +++
 include/linux/tty.h            |    5 +++
 5 files changed, 115 insertions(+), 9 deletions(-)
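
Reading aid, not part of the patch: the restart-side decision made in
restore_signal() can be modelled in user space roughly as below. This
is only a sketch. The struct, the enum and plan_ctty_restore() are
hypothetical names; only the three tty_* field names and the order of
the steps (tiocsctty() first, then do_tiocspgrp()) are taken from the
diff that follows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the three tty fields added to ckpt_hdr_signal. */
struct tty_image {
	int32_t tty_objref;	/* 0: the tty was skipped at checkpoint */
	int32_t tty_pgrp;	/* foreground process group, 0 if unset */
	int32_t tty_old_pgrp;	/* carried along, not used by this model */
};

/* Hypothetical outcome codes for the model. */
enum ctty_action {
	CTTY_KEEP_INHERITED,	/* keep whatever ctty the task already has */
	CTTY_SET,		/* tiocsctty() only */
	CTTY_SET_AND_FG,	/* tiocsctty(), then do_tiocspgrp() */
	CTTY_FAIL,		/* tiocsctty() refuses: not a session leader */
};

/* Decide what the restart path would do for one task's signal header. */
enum ctty_action plan_ctty_restore(const struct tty_image *img,
				   bool session_leader)
{
	assert(img != NULL);

	/* No objref recorded: nothing to restore, the ctty is inherited. */
	if (img->tty_objref == 0)
		return CTTY_KEEP_INHERITED;

	/* Only a session leader may acquire a controlling terminal. */
	if (!session_leader)
		return CTTY_FAIL;

	/* Job control is restored only when a foreground group was saved. */
	return img->tty_pgrp ? CTTY_SET_AND_FG : CTTY_SET;
}
```

For example, an image with tty_objref == 0 yields CTTY_KEEP_INHERITED
no matter what tty_pgrp holds, which matches the "else" branch comment
in restore_signal().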

diff --git a/checkpoint/signal.c b/checkpoint/signal.c
index 5ff0734..cd3956d 100644
--- a/checkpoint/signal.c
+++ b/checkpoint/signal.c
@@ -316,11 +316,12 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 	struct ckpt_hdr_signal *h;
 	struct signal_struct *signal;
 	struct sigpending shared_pending;
+	struct tty_struct *tty = NULL;
 	struct rlimit *rlim;
 	struct timeval tval;
 	cputime_t cputime;
 	unsigned long flags;
-	int i, ret;
+	int i, ret = 0;
 
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
 	if (!h)
@@ -398,9 +399,34 @@ static int checkpoint_signal(struct ckpt_ctx *ctx, struct task_struct *t)
 	cputime_to_timeval(signal->it_prof_incr, &tval);
 	h->it_prof_incr = timeval_to_ns(&tval);
 
+	/* tty */
+	if (signal->leader) {
+		h->tty_old_pgrp = ckpt_pid_nr(ctx, signal->tty_old_pgrp);
+		tty = tty_kref_get(signal->tty);
+		if (tty) {
+			/* irq is already disabled */
+			spin_lock(&tty->ctrl_lock);
+			h->tty_pgrp = ckpt_pid_nr(ctx, tty->pgrp);
+			spin_unlock(&tty->ctrl_lock);
+			tty_kref_put(tty);
+		}
+	}
+
 	unlock_task_sighand(t, &flags);
 
-	ret = ckpt_write_obj(ctx, &h->h);
+	/*
+	 * If the session is in an ancestor namespace, skip this tty
+	 * and set tty_objref = 0. It will not be explicitly restored,
+	 * but rather inherited from parent pid-ns at restart time.
+	 */
+	if (tty && ckpt_pid_nr(ctx, tty->session) > 0) {
+		h->tty_objref = checkpoint_obj(ctx, tty, CKPT_OBJ_TTY);
+		if (h->tty_objref < 0)
+			ret = h->tty_objref;
+	}
+
+	if (!ret)
+		ret = ckpt_write_obj(ctx, &h->h);
 	if (!ret)
 		ret = checkpoint_sigpending(ctx, &shared_pending);
 
@@ -471,8 +497,10 @@ static int restore_signal(struct ckpt_ctx *ctx)
 	struct ckpt_hdr_signal *h;
 	struct sigpending new_pending;
 	struct sigpending *pending;
+	struct tty_struct *tty = NULL;
 	struct itimerval itimer;
 	struct rlimit rlim;
+	struct pid *pgrp = NULL;
 	int i, ret;
 
 	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_SIGNAL);
@@ -492,6 +520,40 @@ static int restore_signal(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 
+	/* tty - session */
+	if (h->tty_objref) {
+		tty = ckpt_obj_fetch(ctx, h->tty_objref, CKPT_OBJ_TTY);
+		if (IS_ERR(tty)) {
+			ret = PTR_ERR(tty);
+			goto out;
+		}
+		/* this will fail unless we're the session leader */
+		ret = tiocsctty(tty, 0);
+		if (ret < 0)
+			goto out;
+		/* now restore the foreground group (job control) */
+		if (h->tty_pgrp) {
+			/*
+			 * If tty_pgrp == CKPT_PID_NULL, below will
+			 * fail, so no need for explicit test
+			 */
+			ret = do_tiocspgrp(tty, tty_pair_get_tty(tty),
+					   h->tty_pgrp);
+			if (ret < 0)
+				goto out;
+		}
+	} else {
+		/*
+		 * If tty_objref isn't set, we _keep_ whatever tty we
+		 * already have as a ctty. Why does this make sense ?
+		 * - If our session is "within" the restart context,
+		 * then that session has no controlling terminal.
+		 * - If our session is "outside" the restart context,
+		 * then we'd like to keep whatever we inherit from
+		 * the parent pid-ns.
+		 */
+	}
+
 	/*
 	 * Reset real/virt/prof itimer (in case they were set), to
 	 * prevent unwanted signals after flushing current signals
@@ -503,7 +565,20 @@ static int restore_signal(struct ckpt_ctx *ctx)
 	do_setitimer(ITIMER_VIRTUAL, &itimer, NULL);
 	do_setitimer(ITIMER_PROF, &itimer, NULL);
 
+	/* tty - tty_old_pgrp */
+	if (current->signal->leader && h->tty_old_pgrp != CKPT_PID_NULL) {
+		rcu_read_lock();
+		pgrp = get_pid(_ckpt_find_pgrp(ctx, h->tty_old_pgrp));
+		rcu_read_unlock();
+		if (!pgrp)
+			goto out;
+	}
+
 	spin_lock_irq(&current->sighand->siglock);
+	/* tty - tty_old_pgrp */
+	put_pid(current->signal->tty_old_pgrp);
+	current->signal->tty_old_pgrp = pgrp;
+	/* pending signals */
 	pending = &current->signal->shared_pending;
 	flush_sigqueue(pending);
 	pending->signal = new_pending.signal;
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 72f4432..1b220c1 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -2130,7 +2130,7 @@ static int fionbio(struct file *file, int __user *p)
  *		Takes ->siglock() when updating signal->tty
  */
 
-static int tiocsctty(struct tty_struct *tty, int arg)
+int tiocsctty(struct tty_struct *tty, int arg)
 {
 	int ret = 0;
 	if (current->signal->leader && (task_session(current) == tty->session))
@@ -2219,10 +2219,10 @@ static int tiocgpgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t
 }
 
 /**
- *	tiocspgrp		-	attempt to set process group
+ *	do_tiocspgrp		-	attempt to set process group
  *	@tty: tty passed by user
  *	@real_tty: tty side device matching tty passed by user
- *	@p: pid pointer
+ *	@pgrp_nr: process group id to set
  *
  *	Set the process group of the tty to the session passed. Only
  *	permitted where the tty session is our session.
@@ -2230,10 +2230,10 @@ static int tiocgpgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t
  *	Locking: RCU, ctrl lock
  */
 
-static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
+int do_tiocspgrp(struct tty_struct *tty,
+		 struct tty_struct *real_tty, pid_t pgrp_nr)
 {
 	struct pid *pgrp;
-	pid_t pgrp_nr;
 	int retval = tty_check_change(real_tty);
 	unsigned long flags;
 
@@ -2245,8 +2245,6 @@ static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t
 	    (current->signal->tty != real_tty) ||
 	    (real_tty->session != task_session(current)))
 		return -ENOTTY;
-	if (get_user(pgrp_nr, p))
-		return -EFAULT;
 	if (pgrp_nr < 0)
 		return -EINVAL;
 	rcu_read_lock();
@@ -2268,6 +2266,27 @@ out_unlock:
 }
 
 /**
+ *	tiocspgrp		-	attempt to set process group
+ *	@tty: tty passed by user
+ *	@real_tty: tty side device matching tty passed by user
+ *	@p: pid pointer
+ *
+ *	Set the process group of the tty to the session passed. Only
+ *	permitted where the tty session is our session.
+ *
+ *	Locking: RCU, ctrl lock
+ */
+
+static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
+{
+	pid_t pgrp_nr;
+
+	if (get_user(pgrp_nr, p))
+		return -EFAULT;
+	return do_tiocspgrp(tty, real_tty, pgrp_nr);
+}
+
+/**
  *	tiocgsid		-	get session id
  *	@tty: tty passed by user
  *	@real_tty: tty side of the tty pased by the user if a pty else the tty
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 8e1cce7..e00dd70 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -84,6 +84,7 @@ extern char *ckpt_fill_fname(struct path *path, struct path *root,
 
 /* pids */
 extern pid_t ckpt_pid_nr(struct ckpt_ctx *ctx, struct pid *pid);
+extern struct pid *_ckpt_find_pgrp(struct ckpt_ctx *ctx, pid_t pgid);
 
 /* socket functions */
 extern int ckpt_sock_getnames(struct ckpt_ctx *ctx,
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 842177f..9ae35a0 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -578,13 +578,19 @@ struct ckpt_rlimit {
 
 struct ckpt_hdr_signal {
 	struct ckpt_hdr h;
+	/* rlimit */
 	struct ckpt_rlimit rlim[CKPT_RLIM_NLIMITS];
+	/* itimer */
 	__u64 it_real_value;
 	__u64 it_real_incr;
 	__u64 it_virt_value;
 	__u64 it_virt_incr;
 	__u64 it_prof_value;
 	__u64 it_prof_incr;
+	/* tty */
+	__s32 tty_objref;
+	__s32 tty_pgrp;
+	__s32 tty_old_pgrp;
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_signal_task {
diff --git a/include/linux/tty.h b/include/linux/tty.h
index 295447b..9447251 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -471,6 +471,11 @@ extern void tty_ldisc_enable(struct tty_struct *tty);
 /* This one is for ptmx_close() */
 extern int tty_release(struct inode *inode, struct file *filp);
 
+/* These are for checkpoint/restart */
+extern int tiocsctty(struct tty_struct *tty, int arg);
+extern int do_tiocspgrp(struct tty_struct *tty,
+			struct tty_struct *real_tty, pid_t pgrp_nr);
+
 #ifdef CONFIG_CHECKPOINT
 struct ckpt_ctx;
 struct ckpt_hdr_file;
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 75/80] powerpc: reserve checkpoint arch identifiers
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (73 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 74/80] c/r: support for controlling terminal and job control Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 76/80] powerpc: provide APIs for validating and updating DABR Oren Laadan
                     ` (6 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Nathan Lynch, Ingo Molnar, Pavel Emelyanov

From: Nathan Lynch <ntl@pobox.com>

Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 include/linux/checkpoint_hdr.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)
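
Reading aid, not part of the patch: the enum below relies on implicit
numbering, so appending the two PowerPC entries leaves the existing
values untouched. A small user-space sketch of that property follows;
ckpt_arch_name() is a hypothetical helper written for this note, and
the numeric values are those implied by the enum as it appears in the
diff.

```c
#include <assert.h>
#include <string.h>

/* The enum as it reads after this patch; values follow from position. */
enum {
	CKPT_ARCH_X86_32 = 1,
	CKPT_ARCH_S390X,
	CKPT_ARCH_PPC32,
	CKPT_ARCH_PPC64,
};

/* Compile-time checks: the older identifiers keep their values. */
_Static_assert(CKPT_ARCH_X86_32 == 1, "x86_32 id must stay 1");
_Static_assert(CKPT_ARCH_S390X == 2, "s390x id must stay 2");

/* Hypothetical helper: printable name for an arch id read from an image. */
const char *ckpt_arch_name(int id)
{
	switch (id) {
	case CKPT_ARCH_X86_32:
		return "x86_32";
	case CKPT_ARCH_S390X:
		return "s390x";
	case CKPT_ARCH_PPC32:
		return "ppc32";
	case CKPT_ARCH_PPC64:
		return "ppc64";
	default:
		return "unknown";
	}
}
```

Inserting a new entry anywhere but at the end would renumber the ones
after it, which is what the "do not change order" comment guards
against.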

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 9ae35a0..2ed523f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -117,6 +117,8 @@ enum {
 	/* do not change order (will break ABI) */
 	CKPT_ARCH_X86_32 = 1,
 	CKPT_ARCH_S390X,
+	CKPT_ARCH_PPC32,
+	CKPT_ARCH_PPC64,
 };
 
 /* shared objrects (objref) */
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 76/80] powerpc: provide APIs for validating and updating DABR
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (74 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 75/80] powerpc: reserve checkpoint arch identifiers Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 77/80] powerpc: checkpoint/restart implementation Oren Laadan
                     ` (5 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, linux-mm, Linus Torvalds,
	Nathan Lynch, Ingo Molnar, Pavel Emelyanov

From: Nathan Lynch <ntl@pobox.com>

A checkpointed task image may specify a value for the DABR (Data
Access Breakpoint Register).  The restart code needs to validate this
value before making any changes to the current task.

ptrace_set_debugreg encapsulates the bounds checking and platform
dependencies of programming the DABR.  Split this into "validate"
(debugreg_valid) and "update" (debugreg_update) functions, and make
them available for use outside of the ptrace code.

Also ptrace_set_debugreg has extern linkage, but no users outside of
ptrace.c.  Make it static.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 arch/powerpc/include/asm/ptrace.h |    7 +++
 arch/powerpc/kernel/ptrace.c      |   88 +++++++++++++++++++++++++------------
 2 files changed, 66 insertions(+), 29 deletions(-)
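
Reading aid, not part of the patch: the point of the split is that a
restart path can reject a bad image before it modifies the task. A
user-space sketch of the DABR (non-BOOKE) case is below. All model_*
names are hypothetical, and MODEL_TASK_SIZE and MODEL_DABR_TRANSLATION
are placeholder values assumed for illustration; the kernel takes the
real ones from its own headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder constants, assumed for this sketch only. */
#define MODEL_TASK_SIZE		0x80000000UL	/* user address limit */
#define MODEL_DABR_TRANSLATION	0x4UL		/* translation flag bit */

/* Stand-in for the per-task thread state that holds the DABR value. */
struct model_thread {
	unsigned long dabr;
};

/* Mirrors the checks of debugreg_valid() on the non-BOOKE path. */
bool model_debugreg_valid(unsigned long val, unsigned int index)
{
	/* Only one debug register is supported. */
	if (index != 0)
		return false;

	/* The bottom 3 bits are flags; the rest is a user address. */
	if ((val & ~0x7UL) >= MODEL_TASK_SIZE)
		return false;

	/* A non-zero value must have the translation bit set. */
	if (val && !(val & MODEL_DABR_TRANSLATION))
		return false;

	return true;
}

/* Mirrors debugreg_update() on the non-BOOKE path: a plain store. */
void model_debugreg_update(struct model_thread *t, unsigned long val)
{
	t->dabr = val;
}

/* Restart-style caller: validate first, touch state only afterwards. */
int model_restore_dabr(struct model_thread *t, unsigned long val)
{
	if (!model_debugreg_valid(val, 0))
		return -1;
	model_debugreg_update(t, val);
	return 0;
}
```

A caller that restores several registers can run every
model_debugreg_valid() check up front and apply the updates only if all
of them pass, so a rejected image leaves the task state as it was.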

diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index 8c34149..c6cb2c6 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -81,6 +81,8 @@ struct pt_regs {
 
 #ifndef __ASSEMBLY__
 
+#include <linux/types.h>
+
 #define instruction_pointer(regs) ((regs)->nip)
 #define user_stack_pointer(regs) ((regs)->gpr[1])
 #define regs_return_value(regs) ((regs)->gpr[3])
@@ -140,6 +142,11 @@ extern void user_enable_single_step(struct task_struct *);
 extern void user_enable_block_step(struct task_struct *);
 extern void user_disable_single_step(struct task_struct *);
 
+/* for reprogramming DABR/DAC during restart of a checkpointed task */
+extern bool debugreg_valid(unsigned long val, unsigned int index);
+extern void debugreg_update(struct task_struct *task, unsigned long val,
+			    unsigned int index);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index ef14988..913ec8f 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -755,22 +755,25 @@ void user_disable_single_step(struct task_struct *task)
 	clear_tsk_thread_flag(task, TIF_SINGLESTEP);
 }
 
-int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
-			       unsigned long data)
+/**
+ * debugreg_valid() - validate the value to be written to a debug register
+ * @val:	The prospective contents of the register.
+ * @index:	Must be zero.
+ *
+ * Returns true if @val is an acceptable value for the register indicated by
+ * @index, false otherwise.
+ */
+bool debugreg_valid(unsigned long val, unsigned int index)
 {
-	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
-	 *  For embedded processors we support one DAC and no IAC's at the
-	 *  moment.
-	 */
-	if (addr > 0)
-		return -EINVAL;
+	/* We support only one debug register for now */
+	if (index != 0)
+		return false;
 
 	/* The bottom 3 bits in dabr are flags */
-	if ((data & ~0x7UL) >= TASK_SIZE)
-		return -EIO;
+	if ((val & ~0x7UL) >= TASK_SIZE)
+		return false;
 
 #ifndef CONFIG_BOOKE
-
 	/* For processors using DABR (i.e. 970), the bottom 3 bits are flags.
 	 *  It was assumed, on previous implementations, that 3 bits were
 	 *  passed together with the data address, fitting the design of the
@@ -784,47 +787,74 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 	 */
 
 	/* Ensure breakpoint translation bit is set */
-	if (data && !(data & DABR_TRANSLATION))
-		return -EIO;
-
-	/* Move contents to the DABR register */
-	task->thread.dabr = data;
-
-#endif
-#if defined(CONFIG_BOOKE)
-
+	if (val && !(val & DABR_TRANSLATION))
+		return false;
+#else
 	/* As described above, it was assumed 3 bits were passed with the data
 	 *  address, but we will assume only the mode bits will be passed
 	 *  as to not cause alignment restrictions for DAC-based processors.
 	 */
 
+	/* Read or Write bits must be set, unless the DAC is being cleared */
+	if ((val & ~0x3UL) && !(val & 0x3UL))
+		return false;
+#endif
+	return true;
+}
+
+/**
+ * debugreg_update() - update a debug register associated with a task
+ * @task:	The task whose register state is to be modified.
+ * @val:	The value to be written to the debug register.
+ * @index:	Specifies the debug register.  Currently unused.
+ *
+ * Set a task's DABR/DAC to @val, which should be validated with
+ * debugreg_valid() beforehand.
+ */
+void debugreg_update(struct task_struct *task, unsigned long val,
+		     unsigned int index)
+{
+#ifndef CONFIG_BOOKE
+	task->thread.dabr = val;
+#else
 	/* DAC's hold the whole address without any mode flags */
-	task->thread.dabr = data & ~0x3UL;
+	task->thread.dabr = val & ~0x3UL;
 
 	if (task->thread.dabr == 0) {
 		task->thread.dbcr0 &= ~(DBSR_DAC1R | DBSR_DAC1W | DBCR0_IDM);
 		task->thread.regs->msr &= ~MSR_DE;
-		return 0;
 	}
 
-	/* Read or Write bits must be set */
-
-	if (!(data & 0x3UL))
-		return -EINVAL;
-
 	/* Set the Internal Debugging flag (IDM bit 1) for the DBCR0
 	   register */
 	task->thread.dbcr0 = DBCR0_IDM;
 
 	/* Check for write and read flags and set DBCR0
 	   accordingly */
-	if (data & 0x1UL)
+	if (val & 0x1UL)
 		task->thread.dbcr0 |= DBSR_DAC1R;
-	if (data & 0x2UL)
+	if (val & 0x2UL)
 		task->thread.dbcr0 |= DBSR_DAC1W;
 
 	task->thread.regs->msr |= MSR_DE;
 #endif
+}
+
+static int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
+			       unsigned long data)
+{
+	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
+	 * For embedded processors we support one DAC and no IAC's at the
+	 * moment.
+	 */
+	if (addr > 0)
+		return -EINVAL;
+
+	if (!debugreg_valid(data, 0))
+		return -EIO;
+
+	debugreg_update(task, data, 0);
+
 	return 0;
 }
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 76/80] powerpc: provide APIs for validating and updating DABR
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

A checkpointed task image may specify a value for the DABR (Data
Access Breakpoint Register).  The restart code needs to validate this
value before making any changes to the current task.

ptrace_set_debugreg encapsulates the bounds checking and platform
dependencies of programming the DABR.  Split this into "validate"
(debugreg_valid) and "update" (debugreg_update) functions, and make
them available for use outside of the ptrace code.

Also ptrace_set_debugreg has extern linkage, but no users outside of
ptrace.c.  Make it static.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 arch/powerpc/include/asm/ptrace.h |    7 +++
 arch/powerpc/kernel/ptrace.c      |   88 +++++++++++++++++++++++++------------
 2 files changed, 66 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index 8c34149..c6cb2c6 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -81,6 +81,8 @@ struct pt_regs {
 
 #ifndef __ASSEMBLY__
 
+#include <linux/types.h>
+
 #define instruction_pointer(regs) ((regs)->nip)
 #define user_stack_pointer(regs) ((regs)->gpr[1])
 #define regs_return_value(regs) ((regs)->gpr[3])
@@ -140,6 +142,11 @@ extern void user_enable_single_step(struct task_struct *);
 extern void user_enable_block_step(struct task_struct *);
 extern void user_disable_single_step(struct task_struct *);
 
+/* for reprogramming DABR/DAC during restart of a checkpointed task */
+extern bool debugreg_valid(unsigned long val, unsigned int index);
+extern void debugreg_update(struct task_struct *task, unsigned long val,
+			    unsigned int index);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index ef14988..913ec8f 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -755,22 +755,25 @@ void user_disable_single_step(struct task_struct *task)
 	clear_tsk_thread_flag(task, TIF_SINGLESTEP);
 }
 
-int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
-			       unsigned long data)
+/**
+ * debugreg_valid() - validate the value to be written to a debug register
+ * @val:	The prospective contents of the register.
+ * @index:	Must be zero.
+ *
+ * Returns true if @val is an acceptable value for the register indicated by
+ * @index, false otherwise.
+ */
+bool debugreg_valid(unsigned long val, unsigned int index)
 {
-	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
-	 *  For embedded processors we support one DAC and no IAC's at the
-	 *  moment.
-	 */
-	if (addr > 0)
-		return -EINVAL;
+	/* We support only one debug register for now */
+	if (index != 0)
+		return false;
 
 	/* The bottom 3 bits in dabr are flags */
-	if ((data & ~0x7UL) >= TASK_SIZE)
-		return -EIO;
+	if ((val & ~0x7UL) >= TASK_SIZE)
+		return false;
 
 #ifndef CONFIG_BOOKE
-
 	/* For processors using DABR (i.e. 970), the bottom 3 bits are flags.
 	 *  It was assumed, on previous implementations, that 3 bits were
 	 *  passed together with the data address, fitting the design of the
@@ -784,47 +787,74 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 	 */
 
 	/* Ensure breakpoint translation bit is set */
-	if (data && !(data & DABR_TRANSLATION))
-		return -EIO;
-
-	/* Move contents to the DABR register */
-	task->thread.dabr = data;
-
-#endif
-#if defined(CONFIG_BOOKE)
-
+	if (val && !(val & DABR_TRANSLATION))
+		return false;
+#else
 	/* As described above, it was assumed 3 bits were passed with the data
 	 *  address, but we will assume only the mode bits will be passed
 	 *  as to not cause alignment restrictions for DAC-based processors.
 	 */
 
+	/* Read or Write bits must be set */
+	if (!(val & 0x3UL))
+		return false;
+#endif
+	return true;
+}
+
+/**
+ * debugreg_update() - update a debug register associated with a task
+ * @task:	The task whose register state is to be modified.
+ * @val:	The value to be written to the debug register.
+ * @index:	Specifies the debug register.  Currently unused.
+ *
+ * Set a task's DABR/DAC to @val, which should be validated with
+ * debugreg_valid() beforehand.
+ */
+void debugreg_update(struct task_struct *task, unsigned long val,
+		     unsigned int index)
+{
+#ifndef CONFIG_BOOKE
+	task->thread.dabr = val;
+#else
 	/* DAC's hold the whole address without any mode flags */
-	task->thread.dabr = data & ~0x3UL;
+	task->thread.dabr = val & ~0x3UL;
 
 	if (task->thread.dabr == 0) {
 		task->thread.dbcr0 &= ~(DBSR_DAC1R | DBSR_DAC1W | DBCR0_IDM);
 		task->thread.regs->msr &= ~MSR_DE;
-		return 0;
 	}
 
-	/* Read or Write bits must be set */
-
-	if (!(data & 0x3UL))
-		return -EINVAL;
-
 	/* Set the Internal Debugging flag (IDM bit 1) for the DBCR0
 	   register */
 	task->thread.dbcr0 = DBCR0_IDM;
 
 	/* Check for write and read flags and set DBCR0
 	   accordingly */
-	if (data & 0x1UL)
+	if (val & 0x1UL)
 		task->thread.dbcr0 |= DBSR_DAC1R;
-	if (data & 0x2UL)
+	if (val & 0x2UL)
 		task->thread.dbcr0 |= DBSR_DAC1W;
 
 	task->thread.regs->msr |= MSR_DE;
 #endif
+}
+
+static int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
+			       unsigned long data)
+{
+	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
+	 * For embedded processors we support one DAC and no IAC's at the
+	 * moment.
+	 */
+	if (addr > 0)
+		return -EINVAL;
+
+	if (!debugreg_valid(data, 0))
+		return -EIO;
+
+	debugreg_update(task, data, 0);
+
 	return 0;
 }
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 76/80] powerpc: provide APIs for validating and updating DABR
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

A checkpointed task image may specify a value for the DABR (Data
Access Breakpoint Register).  The restart code needs to validate this
value before making any changes to the current task.

ptrace_set_debugreg encapsulates the bounds checking and platform
dependencies of programming the DABR.  Split this into "validate"
(debugreg_valid) and "update" (debugreg_update) functions, and make
them available for use outside of the ptrace code.
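
For readers following along, the validate-then-update split can be
modelled in userspace roughly as below.  This is only a sketch under
stated assumptions, not the kernel code: struct model_task,
MODEL_TASK_SIZE and MODEL_TRANSLATION are hypothetical stand-ins for
task_struct, TASK_SIZE and DABR_TRANSLATION, and only the non-BookE
rules are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the values are assumptions for illustration. */
#define MODEL_TASK_SIZE   0x0000100000000000ULL
#define MODEL_TRANSLATION 0x4ULL	/* plays the role of DABR_TRANSLATION */

struct model_task {
	uint64_t dabr;
};

/* "validate": pure check, never touches the task */
static bool model_debugreg_valid(uint64_t val, unsigned int index)
{
	if (index != 0)			/* only one debug register */
		return false;
	if ((val & ~0x7ULL) >= MODEL_TASK_SIZE)	/* low 3 bits are flags */
		return false;
	if (val && !(val & MODEL_TRANSLATION))
		return false;
	return true;
}

/* "update": assumes the value was validated beforehand */
static void model_debugreg_update(struct model_task *task, uint64_t val,
				  unsigned int index)
{
	(void)index;			/* unused, as in the patch */
	task->dabr = val;
}

/* A caller such as restart code: check first, commit second */
static int model_restore_dabr(struct model_task *task, uint64_t val)
{
	if (!model_debugreg_valid(val, 0))
		return -1;
	model_debugreg_update(task, val, 0);
	return 0;
}
```

The point of the split is that a caller can reject a bad value before
any task state has been modified.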

ptrace_set_debugreg also has external linkage but no users outside of
ptrace.c, so make it static.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 arch/powerpc/include/asm/ptrace.h |    7 +++
 arch/powerpc/kernel/ptrace.c      |   88 +++++++++++++++++++++++++------------
 2 files changed, 66 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index 8c34149..c6cb2c6 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -81,6 +81,8 @@ struct pt_regs {
 
 #ifndef __ASSEMBLY__
 
+#include <linux/types.h>
+
 #define instruction_pointer(regs) ((regs)->nip)
 #define user_stack_pointer(regs) ((regs)->gpr[1])
 #define regs_return_value(regs) ((regs)->gpr[3])
@@ -140,6 +142,11 @@ extern void user_enable_single_step(struct task_struct *);
 extern void user_enable_block_step(struct task_struct *);
 extern void user_disable_single_step(struct task_struct *);
 
+/* for reprogramming DABR/DAC during restart of a checkpointed task */
+extern bool debugreg_valid(unsigned long val, unsigned int index);
+extern void debugreg_update(struct task_struct *task, unsigned long val,
+			    unsigned int index);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index ef14988..913ec8f 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -755,22 +755,25 @@ void user_disable_single_step(struct task_struct *task)
 	clear_tsk_thread_flag(task, TIF_SINGLESTEP);
 }
 
-int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
-			       unsigned long data)
+/**
+ * debugreg_valid() - validate the value to be written to a debug register
+ * @val:	The prospective contents of the register.
+ * @index:	Must be zero.
+ *
+ * Returns true if @val is an acceptable value for the register indicated by
+ * @index, false otherwise.
+ */
+bool debugreg_valid(unsigned long val, unsigned int index)
 {
-	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
-	 *  For embedded processors we support one DAC and no IAC's at the
-	 *  moment.
-	 */
-	if (addr > 0)
-		return -EINVAL;
+	/* We support only one debug register for now */
+	if (index != 0)
+		return false;
 
 	/* The bottom 3 bits in dabr are flags */
-	if ((data & ~0x7UL) >= TASK_SIZE)
-		return -EIO;
+	if ((val & ~0x7UL) >= TASK_SIZE)
+		return false;
 
 #ifndef CONFIG_BOOKE
-
 	/* For processors using DABR (i.e. 970), the bottom 3 bits are flags.
 	 *  It was assumed, on previous implementations, that 3 bits were
 	 *  passed together with the data address, fitting the design of the
@@ -784,47 +787,74 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 	 */
 
 	/* Ensure breakpoint translation bit is set */
-	if (data && !(data & DABR_TRANSLATION))
-		return -EIO;
-
-	/* Move contents to the DABR register */
-	task->thread.dabr = data;
-
-#endif
-#if defined(CONFIG_BOOKE)
-
+	if (val && !(val & DABR_TRANSLATION))
+		return false;
+#else
 	/* As described above, it was assumed 3 bits were passed with the data
 	 *  address, but we will assume only the mode bits will be passed
 	 *  as to not cause alignment restrictions for DAC-based processors.
 	 */
 
+	/* Read or Write bits must be set */
+	if (!(val & 0x3UL))
+		return false;
+#endif
+	return true;
+}
+
+/**
+ * debugreg_update() - update a debug register associated with a task
+ * @task:	The task whose register state is to be modified.
+ * @val:	The value to be written to the debug register.
+ * @index:	Specifies the debug register.  Currently unused.
+ *
+ * Set a task's DABR/DAC to @val, which should be validated with
+ * debugreg_valid() beforehand.
+ */
+void debugreg_update(struct task_struct *task, unsigned long val,
+		     unsigned int index)
+{
+#ifndef CONFIG_BOOKE
+	task->thread.dabr = val;
+#else
 	/* DAC's hold the whole address without any mode flags */
-	task->thread.dabr = data & ~0x3UL;
+	task->thread.dabr = val & ~0x3UL;
 
 	if (task->thread.dabr == 0) {
 		task->thread.dbcr0 &= ~(DBSR_DAC1R | DBSR_DAC1W | DBCR0_IDM);
 		task->thread.regs->msr &= ~MSR_DE;
-		return 0;
 	}
 
-	/* Read or Write bits must be set */
-
-	if (!(data & 0x3UL))
-		return -EINVAL;
-
 	/* Set the Internal Debugging flag (IDM bit 1) for the DBCR0
 	   register */
 	task->thread.dbcr0 = DBCR0_IDM;
 
 	/* Check for write and read flags and set DBCR0
 	   accordingly */
-	if (data & 0x1UL)
+	if (val & 0x1UL)
 		task->thread.dbcr0 |= DBSR_DAC1R;
-	if (data & 0x2UL)
+	if (val & 0x2UL)
 		task->thread.dbcr0 |= DBSR_DAC1W;
 
 	task->thread.regs->msr |= MSR_DE;
 #endif
+}
+
+static int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
+			       unsigned long data)
+{
+	/* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
+	 * For embedded processors we support one DAC and no IAC's at the
+	 * moment.
+	 */
+	if (addr > 0)
+		return -EINVAL;
+
+	if (!debugreg_valid(data, 0))
+		return -EIO;
+
+	debugreg_update(task, data, 0);
+
 	return 0;
 }
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 77/80] powerpc: checkpoint/restart implementation
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (75 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 76/80] powerpc: provide APIs for validating and updating DABR Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 78/80] powerpc: wire up checkpoint and restart syscalls Oren Laadan
                     ` (4 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Nathan Lynch,
	Ingo Molnar, Pavel Emelyanov

From: Nathan Lynch <ntl@pobox.com>

Support for checkpointing and restarting GPRs, FPU state, DABR, and
Altivec state.

The portion of the checkpoint image manipulated by this code begins
with a bitmask of features indicating the various contexts saved.
Fields in the image that can vary depending on kernel configuration
(e.g. FP regs due to VSX) have their sizes explicitly recorded, except
for the GPRs, so migrating between ppc32 and ppc64 won't work yet.
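
As a rough illustration of the feature bitmask and the recorded sizes,
here is a userspace sketch.  All names (img_feature, img_cpu_hdr,
IMG_FTRS_POSSIBLE) are hypothetical, and the "possible" set is assumed
to be FP plus debug only; the real set depends on the kernel
configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the feature enum; names are illustrative. */
enum img_feature {
	IMG_USED_FP,
	IMG_USED_DEBUG,
	IMG_USED_ALTIVEC,
};

#define IMG_BIT(ftr) (UINT32_C(1) << (ftr))

/* Assumed for this sketch: the restoring side handles FP and debug only. */
#define IMG_FTRS_POSSIBLE (IMG_BIT(IMG_USED_FP) | IMG_BIT(IMG_USED_DEBUG))

struct img_cpu_hdr {
	uint32_t features_used;	/* one bit per saved context */
	uint32_t fpr_size;	/* recorded size of the FP register area */
};

static void img_feature_set(struct img_cpu_hdr *hdr, enum img_feature ftr)
{
	hdr->features_used |= IMG_BIT(ftr);
}

static bool img_feature_isset(const struct img_cpu_hdr *hdr,
			      enum img_feature ftr)
{
	return (hdr->features_used & IMG_BIT(ftr)) != 0;
}

/* true if the image uses a context this build cannot restore */
static bool img_features_unknown(const struct img_cpu_hdr *hdr)
{
	return (hdr->features_used & ~IMG_FTRS_POSSIBLE) != 0;
}

/* a recorded size must match the local layout before anything is copied */
static bool img_fpr_size_ok(const struct img_cpu_hdr *hdr, uint32_t local_size)
{
	return hdr->fpr_size == local_size;
}
```

An image with a bit outside the possible set, or with a mismatched
recorded size, would be refused before any state is restored.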

The restart code ensures that the task is not modified until the
checkpoint image is validated against the current kernel configuration
and hardware features (e.g. can't restart a task using Altivec on
non-Altivec systems).
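
The "validate everything, then modify" behaviour can be sketched as a
two-pass loop over a table of restore helpers.  This is a simplified
userspace model with hypothetical types (model_image, model_state) and
an assumed size rule (MODEL_FP_SIZE); it is not the code in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_FP_SIZE 8	/* assumed local size of the FP area */

/* Hypothetical image and task state for the sketch */
struct model_image { int regs; int fp_size; int fp; };
struct model_state { int regs; int fp; };

typedef int (*model_restore_fn)(const struct model_image *,
				struct model_state *, bool update);

static int model_restore_regs(const struct model_image *img,
			      struct model_state *st, bool update)
{
	if (update)
		st->regs = img->regs;
	return 0;
}

static int model_restore_fp(const struct model_image *img,
			    struct model_state *st, bool update)
{
	if (img->fp_size != MODEL_FP_SIZE)	/* check phase can fail */
		return -1;
	if (update)
		st->fp = img->fp;
	return 0;
}

static const model_restore_fn model_funcs[] = {
	model_restore_regs,
	model_restore_fp,
};

/* pass 0 only checks; pass 1 commits, and runs only if pass 0 passed */
static int model_restore(const struct model_image *img, struct model_state *st)
{
	for (int pass = 0; pass < 2; pass++) {
		bool update = (pass == 1);
		size_t n = sizeof(model_funcs) / sizeof(model_funcs[0]);

		for (size_t i = 0; i < n; i++) {
			int rc = model_funcs[i](img, st, update);
			if (rc)
				return rc;
		}
	}
	return 0;
}
```

Because every helper is run once with update == false before any helper
is run with update == true, a failure in a later helper cannot leave
earlier state half-restored.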

What works:
* self and external checkpoint of simple (single thread, one open
  file) 32- and 64-bit processes on a ppc64 kernel

What doesn't work:
* restarting a 32-bit task from a 64-bit task and vice versa

Untested:
* ppc32 (but it builds)

Signed-off-by: Nathan Lynch <ntl@pobox.com>
[Oren Laadan <orenl@cs.columbia.edu>] Add arch-specific tty support
---
 arch/powerpc/include/asm/Kbuild           |    1 +
 arch/powerpc/include/asm/checkpoint_hdr.h |   37 ++
 arch/powerpc/mm/Makefile                  |    1 +
 arch/powerpc/mm/checkpoint.c              |  531 +++++++++++++++++++++++++++++
 4 files changed, 570 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h
 create mode 100644 arch/powerpc/mm/checkpoint.c

diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 5ab7d7f..20379f1 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -12,6 +12,7 @@ header-y += shmbuf.h
 header-y += socket.h
 header-y += termbits.h
 header-y += fcntl.h
+header-y += checkpoint_hdr.h
 header-y += poll.h
 header-y += sockios.h
 header-y += ucontext.h
diff --git a/arch/powerpc/include/asm/checkpoint_hdr.h b/arch/powerpc/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..fbb1705
--- /dev/null
+++ b/arch/powerpc/include/asm/checkpoint_hdr.h
@@ -0,0 +1,37 @@
+#ifndef __ASM_POWERPC_CKPT_HDR_H
+#define __ASM_POWERPC_CKPT_HDR_H
+
+#include <linux/types.h>
+
+/* arch dependent constants */
+#define CKPT_ARCH_NSIG 64
+#define CKPT_TTY_NCC  10
+
+#ifdef __KERNEL__
+
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _NSIG
+#error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
+#endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
+#endif
+
+#endif /* __KERNEL__ */
+
+#ifdef __KERNEL__
+#ifdef CONFIG_PPC64
+#define CKPT_ARCH_ID CKPT_ARCH_PPC64
+#else
+#define CKPT_ARCH_ID CKPT_ARCH_PPC32
+#endif
+#endif
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+	__u32 what;
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_POWERPC_CKPT_HDR_H */
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 3e68363..aa8733c 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_HIGHMEM)		+= highmem.o
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/arch/powerpc/mm/checkpoint.c b/arch/powerpc/mm/checkpoint.c
new file mode 100644
index 0000000..de18467
--- /dev/null
+++ b/arch/powerpc/mm/checkpoint.c
@@ -0,0 +1,531 @@
+/*
+ * PowerPC architecture support for checkpoint/restart.
+ * Based on x86 implementation.
+ *
+ * Copyright (C) 2008 Oren Laadan
+ * Copyright 2009 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ */
+
+#if 0
+#define DEBUG
+#endif
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/kernel.h>
+#include <asm/processor.h>
+#include <asm/ptrace.h>
+#include <asm/system.h>
+
+enum ckpt_cpu_feature {
+	CKPT_USED_FP,
+	CKPT_USED_DEBUG,
+	CKPT_USED_ALTIVEC,
+	CKPT_USED_SPE,
+	CKPT_USED_VSX,
+	CKPT_FTR_END = 31,
+};
+
+#define x(ftr) (1UL << (ftr))
+
+/* features this kernel can handle for restart */
+enum {
+	CKPT_FTRS_POSSIBLE =
+#ifdef CONFIG_PPC_FPU
+	x(CKPT_USED_FP) |
+#endif
+	x(CKPT_USED_DEBUG) |
+#ifdef CONFIG_ALTIVEC
+	x(CKPT_USED_ALTIVEC) |
+#endif
+#ifdef CONFIG_SPE
+	x(CKPT_USED_SPE) |
+#endif
+#ifdef CONFIG_VSX
+	x(CKPT_USED_VSX) |
+#endif
+	0,
+};
+
+#undef x
+
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	u32 features_used;
+	u32 pt_regs_size;
+	u32 fpr_size;
+	u64 orig_gpr3;
+	struct pt_regs pt_regs;
+	/* relevant fields from thread_struct */
+	double fpr[32][TS_FPRWIDTH];
+	u32 fpscr;
+	s32 fpexc_mode;
+	u64 dabr;
+	/* Altivec/VMX state */
+	vector128 vr[32];
+	vector128 vscr;
+	u64 vrsave;
+	/* SPE state */
+	u32 evr[32];
+	u64 acc;
+	u32 spefscr;
+};
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static void ckpt_cpu_feature_set(struct ckpt_hdr_cpu *hdr,
+				 enum ckpt_cpu_feature ftr)
+{
+	hdr->features_used |= 1ULL << ftr;
+}
+
+static bool ckpt_cpu_feature_isset(const struct ckpt_hdr_cpu *hdr,
+				 enum ckpt_cpu_feature ftr)
+{
+	return hdr->features_used & (1ULL << ftr);
+}
+
+/* determine whether an image has feature bits set that this kernel
+ * does not support */
+static bool ckpt_cpu_features_unknown(const struct ckpt_hdr_cpu *hdr)
+{
+	return hdr->features_used & ~CKPT_FTRS_POSSIBLE;
+}
+
+static void checkpoint_gprs(struct ckpt_hdr_cpu *cpu_hdr,
+			    struct task_struct *task)
+{
+	struct pt_regs *pt_regs;
+
+	pr_debug("%s: saving GPRs\n", __func__);
+
+	cpu_hdr->pt_regs_size = sizeof(*pt_regs);
+	pt_regs = task_pt_regs(task);
+	cpu_hdr->pt_regs = *pt_regs;
+
+	if (task == current)
+		cpu_hdr->pt_regs.gpr[3] = 0;
+
+	cpu_hdr->orig_gpr3 = pt_regs->orig_gpr3;
+}
+
+#ifdef CONFIG_PPC_FPU
+static void checkpoint_fpu(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	/* easiest to save FP state unconditionally */
+
+	pr_debug("%s: saving FPU state\n", __func__);
+
+	if (task == current)
+		flush_fp_to_thread(task);
+
+	cpu_hdr->fpr_size = sizeof(cpu_hdr->fpr);
+	cpu_hdr->fpscr = task->thread.fpscr.val;
+	cpu_hdr->fpexc_mode = task->thread.fpexc_mode;
+
+	memcpy(cpu_hdr->fpr, task->thread.fpr, sizeof(cpu_hdr->fpr));
+
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_FP);
+}
+#else
+static void checkpoint_fpu(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_ALTIVEC
+static void checkpoint_altivec(struct ckpt_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		return;
+
+	if (!task->thread.used_vr)
+		return;
+
+	pr_debug("%s: saving Altivec state\n", __func__);
+
+	if (task == current)
+		flush_altivec_to_thread(task);
+
+	cpu_hdr->vrsave = task->thread.vrsave;
+	memcpy(cpu_hdr->vr, task->thread.vr, sizeof(cpu_hdr->vr));
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_ALTIVEC);
+}
+#else
+static void checkpoint_altivec(struct ckpt_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static void checkpoint_spe(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		return;
+
+	if (!task->thread.used_spe)
+		return;
+
+	pr_debug("%s: saving SPE state\n", __func__);
+
+	if (task == current)
+		flush_spe_to_thread(task);
+
+	cpu_hdr->acc = task->thread.acc;
+	cpu_hdr->spefscr = task->thread.spefscr;
+	memcpy(cpu_hdr->evr, task->thread.evr, sizeof(cpu_hdr->evr));
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_SPE);
+}
+#else
+static void checkpoint_spe(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+static void checkpoint_dabr(struct ckpt_hdr_cpu *cpu_hdr,
+			    const struct task_struct *task)
+{
+	if (!task->thread.dabr)
+		return;
+
+	cpu_hdr->dabr = task->thread.dabr;
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_DEBUG);
+}
+
+/* dump the thread_struct of a given task */
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *cpu_hdr;
+	int rc;
+
+	rc = -ENOMEM;
+	cpu_hdr = ckpt_hdr_get_type(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
+	if (!cpu_hdr)
+		goto err;
+
+	checkpoint_gprs(cpu_hdr, t);
+	checkpoint_fpu(cpu_hdr, t);
+	checkpoint_dabr(cpu_hdr, t);
+	checkpoint_altivec(cpu_hdr, t);
+	checkpoint_spe(cpu_hdr, t);
+
+	rc = ckpt_write_obj(ctx, (struct ckpt_hdr *) cpu_hdr);
+err:
+	ckpt_hdr_put(ctx, cpu_hdr);
+	return rc;
+}
+
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *arch_hdr;
+	int ret;
+
+	arch_hdr = ckpt_hdr_get_type(ctx, sizeof(*arch_hdr),
+				     CKPT_HDR_HEADER_ARCH);
+	if (!arch_hdr)
+		return -ENOMEM;
+
+	arch_hdr->what = 0xdeadbeef;
+
+	ret = ckpt_write_obj(ctx, &arch_hdr->h);
+	ckpt_hdr_put(ctx, arch_hdr);
+
+	return ret;
+}
+
+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/* read the thread_struct into the current task */
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+/* Based on the MSR value from a checkpoint image, produce an MSR
+ * value that is appropriate for the restored task.  Right now we only
+ * check for MSR_SF (64-bit) for PPC64.
+ */
+static unsigned long sanitize_msr(unsigned long msr_ckpt)
+{
+#ifdef CONFIG_PPC32
+	return MSR_USER;
+#else
+	if (msr_ckpt & MSR_SF)
+		return MSR_USER64;
+	return MSR_USER32;
+#endif
+}
+
+static int restore_gprs(const struct ckpt_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	struct pt_regs *regs;
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->pt_regs_size != sizeof(*regs))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	regs = task_pt_regs(task);
+	*regs = cpu_hdr->pt_regs;
+
+	regs->orig_gpr3 = cpu_hdr->orig_gpr3;
+
+	regs->msr = sanitize_msr(regs->msr);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_PPC_FPU
+static int restore_fpu(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->fpr_size != sizeof(task->thread.fpr))
+		goto out;
+
+	rc = 0;
+	if (!update || !ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_FP))
+		goto out;
+
+	task->thread.fpscr.val = cpu_hdr->fpscr;
+	task->thread.fpexc_mode = cpu_hdr->fpexc_mode;
+
+	memcpy(task->thread.fpr, cpu_hdr->fpr, sizeof(task->thread.fpr));
+out:
+	return rc;
+}
+#else
+static int restore_fpu(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_FP));
+	return 0;
+}
+#endif
+
+static int restore_dabr(const struct ckpt_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_DEBUG))
+		goto out;
+
+	rc = -EINVAL;
+	if (!debugreg_valid(cpu_hdr->dabr, 0))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	debugreg_update(task, cpu_hdr->dabr, 0);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_ALTIVEC
+static int restore_altivec(const struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.vrsave = cpu_hdr->vrsave;
+	task->thread.used_vr = 1;
+
+	memcpy(task->thread.vr, cpu_hdr->vr, sizeof(cpu_hdr->vr));
+out:
+	return rc;
+}
+#else
+static int restore_altivec(const struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC));
+	return 0;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static int restore_spe(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.acc = cpu_hdr->acc;
+	task->thread.spefscr = cpu_hdr->spefscr;
+	task->thread.used_spe = 1;
+
+	memcpy(task->thread.evr, cpu_hdr->evr, sizeof(cpu_hdr->evr));
+out:
+	return rc;
+}
+#else
+static int restore_spe(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE));
+	return 0;
+}
+#endif
+
+struct restore_func_desc {
+	int (*func)(const struct ckpt_hdr_cpu *, struct task_struct *, bool);
+	const char *info;
+};
+
+typedef int (*restore_func_t)(const struct ckpt_hdr_cpu *,
+			      struct task_struct *, bool);
+
+static const restore_func_t restore_funcs[] = {
+	restore_gprs,
+	restore_fpu,
+	restore_dabr,
+	restore_altivec,
+	restore_spe,
+};
+
+static bool bitness_match(const struct ckpt_hdr_cpu *cpu_hdr,
+			  const struct task_struct *task)
+{
+	/* 64-bit image */
+	if (cpu_hdr->pt_regs.msr & MSR_SF) {
+		if (task->thread.regs->msr & MSR_SF)
+			return true;
+		else
+			return false;
+	}
+
+	/* 32-bit image */
+	if (task->thread.regs->msr & MSR_SF)
+		return false;
+
+	return true;
+}
+
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *cpu_hdr;
+	bool update;
+	int rc;
+	int i;
+
+	cpu_hdr = ckpt_read_obj_type(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
+	if (IS_ERR(cpu_hdr))
+		return PTR_ERR(cpu_hdr);
+
+	rc = -EINVAL;
+	if (ckpt_cpu_features_unknown(cpu_hdr))
+		goto err;
+
+	/* temporary: restoring a 32-bit image from a 64-bit task and
+	 * vice-versa is known not to work (probably not restoring
+	 * thread_info correctly); detect this and fail gracefully.
+	 */
+	if (!bitness_match(cpu_hdr, current))
+		goto err;
+
+	/* We want to determine whether there's anything wrong with
+	 * the checkpoint image before changing the task at all.  Run
+	 * a "check" phase (update = false) first.
+	 */
+	update = false;
+commit:
+	for (i = 0; i < ARRAY_SIZE(restore_funcs); i++) {
+		rc = restore_funcs[i](cpu_hdr, current, update);
+		if (rc == 0)
+			continue;
+		pr_debug("%s: restore_func[%i] failed\n", __func__, i);
+		WARN_ON_ONCE(update);
+		goto err;
+	}
+
+	if (!update) {
+		update = true;
+		goto commit;
+	}
+
+err:
+	ckpt_hdr_put(ctx, cpu_hdr);
+	return rc;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *arch_hdr;
+
+	arch_hdr = ckpt_read_obj_type(ctx, sizeof(*arch_hdr),
+				      CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(arch_hdr))
+		return PTR_ERR(arch_hdr);
+
+	ckpt_hdr_put(ctx, arch_hdr);
+
+	return 0;
+}
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 77/80] powerpc: checkpoint/restart implementation
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
  2009-09-23 23:50   ` [PATCH v18 01/80] c/r: extend arch_setup_additional_pages() Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:50   ` [PATCH v18 03/80] c/r: break out new_user_ns() Oren Laadan
                     ` (79 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

Support for checkpointing and restarting GPRs, FPU state, DABR, and
Altivec state.

The portion of the checkpoint image manipulated by this code begins
with a bitmask of features indicating the various contexts saved.
Fields in the image that can vary depending on kernel configuration
(e.g. FP regs due to VSX) have their sizes explicitly recorded, except
for the GPRs, so migrating between ppc32 and ppc64 won't work yet.

The restart code ensures that the task is not modified until the
checkpoint image is validated against the current kernel configuration
and hardware features (e.g. can't restart a task using Altivec on
non-Altivec systems).

What works:
* self and external checkpoint of simple (single thread, one open
  file) 32- and 64-bit processes on a ppc64 kernel

What doesn't work:
* restarting a 32-bit task from a 64-bit task and vice versa

Untested:
* ppc32 (but it builds)

Signed-off-by: Nathan Lynch <ntl@pobox.com>
[Oren Laadan <orenl@cs.columbia.edu>] Add arch-specific tty support
---
 arch/powerpc/include/asm/Kbuild           |    1 +
 arch/powerpc/include/asm/checkpoint_hdr.h |   37 ++
 arch/powerpc/mm/Makefile                  |    1 +
 arch/powerpc/mm/checkpoint.c              |  531 +++++++++++++++++++++++++++++
 4 files changed, 570 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h
 create mode 100644 arch/powerpc/mm/checkpoint.c

diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 5ab7d7f..20379f1 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -12,6 +12,7 @@ header-y += shmbuf.h
 header-y += socket.h
 header-y += termbits.h
 header-y += fcntl.h
+header-y += checkpoint_hdr.h
 header-y += poll.h
 header-y += sockios.h
 header-y += ucontext.h
diff --git a/arch/powerpc/include/asm/checkpoint_hdr.h b/arch/powerpc/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..fbb1705
--- /dev/null
+++ b/arch/powerpc/include/asm/checkpoint_hdr.h
@@ -0,0 +1,37 @@
+#ifndef __ASM_POWERPC_CKPT_HDR_H
+#define __ASM_POWERPC_CKPT_HDR_H
+
+#include <linux/types.h>
+
+/* arch dependent constants */
+#define CKPT_ARCH_NSIG 64
+#define CKPT_TTY_NCC  10
+
+#ifdef __KERNEL__
+
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _NSIG
+#error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
+#endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
+#endif
+
+#endif /* __KERNEL__ */
+
+#ifdef __KERNEL__
+#ifdef CONFIG_PPC64
+#define CKPT_ARCH_ID CKPT_ARCH_PPC64
+#else
+#define CKPT_ARCH_ID CKPT_ARCH_PPC32
+#endif
+#endif
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+	__u32 what;
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_POWERPC_CKPT_HDR_H */
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 3e68363..aa8733c 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_HIGHMEM)		+= highmem.o
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/arch/powerpc/mm/checkpoint.c b/arch/powerpc/mm/checkpoint.c
new file mode 100644
index 0000000..de18467
--- /dev/null
+++ b/arch/powerpc/mm/checkpoint.c
@@ -0,0 +1,531 @@
+/*
+ * PowerPC architecture support for checkpoint/restart.
+ * Based on x86 implementation.
+ *
+ * Copyright (C) 2008 Oren Laadan
+ * Copyright 2009 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ */
+
+#if 0
+#define DEBUG
+#endif
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/kernel.h>
+#include <asm/processor.h>
+#include <asm/ptrace.h>
+#include <asm/system.h>
+
+enum ckpt_cpu_feature {
+	CKPT_USED_FP,
+	CKPT_USED_DEBUG,
+	CKPT_USED_ALTIVEC,
+	CKPT_USED_SPE,
+	CKPT_USED_VSX,
+	CKPT_FTR_END = 31,
+};
+
+#define x(ftr) (1UL << (ftr))
+
+/* features this kernel can handle for restart */
+enum {
+	CKPT_FTRS_POSSIBLE =
+#ifdef CONFIG_PPC_FPU
+	x(CKPT_USED_FP) |
+#endif
+	x(CKPT_USED_DEBUG) |
+#ifdef CONFIG_ALTIVEC
+	x(CKPT_USED_ALTIVEC) |
+#endif
+#ifdef CONFIG_SPE
+	x(CKPT_USED_SPE) |
+#endif
+#ifdef CONFIG_VSX
+	x(CKPT_USED_VSX) |
+#endif
+	0,
+};
+
+#undef x
+
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	u32 features_used;
+	u32 pt_regs_size;
+	u32 fpr_size;
+	u64 orig_gpr3;
+	struct pt_regs pt_regs;
+	/* relevant fields from thread_struct */
+	double fpr[32][TS_FPRWIDTH];
+	u32 fpscr;
+	s32 fpexc_mode;
+	u64 dabr;
+	/* Altivec/VMX state */
+	vector128 vr[32];
+	vector128 vscr;
+	u64 vrsave;
+	/* SPE state */
+	u32 evr[32];
+	u64 acc;
+	u32 spefscr;
+};
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static void ckpt_cpu_feature_set(struct ckpt_hdr_cpu *hdr,
+				 enum ckpt_cpu_feature ftr)
+{
+	hdr->features_used |= 1ULL << ftr;
+}
+
+static bool ckpt_cpu_feature_isset(const struct ckpt_hdr_cpu *hdr,
+				 enum ckpt_cpu_feature ftr)
+{
+	return hdr->features_used & (1ULL << ftr);
+}
+
+/* determine whether an image has feature bits set that this kernel
+ * does not support */
+static bool ckpt_cpu_features_unknown(const struct ckpt_hdr_cpu *hdr)
+{
+	return hdr->features_used & ~CKPT_FTRS_POSSIBLE;
+}
+
+static void checkpoint_gprs(struct ckpt_hdr_cpu *cpu_hdr,
+			    struct task_struct *task)
+{
+	struct pt_regs *pt_regs;
+
+	pr_debug("%s: saving GPRs\n", __func__);
+
+	cpu_hdr->pt_regs_size = sizeof(*pt_regs);
+	pt_regs = task_pt_regs(task);
+	cpu_hdr->pt_regs = *pt_regs;
+
+	if (task == current)
+		cpu_hdr->pt_regs.gpr[3] = 0;
+
+	cpu_hdr->orig_gpr3 = pt_regs->orig_gpr3;
+}
+
+#ifdef CONFIG_PPC_FPU
+static void checkpoint_fpu(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	/* easiest to save FP state unconditionally */
+
+	pr_debug("%s: saving FPU state\n", __func__);
+
+	if (task == current)
+		flush_fp_to_thread(task);
+
+	cpu_hdr->fpr_size = sizeof(cpu_hdr->fpr);
+	cpu_hdr->fpscr = task->thread.fpscr.val;
+	cpu_hdr->fpexc_mode = task->thread.fpexc_mode;
+
+	memcpy(cpu_hdr->fpr, task->thread.fpr, sizeof(cpu_hdr->fpr));
+
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_FP);
+}
+#else
+static void checkpoint_fpu(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_ALTIVEC
+static void checkpoint_altivec(struct ckpt_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		return;
+
+	if (!task->thread.used_vr)
+		return;
+
+	pr_debug("%s: saving Altivec state\n", __func__);
+
+	if (task == current)
+		flush_altivec_to_thread(task);
+
+	cpu_hdr->vrsave = task->thread.vrsave;
+	memcpy(cpu_hdr->vr, task->thread.vr, sizeof(cpu_hdr->vr));
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_ALTIVEC);
+}
+#else
+static void checkpoint_altivec(struct ckpt_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static void checkpoint_spe(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		return;
+
+	if (!task->thread.used_spe)
+		return;
+
+	pr_debug("%s: saving SPE state\n", __func__);
+
+	if (task == current)
+		flush_spe_to_thread(task);
+
+	cpu_hdr->acc = task->thread.acc;
+	cpu_hdr->spefscr = task->thread.spefscr;
+	memcpy(cpu_hdr->evr, task->thread.evr, sizeof(cpu_hdr->evr));
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_SPE);
+}
+#else
+static void checkpoint_spe(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+static void checkpoint_dabr(struct ckpt_hdr_cpu *cpu_hdr,
+			    const struct task_struct *task)
+{
+	if (!task->thread.dabr)
+		return;
+
+	cpu_hdr->dabr = task->thread.dabr;
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_DEBUG);
+}
+
+/* dump the thread_struct of a given task */
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *cpu_hdr;
+	int rc;
+
+	rc = -ENOMEM;
+	cpu_hdr = ckpt_hdr_get_type(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
+	if (!cpu_hdr)
+		goto err;
+
+	checkpoint_gprs(cpu_hdr, t);
+	checkpoint_fpu(cpu_hdr, t);
+	checkpoint_dabr(cpu_hdr, t);
+	checkpoint_altivec(cpu_hdr, t);
+	checkpoint_spe(cpu_hdr, t);
+
+	rc = ckpt_write_obj(ctx, (struct ckpt_hdr *) cpu_hdr);
+err:
+	ckpt_hdr_put(ctx, cpu_hdr);
+	return rc;
+}
+
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *arch_hdr;
+	int ret;
+
+	arch_hdr = ckpt_hdr_get_type(ctx, sizeof(*arch_hdr),
+				     CKPT_HDR_HEADER_ARCH);
+	if (!arch_hdr)
+		return -ENOMEM;
+
+	arch_hdr->what = 0xdeadbeef;
+
+	ret = ckpt_write_obj(ctx, &arch_hdr->h);
+	ckpt_hdr_put(ctx, arch_hdr);
+
+	return ret;
+}
+
+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/* read the thread_struct into the current task */
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+/* Based on the MSR value from a checkpoint image, produce an MSR
+ * value that is appropriate for the restored task.  Right now we only
+ * check for MSR_SF (64-bit) for PPC64.
+ */
+static unsigned long sanitize_msr(unsigned long msr_ckpt)
+{
+#ifdef CONFIG_PPC32
+	return MSR_USER;
+#else
+	if (msr_ckpt & MSR_SF)
+		return MSR_USER64;
+	return MSR_USER32;
+#endif
+}
+
+static int restore_gprs(const struct ckpt_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	struct pt_regs *regs;
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->pt_regs_size != sizeof(*regs))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	regs = task_pt_regs(task);
+	*regs = cpu_hdr->pt_regs;
+
+	regs->orig_gpr3 = cpu_hdr->orig_gpr3;
+
+	regs->msr = sanitize_msr(regs->msr);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_PPC_FPU
+static int restore_fpu(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->fpr_size != sizeof(task->thread.fpr))
+		goto out;
+
+	rc = 0;
+	if (!update || !ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_FP))
+		goto out;
+
+	task->thread.fpscr.val = cpu_hdr->fpscr;
+	task->thread.fpexc_mode = cpu_hdr->fpexc_mode;
+
+	memcpy(task->thread.fpr, cpu_hdr->fpr, sizeof(task->thread.fpr));
+out:
+	return rc;
+}
+#else
+static int restore_fpu(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_FP));
+	return 0;
+}
+#endif
+
+static int restore_dabr(const struct ckpt_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_DEBUG))
+		goto out;
+
+	rc = -EINVAL;
+	if (!debugreg_valid(cpu_hdr->dabr, 0))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	debugreg_update(task, cpu_hdr->dabr, 0);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_ALTIVEC
+static int restore_altivec(const struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.vrsave = cpu_hdr->vrsave;
+	task->thread.used_vr = 1;
+
+	memcpy(task->thread.vr, cpu_hdr->vr, sizeof(cpu_hdr->vr));
+out:
+	return rc;
+}
+#else
+static int restore_altivec(const struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC));
+	return 0;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static int restore_spe(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.acc = cpu_hdr->acc;
+	task->thread.spefscr = cpu_hdr->spefscr;
+	task->thread.used_spe = 1;
+
+	memcpy(task->thread.evr, cpu_hdr->evr, sizeof(cpu_hdr->evr));
+out:
+	return rc;
+}
+#else
+static int restore_spe(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE));
+	return 0;
+}
+#endif
+
+struct restore_func_desc {
+	int (*func)(const struct ckpt_hdr_cpu *, struct task_struct *, bool);
+	const char *info;
+};
+
+typedef int (*restore_func_t)(const struct ckpt_hdr_cpu *,
+			      struct task_struct *, bool);
+
+static const restore_func_t restore_funcs[] = {
+	restore_gprs,
+	restore_fpu,
+	restore_dabr,
+	restore_altivec,
+	restore_spe,
+};
+
+static bool bitness_match(const struct ckpt_hdr_cpu *cpu_hdr,
+			  const struct task_struct *task)
+{
+	/* 64-bit image */
+	if (cpu_hdr->pt_regs.msr & MSR_SF) {
+		if (task->thread.regs->msr & MSR_SF)
+			return true;
+		else
+			return false;
+	}
+
+	/* 32-bit image */
+	if (task->thread.regs->msr & MSR_SF)
+		return false;
+
+	return true;
+}
+
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *cpu_hdr;
+	bool update;
+	int rc;
+	int i;
+
+	cpu_hdr = ckpt_read_obj_type(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
+	if (IS_ERR(cpu_hdr))
+		return PTR_ERR(cpu_hdr);
+
+	rc = -EINVAL;
+	if (ckpt_cpu_features_unknown(cpu_hdr))
+		goto err;
+
+	/* temporary: restoring a 32-bit image from a 64-bit task and
+	 * vice-versa is known not to work (probably not restoring
+	 * thread_info correctly); detect this and fail gracefully.
+	 */
+	if (!bitness_match(cpu_hdr, current))
+		goto err;
+
+	/* We want to determine whether there's anything wrong with
+	 * the checkpoint image before changing the task at all.  Run
+	 * a "check" phase (update = false) first.
+	 */
+	update = false;
+commit:
+	for (i = 0; i < ARRAY_SIZE(restore_funcs); i++) {
+		rc = restore_funcs[i](cpu_hdr, current, update);
+		if (rc == 0)
+			continue;
+		pr_debug("%s: restore_func[%i] failed\n", __func__, i);
+		WARN_ON_ONCE(update);
+		goto err;
+	}
+
+	if (!update) {
+		update = true;
+		goto commit;
+	}
+
+err:
+	ckpt_hdr_put(ctx, cpu_hdr);
+	return rc;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *arch_hdr;
+
+	arch_hdr = ckpt_read_obj_type(ctx, sizeof(*arch_hdr),
+				      CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(arch_hdr))
+		return PTR_ERR(arch_hdr);
+
+	ckpt_hdr_put(ctx, arch_hdr);
+
+	return 0;
+}
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 77/80] powerpc: checkpoint/restart implementation
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Serge Hallyn, Ingo Molnar,
	Pavel Emelyanov, Nathan Lynch

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

Support for checkpointing and restarting GPRs, FPU state, DABR, and
Altivec state.

The portion of the checkpoint image manipulated by this code begins
with a bitmask of features indicating the various contexts saved.
Fields in the image that can vary depending on kernel configuration
(e.g. FP regs due to VSX) have their sizes explicitly recorded, except
for GPRs, so migrating between ppc32 and ppc64 won't work yet.
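
As a rough userspace illustration of the feature bitmask described above
(not the kernel code itself), the sketch below records which contexts an
image contains and rejects images that use features the running build
cannot handle. All demo_* and DEMO_* names are hypothetical, and the
"supported" set is chosen arbitrarily to exclude Altivec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits, modelled on the idea of enum ckpt_cpu_feature. */
enum demo_feature {
	DEMO_USED_FP,
	DEMO_USED_DEBUG,
	DEMO_USED_ALTIVEC,
	DEMO_FTR_END = 31,	/* highest bit that fits in the 32-bit mask */
};

/* Assumption for this sketch: the build supports FP and debug, not Altivec. */
#define DEMO_FTRS_POSSIBLE \
	((UINT32_C(1) << DEMO_USED_FP) | (UINT32_C(1) << DEMO_USED_DEBUG))

struct demo_hdr {
	uint32_t features_used;
};

/* Mark a context as present in the image. */
static void demo_feature_set(struct demo_hdr *hdr, enum demo_feature ftr)
{
	assert((int)ftr >= 0 && (int)ftr <= DEMO_FTR_END);
	hdr->features_used |= UINT32_C(1) << ftr;
}

/* Test whether a context is present in the image. */
static bool demo_feature_isset(const struct demo_hdr *hdr,
			       enum demo_feature ftr)
{
	assert((int)ftr >= 0 && (int)ftr <= DEMO_FTR_END);
	return (hdr->features_used & (UINT32_C(1) << ftr)) != 0;
}

/* True if the image uses any feature outside the supported set. */
static bool demo_features_unknown(const struct demo_hdr *hdr)
{
	return (hdr->features_used & ~DEMO_FTRS_POSSIBLE) != 0;
}
```

A restart path built this way would call demo_features_unknown() once,
before looking at any per-feature payload.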

The restart code ensures that the task is not modified until the
checkpoint image is validated against the current kernel configuration
and hardware features (e.g. can't restart a task using Altivec on
non-Altivec systems).
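
The validate-before-modify behaviour described above can be sketched in
plain C as two passes over a table of restore helpers: the first pass
only checks, the second applies. This is a simplified userspace model
under stated assumptions; struct demo_image, struct demo_task and the
demo_* helpers are hypothetical stand-ins, not the kernel structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checkpoint image and task state. */
struct demo_image {
	uint32_t regs_size;	/* size recorded by the writer */
	uint32_t regs[4];
	bool uses_vec;		/* image contains vector state */
};

struct demo_task {
	uint32_t regs[4];
	bool has_vec;		/* stand-in for a hardware feature test */
	bool used_vec;
};

typedef int (*demo_restore_fn)(const struct demo_image *,
			       struct demo_task *, bool);

static int demo_restore_regs(const struct demo_image *img,
			     struct demo_task *task, bool update)
{
	size_t i;

	if (img->regs_size != sizeof(task->regs))
		return -EINVAL;
	if (!update)
		return 0;	/* check phase: leave the task alone */
	for (i = 0; i < 4; i++)
		task->regs[i] = img->regs[i];
	return 0;
}

static int demo_restore_vec(const struct demo_image *img,
			    struct demo_task *task, bool update)
{
	if (!img->uses_vec)
		return 0;
	if (!task->has_vec)
		return -EINVAL;	/* image needs a feature we lack */
	if (update)
		task->used_vec = true;
	return 0;
}

static const demo_restore_fn demo_restore_fns[] = {
	demo_restore_regs,
	demo_restore_vec,
};

/* Pass 0 validates every helper; pass 1 modifies the task. */
static int demo_restore(const struct demo_image *img, struct demo_task *task)
{
	const size_t n = sizeof(demo_restore_fns) / sizeof(demo_restore_fns[0]);
	size_t i;
	int pass, rc;

	assert(img != NULL && task != NULL);
	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < n; i++) {
			rc = demo_restore_fns[i](img, task, pass == 1);
			if (rc != 0)
				return rc;
		}
	}
	return 0;
}
```

Because every helper is run with update == false first, a failure in a
later helper cannot leave the task half restored by an earlier one.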

What works:
* self and external checkpoint of simple (single thread, one open
  file) 32- and 64-bit processes on a ppc64 kernel

What doesn't work:
* restarting a 32-bit task from a 64-bit task and vice versa

Untested:
* ppc32 (but it builds)

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
[Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>] Add arch-specific tty support
---
 arch/powerpc/include/asm/Kbuild           |    1 +
 arch/powerpc/include/asm/checkpoint_hdr.h |   37 ++
 arch/powerpc/mm/Makefile                  |    1 +
 arch/powerpc/mm/checkpoint.c              |  531 +++++++++++++++++++++++++++++
 4 files changed, 570 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h
 create mode 100644 arch/powerpc/mm/checkpoint.c

diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 5ab7d7f..20379f1 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -12,6 +12,7 @@ header-y += shmbuf.h
 header-y += socket.h
 header-y += termbits.h
 header-y += fcntl.h
+header-y += checkpoint_hdr.h
 header-y += poll.h
 header-y += sockios.h
 header-y += ucontext.h
diff --git a/arch/powerpc/include/asm/checkpoint_hdr.h b/arch/powerpc/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..fbb1705
--- /dev/null
+++ b/arch/powerpc/include/asm/checkpoint_hdr.h
@@ -0,0 +1,37 @@
+#ifndef __ASM_POWERPC_CKPT_HDR_H
+#define __ASM_POWERPC_CKPT_HDR_H
+
+#include <linux/types.h>
+
+/* arch dependent constants */
+#define CKPT_ARCH_NSIG 64
+#define CKPT_TTY_NCC  10
+
+#ifdef __KERNEL__
+
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _NSIG
+#error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
+#endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
+#endif
+
+#endif /* __KERNEL__ */
+
+#ifdef __KERNEL__
+#ifdef CONFIG_PPC64
+#define CKPT_ARCH_ID CKPT_ARCH_PPC64
+#else
+#define CKPT_ARCH_ID CKPT_ARCH_PPC32
+#endif
+#endif
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+	__u32 what;
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_POWERPC_CKPT_HDR_H */
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 3e68363..aa8733c 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_HIGHMEM)		+= highmem.o
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/arch/powerpc/mm/checkpoint.c b/arch/powerpc/mm/checkpoint.c
new file mode 100644
index 0000000..de18467
--- /dev/null
+++ b/arch/powerpc/mm/checkpoint.c
@@ -0,0 +1,531 @@
+/*
+ * PowerPC architecture support for checkpoint/restart.
+ * Based on x86 implementation.
+ *
+ * Copyright (C) 2008 Oren Laadan
+ * Copyright 2009 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ */
+
+#if 0
+#define DEBUG
+#endif
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/kernel.h>
+#include <asm/processor.h>
+#include <asm/ptrace.h>
+#include <asm/system.h>
+
+enum ckpt_cpu_feature {
+	CKPT_USED_FP,
+	CKPT_USED_DEBUG,
+	CKPT_USED_ALTIVEC,
+	CKPT_USED_SPE,
+	CKPT_USED_VSX,
+	CKPT_FTR_END = 31,
+};
+
+#define x(ftr) (1UL << ftr)
+
+/* features this kernel can handle for restart */
+enum {
+	CKPT_FTRS_POSSIBLE =
+#ifdef CONFIG_PPC_FPU
+	x(CKPT_USED_FP) |
+#endif
+	x(CKPT_USED_DEBUG) |
+#ifdef CONFIG_ALTIVEC
+	x(CKPT_USED_ALTIVEC) |
+#endif
+#ifdef CONFIG_SPE
+	x(CKPT_USED_SPE) |
+#endif
+#ifdef CONFIG_VSX
+	x(CKPT_USED_VSX) |
+#endif
+	0,
+};
+
+#undef x
+
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	u32 features_used;
+	u32 pt_regs_size;
+	u32 fpr_size;
+	u64 orig_gpr3;
+	struct pt_regs pt_regs;
+	/* relevant fields from thread_struct */
+	double fpr[32][TS_FPRWIDTH];
+	u32 fpscr;
+	s32 fpexc_mode;
+	u64 dabr;
+	/* Altivec/VMX state */
+	vector128 vr[32];
+	vector128 vscr;
+	u64 vrsave;
+	/* SPE state */
+	u32 evr[32];
+	u64 acc;
+	u32 spefscr;
+};
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static void ckpt_cpu_feature_set(struct ckpt_hdr_cpu *hdr,
+				 enum ckpt_cpu_feature ftr)
+{
+	hdr->features_used |= 1ULL << ftr;
+}
+
+static bool ckpt_cpu_feature_isset(const struct ckpt_hdr_cpu *hdr,
+				 enum ckpt_cpu_feature ftr)
+{
+	return hdr->features_used & (1ULL << ftr);
+}
+
+/* determine whether an image has feature bits set that this kernel
+ * does not support */
+static bool ckpt_cpu_features_unknown(const struct ckpt_hdr_cpu *hdr)
+{
+	return hdr->features_used & ~CKPT_FTRS_POSSIBLE;
+}
+
+static void checkpoint_gprs(struct ckpt_hdr_cpu *cpu_hdr,
+			    struct task_struct *task)
+{
+	struct pt_regs *pt_regs;
+
+	pr_debug("%s: saving GPRs\n", __func__);
+
+	cpu_hdr->pt_regs_size = sizeof(*pt_regs);
+	pt_regs = task_pt_regs(task);
+	cpu_hdr->pt_regs = *pt_regs;
+
+	if (task == current)
+		cpu_hdr->pt_regs.gpr[3] = 0;
+
+	cpu_hdr->orig_gpr3 = pt_regs->orig_gpr3;
+}
+
+#ifdef CONFIG_PPC_FPU
+static void checkpoint_fpu(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	/* easiest to save FP state unconditionally */
+
+	pr_debug("%s: saving FPU state\n", __func__);
+
+	if (task == current)
+		flush_fp_to_thread(task);
+
+	cpu_hdr->fpr_size = sizeof(cpu_hdr->fpr);
+	cpu_hdr->fpscr = task->thread.fpscr.val;
+	cpu_hdr->fpexc_mode = task->thread.fpexc_mode;
+
+	memcpy(cpu_hdr->fpr, task->thread.fpr, sizeof(cpu_hdr->fpr));
+
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_FP);
+}
+#else
+static void checkpoint_fpu(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_ALTIVEC
+static void checkpoint_altivec(struct ckpt_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		return;
+
+	if (!task->thread.used_vr)
+		return;
+
+	pr_debug("%s: saving Altivec state\n", __func__);
+
+	if (task == current)
+		flush_altivec_to_thread(task);
+
+	cpu_hdr->vrsave = task->thread.vrsave;
+	memcpy(cpu_hdr->vr, task->thread.vr, sizeof(cpu_hdr->vr));
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_ALTIVEC);
+}
+#else
+static void checkpoint_altivec(struct ckpt_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static void checkpoint_spe(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		return;
+
+	if (!task->thread.used_spe)
+		return;
+
+	pr_debug("%s: saving SPE state\n", __func__);
+
+	if (task == current)
+		flush_spe_to_thread(task);
+
+	cpu_hdr->acc = task->thread.acc;
+	cpu_hdr->spefscr = task->thread.spefscr;
+	memcpy(cpu_hdr->evr, task->thread.evr, sizeof(cpu_hdr->evr));
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_SPE);
+}
+#else
+static void checkpoint_spe(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+static void checkpoint_dabr(struct ckpt_hdr_cpu *cpu_hdr,
+			    const struct task_struct *task)
+{
+	if (!task->thread.dabr)
+		return;
+
+	cpu_hdr->dabr = task->thread.dabr;
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_DEBUG);
+}
+
+/* dump the thread_struct of a given task */
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *cpu_hdr;
+	int rc;
+
+	rc = -ENOMEM;
+	cpu_hdr = ckpt_hdr_get_type(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
+	if (!cpu_hdr)
+		goto err;
+
+	checkpoint_gprs(cpu_hdr, t);
+	checkpoint_fpu(cpu_hdr, t);
+	checkpoint_dabr(cpu_hdr, t);
+	checkpoint_altivec(cpu_hdr, t);
+	checkpoint_spe(cpu_hdr, t);
+
+	rc = ckpt_write_obj(ctx, (struct ckpt_hdr *) cpu_hdr);
+err:
+	ckpt_hdr_put(ctx, cpu_hdr);
+	return rc;
+}
+
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *arch_hdr;
+	int ret;
+
+	arch_hdr = ckpt_hdr_get_type(ctx, sizeof(*arch_hdr),
+				     CKPT_HDR_HEADER_ARCH);
+	if (!arch_hdr)
+		return -ENOMEM;
+
+	arch_hdr->what = 0xdeadbeef;
+
+	ret = ckpt_write_obj(ctx, &arch_hdr->h);
+	ckpt_hdr_put(ctx, arch_hdr);
+
+	return ret;
+}
+
+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/* read the thread_struct into the current task */
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+/* Based on the MSR value from a checkpoint image, produce an MSR
+ * value that is appropriate for the restored task.  Right now we only
+ * check for MSR_SF (64-bit) for PPC64.
+ */
+static unsigned long sanitize_msr(unsigned long msr_ckpt)
+{
+#ifdef CONFIG_PPC32
+	return MSR_USER;
+#else
+	if (msr_ckpt & MSR_SF)
+		return MSR_USER64;
+	return MSR_USER32;
+#endif
+}
+
+static int restore_gprs(const struct ckpt_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	struct pt_regs *regs;
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->pt_regs_size != sizeof(*regs))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	regs = task_pt_regs(task);
+	*regs = cpu_hdr->pt_regs;
+
+	regs->orig_gpr3 = cpu_hdr->orig_gpr3;
+
+	regs->msr = sanitize_msr(regs->msr);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_PPC_FPU
+static int restore_fpu(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->fpr_size != sizeof(task->thread.fpr))
+		goto out;
+
+	rc = 0;
+	if (!update || !ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_FP))
+		goto out;
+
+	task->thread.fpscr.val = cpu_hdr->fpscr;
+	task->thread.fpexc_mode = cpu_hdr->fpexc_mode;
+
+	memcpy(task->thread.fpr, cpu_hdr->fpr, sizeof(task->thread.fpr));
+out:
+	return rc;
+}
+#else
+static int restore_fpu(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_FP));
+	return 0;
+}
+#endif
+
+static int restore_dabr(const struct ckpt_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_DEBUG))
+		goto out;
+
+	rc = -EINVAL;
+	if (!debugreg_valid(cpu_hdr->dabr, 0))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	debugreg_update(task, cpu_hdr->dabr, 0);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_ALTIVEC
+static int restore_altivec(const struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.vrsave = cpu_hdr->vrsave;
+	task->thread.used_vr = 1;
+
+	memcpy(task->thread.vr, cpu_hdr->vr, sizeof(cpu_hdr->vr));
+out:
+	return rc;
+}
+#else
+static int restore_altivec(const struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC));
+	return 0;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static int restore_spe(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.acc = cpu_hdr->acc;
+	task->thread.spefscr = cpu_hdr->spefscr;
+	task->thread.used_spe = 1;
+
+	memcpy(task->thread.evr, cpu_hdr->evr, sizeof(cpu_hdr->evr));
+out:
+	return rc;
+}
+#else
+static int restore_spe(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE));
+	return 0;
+}
+#endif
+
+struct restore_func_desc {
+	int (*func)(const struct ckpt_hdr_cpu *, struct task_struct *, bool);
+	const char *info;
+};
+
+typedef int (*restore_func_t)(const struct ckpt_hdr_cpu *,
+			      struct task_struct *, bool);
+
+static const restore_func_t restore_funcs[] = {
+	restore_gprs,
+	restore_fpu,
+	restore_dabr,
+	restore_altivec,
+	restore_spe,
+};
+
+static bool bitness_match(const struct ckpt_hdr_cpu *cpu_hdr,
+			  const struct task_struct *task)
+{
+	/* 64-bit image */
+	if (cpu_hdr->pt_regs.msr & MSR_SF) {
+		if (task->thread.regs->msr & MSR_SF)
+			return true;
+		else
+			return false;
+	}
+
+	/* 32-bit image */
+	if (task->thread.regs->msr & MSR_SF)
+		return false;
+
+	return true;
+}
+
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *cpu_hdr;
+	bool update;
+	int rc;
+	int i;
+
+	cpu_hdr = ckpt_read_obj_type(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
+	if (IS_ERR(cpu_hdr))
+		return PTR_ERR(cpu_hdr);
+
+	rc = -EINVAL;
+	if (ckpt_cpu_features_unknown(cpu_hdr))
+		goto err;
+
+	/* temporary: restoring a 32-bit image from a 64-bit task and
+	 * vice-versa is known not to work (probably not restoring
+	 * thread_info correctly); detect this and fail gracefully.
+	 */
+	if (!bitness_match(cpu_hdr, current))
+		goto err;
+
+	/* We want to determine whether there's anything wrong with
+	 * the checkpoint image before changing the task at all.  Run
+	 * a "check" phase (update = false) first.
+	 */
+	update = false;
+commit:
+	for (i = 0; i < ARRAY_SIZE(restore_funcs); i++) {
+		rc = restore_funcs[i](cpu_hdr, current, update);
+		if (rc == 0)
+			continue;
+		pr_debug("%s: restore_func[%i] failed\n", __func__, i);
+		WARN_ON_ONCE(update);
+		goto err;
+	}
+
+	if (!update) {
+		update = true;
+		goto commit;
+	}
+
+err:
+	ckpt_hdr_put(ctx, cpu_hdr);
+	return rc;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *arch_hdr;
+
+	arch_hdr = ckpt_read_obj_type(ctx, sizeof(*arch_hdr),
+				      CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(arch_hdr))
+		return PTR_ERR(arch_hdr);
+
+	ckpt_hdr_put(ctx, arch_hdr);
+
+	return 0;
+}
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 77/80] powerpc: checkpoint/restart implementation
@ 2009-09-23 23:51   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

Support for checkpointing and restarting GPRs, FPU state, DABR, and
Altivec state.

The portion of the checkpoint image manipulated by this code begins
with a bitmask of features indicating the various contexts saved.
Fields in the image that can vary depending on kernel configuration
(e.g. FP regs due to VSX) have their sizes explicitly recorded, except
for GPRs, so migrating between ppc32 and ppc64 won't work yet.

The restart code ensures that the task is not modified until the
checkpoint image is validated against the current kernel configuration
and hardware features (e.g. can't restart a task using Altivec on
non-Altivec systems).

What works:
* self and external checkpoint of simple (single thread, one open
  file) 32- and 64-bit processes on a ppc64 kernel

What doesn't work:
* restarting a 32-bit task from a 64-bit task and vice versa
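
For illustration only, the 32-/64-bit mismatch check that guards this
case can be modelled in userspace as a comparison of the SF ("sixty-four
bit mode") flag in two MSR values. DEMO_MSR_SF and the demo_* functions
are hypothetical names; the sketch assumes SF is the most significant
bit of a 64-bit MSR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: SF is the top bit of the 64-bit MSR. */
#define DEMO_MSR_SF (UINT64_C(1) << 63)

/* True when the image and the restarting task agree on 32- vs 64-bit mode. */
static bool demo_bitness_match(uint64_t image_msr, uint64_t task_msr)
{
	return ((image_msr ^ task_msr) & DEMO_MSR_SF) == 0;
}

/* A few self-checks showing the intended behaviour. */
static void demo_bitness_examples(void)
{
	assert(demo_bitness_match(DEMO_MSR_SF, DEMO_MSR_SF));	/* 64 on 64 */
	assert(demo_bitness_match(0, 0));			/* 32 on 32 */
	assert(!demo_bitness_match(DEMO_MSR_SF, 0));		/* 64 on 32 */
	assert(!demo_bitness_match(0, DEMO_MSR_SF));		/* 32 on 64 */
}
```

Only the SF bit is compared, so other MSR bits may differ freely between
the image and the restarting task.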

Untested:
* ppc32 (but it builds)

Signed-off-by: Nathan Lynch <ntl@pobox.com>
[Oren Laadan <orenl@cs.columbia.edu>] Add arch-specific tty support
---
 arch/powerpc/include/asm/Kbuild           |    1 +
 arch/powerpc/include/asm/checkpoint_hdr.h |   37 ++
 arch/powerpc/mm/Makefile                  |    1 +
 arch/powerpc/mm/checkpoint.c              |  531 +++++++++++++++++++++++++++++
 4 files changed, 570 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h
 create mode 100644 arch/powerpc/mm/checkpoint.c

diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 5ab7d7f..20379f1 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -12,6 +12,7 @@ header-y += shmbuf.h
 header-y += socket.h
 header-y += termbits.h
 header-y += fcntl.h
+header-y += checkpoint_hdr.h
 header-y += poll.h
 header-y += sockios.h
 header-y += ucontext.h
diff --git a/arch/powerpc/include/asm/checkpoint_hdr.h b/arch/powerpc/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..fbb1705
--- /dev/null
+++ b/arch/powerpc/include/asm/checkpoint_hdr.h
@@ -0,0 +1,37 @@
+#ifndef __ASM_POWERPC_CKPT_HDR_H
+#define __ASM_POWERPC_CKPT_HDR_H
+
+#include <linux/types.h>
+
+/* arch dependent constants */
+#define CKPT_ARCH_NSIG 64
+#define CKPT_TTY_NCC  10
+
+#ifdef __KERNEL__
+
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _NSIG
+#error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
+#endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
+#endif
+
+#endif /* __KERNEL__ */
+
+#ifdef __KERNEL__
+#ifdef CONFIG_PPC64
+#define CKPT_ARCH_ID CKPT_ARCH_PPC64
+#else
+#define CKPT_ARCH_ID CKPT_ARCH_PPC32
+#endif
+#endif
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+	__u32 what;
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_POWERPC_CKPT_HDR_H */
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 3e68363..aa8733c 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_HIGHMEM)		+= highmem.o
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
diff --git a/arch/powerpc/mm/checkpoint.c b/arch/powerpc/mm/checkpoint.c
new file mode 100644
index 0000000..de18467
--- /dev/null
+++ b/arch/powerpc/mm/checkpoint.c
@@ -0,0 +1,531 @@
+/*
+ * PowerPC architecture support for checkpoint/restart.
+ * Based on x86 implementation.
+ *
+ * Copyright (C) 2008 Oren Laadan
+ * Copyright 2009 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ */
+
+#if 0
+#define DEBUG
+#endif
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+#include <linux/kernel.h>
+#include <asm/processor.h>
+#include <asm/ptrace.h>
+#include <asm/system.h>
+
+enum ckpt_cpu_feature {
+	CKPT_USED_FP,
+	CKPT_USED_DEBUG,
+	CKPT_USED_ALTIVEC,
+	CKPT_USED_SPE,
+	CKPT_USED_VSX,
+	CKPT_FTR_END = 31,
+};
+
+#define x(ftr) (1UL << ftr)
+
+/* features this kernel can handle for restart */
+enum {
+	CKPT_FTRS_POSSIBLE =
+#ifdef CONFIG_PPC_FPU
+	x(CKPT_USED_FP) |
+#endif
+	x(CKPT_USED_DEBUG) |
+#ifdef CONFIG_ALTIVEC
+	x(CKPT_USED_ALTIVEC) |
+#endif
+#ifdef CONFIG_SPE
+	x(CKPT_USED_SPE) |
+#endif
+#ifdef CONFIG_VSX
+	x(CKPT_USED_VSX) |
+#endif
+	0,
+};
+
+#undef x
+
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	u32 features_used;
+	u32 pt_regs_size;
+	u32 fpr_size;
+	u64 orig_gpr3;
+	struct pt_regs pt_regs;
+	/* relevant fields from thread_struct */
+	double fpr[32][TS_FPRWIDTH];
+	u32 fpscr;
+	s32 fpexc_mode;
+	u64 dabr;
+	/* Altivec/VMX state */
+	vector128 vr[32];
+	vector128 vscr;
+	u64 vrsave;
+	/* SPE state */
+	u32 evr[32];
+	u64 acc;
+	u32 spefscr;
+};
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static void ckpt_cpu_feature_set(struct ckpt_hdr_cpu *hdr,
+				 enum ckpt_cpu_feature ftr)
+{
+	hdr->features_used |= 1ULL << ftr;
+}
+
+static bool ckpt_cpu_feature_isset(const struct ckpt_hdr_cpu *hdr,
+				 enum ckpt_cpu_feature ftr)
+{
+	return hdr->features_used & (1ULL << ftr);
+}
+
+/* determine whether an image has feature bits set that this kernel
+ * does not support */
+static bool ckpt_cpu_features_unknown(const struct ckpt_hdr_cpu *hdr)
+{
+	return hdr->features_used & ~CKPT_FTRS_POSSIBLE;
+}
+
+static void checkpoint_gprs(struct ckpt_hdr_cpu *cpu_hdr,
+			    struct task_struct *task)
+{
+	struct pt_regs *pt_regs;
+
+	pr_debug("%s: saving GPRs\n", __func__);
+
+	cpu_hdr->pt_regs_size = sizeof(*pt_regs);
+	pt_regs = task_pt_regs(task);
+	cpu_hdr->pt_regs = *pt_regs;
+
+	if (task == current)
+		cpu_hdr->pt_regs.gpr[3] = 0;
+
+	cpu_hdr->orig_gpr3 = pt_regs->orig_gpr3;
+}
+
+#ifdef CONFIG_PPC_FPU
+static void checkpoint_fpu(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	/* easiest to save FP state unconditionally */
+
+	pr_debug("%s: saving FPU state\n", __func__);
+
+	if (task == current)
+		flush_fp_to_thread(task);
+
+	cpu_hdr->fpr_size = sizeof(cpu_hdr->fpr);
+	cpu_hdr->fpscr = task->thread.fpscr.val;
+	cpu_hdr->fpexc_mode = task->thread.fpexc_mode;
+
+	memcpy(cpu_hdr->fpr, task->thread.fpr, sizeof(cpu_hdr->fpr));
+
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_FP);
+}
+#else
+static void checkpoint_fpu(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_ALTIVEC
+static void checkpoint_altivec(struct ckpt_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		return;
+
+	if (!task->thread.used_vr)
+		return;
+
+	pr_debug("%s: saving Altivec state\n", __func__);
+
+	if (task == current)
+		flush_altivec_to_thread(task);
+
+	cpu_hdr->vrsave = task->thread.vrsave;
+	memcpy(cpu_hdr->vr, task->thread.vr, sizeof(cpu_hdr->vr));
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_ALTIVEC);
+}
+#else
+static void checkpoint_altivec(struct ckpt_hdr_cpu *cpu_hdr,
+			       struct task_struct *task)
+{
+	return;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static void checkpoint_spe(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		return;
+
+	if (!task->thread.used_spe)
+		return;
+
+	pr_debug("%s: saving SPE state\n", __func__);
+
+	if (task == current)
+		flush_spe_to_thread(task);
+
+	cpu_hdr->acc = task->thread.acc;
+	cpu_hdr->spefscr = task->thread.spefscr;
+	memcpy(cpu_hdr->evr, task->thread.evr, sizeof(cpu_hdr->evr));
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_SPE);
+}
+#else
+static void checkpoint_spe(struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task)
+{
+	return;
+}
+#endif
+
+static void checkpoint_dabr(struct ckpt_hdr_cpu *cpu_hdr,
+			    const struct task_struct *task)
+{
+	if (!task->thread.dabr)
+		return;
+
+	cpu_hdr->dabr = task->thread.dabr;
+	ckpt_cpu_feature_set(cpu_hdr, CKPT_USED_DEBUG);
+}
+
+/* dump the thread_struct of a given task */
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *cpu_hdr;
+	int rc;
+
+	rc = -ENOMEM;
+	cpu_hdr = ckpt_hdr_get_type(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
+	if (!cpu_hdr)
+		goto err;
+
+	checkpoint_gprs(cpu_hdr, t);
+	checkpoint_fpu(cpu_hdr, t);
+	checkpoint_dabr(cpu_hdr, t);
+	checkpoint_altivec(cpu_hdr, t);
+	checkpoint_spe(cpu_hdr, t);
+
+	rc = ckpt_write_obj(ctx, (struct ckpt_hdr *) cpu_hdr);
+err:
+	ckpt_hdr_put(ctx, cpu_hdr);
+	return rc;
+}
+
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *arch_hdr;
+	int ret;
+
+	arch_hdr = ckpt_hdr_get_type(ctx, sizeof(*arch_hdr),
+				     CKPT_HDR_HEADER_ARCH);
+	if (!arch_hdr)
+		return -ENOMEM;
+
+	arch_hdr->what = 0xdeadbeef;
+
+	ret = ckpt_write_obj(ctx, &arch_hdr->h);
+	ckpt_hdr_put(ctx, arch_hdr);
+
+	return ret;
+}
+
+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/* read the thread_struct into the current task */
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	return 0;
+}
+
+/* Based on the MSR value from a checkpoint image, produce an MSR
+ * value that is appropriate for the restored task.  Right now we only
+ * check for MSR_SF (64-bit) for PPC64.
+ */
+static unsigned long sanitize_msr(unsigned long msr_ckpt)
+{
+#ifdef CONFIG_PPC32
+	return MSR_USER;
+#else
+	if (msr_ckpt & MSR_SF)
+		return MSR_USER64;
+	return MSR_USER32;
+#endif
+}
+
+static int restore_gprs(const struct ckpt_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	struct pt_regs *regs;
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->pt_regs_size != sizeof(*regs))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	regs = task_pt_regs(task);
+	*regs = cpu_hdr->pt_regs;
+
+	regs->orig_gpr3 = cpu_hdr->orig_gpr3;
+
+	regs->msr = sanitize_msr(regs->msr);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_PPC_FPU
+static int restore_fpu(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = -EINVAL;
+	if (cpu_hdr->fpr_size != sizeof(task->thread.fpr))
+		goto out;
+
+	rc = 0;
+	if (!update || !ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_FP))
+		goto out;
+
+	task->thread.fpscr.val = cpu_hdr->fpscr;
+	task->thread.fpexc_mode = cpu_hdr->fpexc_mode;
+
+	memcpy(task->thread.fpr, cpu_hdr->fpr, sizeof(task->thread.fpr));
+out:
+	return rc;
+}
+#else
+static int restore_fpu(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_FP));
+	return 0;
+}
+#endif
+
+static int restore_dabr(const struct ckpt_hdr_cpu *cpu_hdr,
+			struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_DEBUG))
+		goto out;
+
+	rc = -EINVAL;
+	if (!debugreg_valid(cpu_hdr->dabr, 0))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	debugreg_update(task, cpu_hdr->dabr, 0);
+out:
+	return rc;
+}
+
+#ifdef CONFIG_ALTIVEC
+static int restore_altivec(const struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_ALTIVEC))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.vrsave = cpu_hdr->vrsave;
+	task->thread.used_vr = 1;
+
+	memcpy(task->thread.vr, cpu_hdr->vr, sizeof(cpu_hdr->vr));
+out:
+	return rc;
+}
+#else
+static int restore_altivec(const struct ckpt_hdr_cpu *cpu_hdr,
+			   struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_ALTIVEC));
+	return 0;
+}
+#endif
+
+#ifdef CONFIG_SPE
+static int restore_spe(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	int rc;
+
+	rc = 0;
+	if (!ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE))
+		goto out;
+
+	rc = -EINVAL;
+	if (!cpu_has_feature(CPU_FTR_SPE))
+		goto out;
+
+	rc = 0;
+	if (!update)
+		goto out;
+
+	task->thread.acc = cpu_hdr->acc;
+	task->thread.spefscr = cpu_hdr->spefscr;
+	task->thread.used_spe = 1;
+
+	memcpy(task->thread.evr, cpu_hdr->evr, sizeof(cpu_hdr->evr));
+out:
+	return rc;
+}
+#else
+static int restore_spe(const struct ckpt_hdr_cpu *cpu_hdr,
+		       struct task_struct *task, bool update)
+{
+	WARN_ON_ONCE(ckpt_cpu_feature_isset(cpu_hdr, CKPT_USED_SPE));
+	return 0;
+}
+#endif
+
+struct restore_func_desc {
+	int (*func)(const struct ckpt_hdr_cpu *, struct task_struct *, bool);
+	const char *info;
+};
+
+typedef int (*restore_func_t)(const struct ckpt_hdr_cpu *,
+			      struct task_struct *, bool);
+
+static const restore_func_t restore_funcs[] = {
+	restore_gprs,
+	restore_fpu,
+	restore_dabr,
+	restore_altivec,
+	restore_spe,
+};
+
+static bool bitness_match(const struct ckpt_hdr_cpu *cpu_hdr,
+			  const struct task_struct *task)
+{
+	/* 64-bit image */
+	if (cpu_hdr->pt_regs.msr & MSR_SF) {
+		if (task->thread.regs->msr & MSR_SF)
+			return true;
+		else
+			return false;
+	}
+
+	/* 32-bit image */
+	if (task->thread.regs->msr & MSR_SF)
+		return false;
+
+	return true;
+}
+
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *cpu_hdr;
+	bool update;
+	int rc;
+	int i;
+
+	cpu_hdr = ckpt_read_obj_type(ctx, sizeof(*cpu_hdr), CKPT_HDR_CPU);
+	if (IS_ERR(cpu_hdr))
+		return PTR_ERR(cpu_hdr);
+
+	rc = -EINVAL;
+	if (ckpt_cpu_features_unknown(cpu_hdr))
+		goto err;
+
+	/* temporary: restoring a 32-bit image from a 64-bit task and
+	 * vice-versa is known not to work (probably not restoring
+	 * thread_info correctly); detect this and fail gracefully.
+	 */
+	if (!bitness_match(cpu_hdr, current))
+		goto err;
+
+	/* We want to determine whether there's anything wrong with
+	 * the checkpoint image before changing the task at all.  Run
+	 * a "check" phase (update = false) first.
+	 */
+	update = false;
+commit:
+	for (i = 0; i < ARRAY_SIZE(restore_funcs); i++) {
+		rc = restore_funcs[i](cpu_hdr, current, update);
+		if (rc == 0)
+			continue;
+		pr_debug("%s: restore_func[%i] failed\n", __func__, i);
+		WARN_ON_ONCE(update);
+		goto err;
+	}
+
+	if (!update) {
+		update = true;
+		goto commit;
+	}
+
+err:
+	ckpt_hdr_put(ctx, cpu_hdr);
+	return rc;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *arch_hdr;
+
+	arch_hdr = ckpt_read_obj_type(ctx, sizeof(*arch_hdr),
+				      CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(arch_hdr))
+		return PTR_ERR(arch_hdr);
+
+	ckpt_hdr_put(ctx, arch_hdr);
+
+	return 0;
+}
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	return 0;
+}
-- 
1.6.0.4
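
The check-then-commit flow in restore_cpu() above can be sketched in
isolation. What follows is a minimal userspace sketch, not the patch's
code: struct image, struct task_state, the FTR_* bits and the alignment
rule in restore_debug() are hypothetical stand-ins chosen for
illustration. Every restore function is called once with update = false,
so a bad image is rejected before any task state is touched, and then
again with update = true.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical feature bit numbers, in the spirit of CKPT_USED_*. */
enum { FTR_FP = 0, FTR_DEBUG = 1 };
#define FTRS_POSSIBLE ((1ULL << FTR_FP) | (1ULL << FTR_DEBUG))

/* Hypothetical stand-in for the checkpoint image header. */
struct image {
	uint64_t features_used;
	uint32_t regs_size;
	uint64_t regs[4];
	uint64_t dabr;
};

/* Hypothetical stand-in for the state of the task being restored. */
struct task_state {
	uint64_t regs[4];
	uint64_t dabr;
};

typedef int (*restore_fn)(const struct image *, struct task_state *, bool);

static int restore_regs(const struct image *img, struct task_state *t,
			bool update)
{
	/* Validate first; this runs in both passes. */
	if (img->regs_size != sizeof(t->regs))
		return -EINVAL;
	if (update)
		memcpy(t->regs, img->regs, sizeof(t->regs));
	return 0;
}

static int restore_debug(const struct image *img, struct task_state *t,
			 bool update)
{
	/* Nothing to do if the image did not use the feature. */
	if (!(img->features_used & (1ULL << FTR_DEBUG)))
		return 0;
	/* Assumed validity rule for this sketch: 8-byte aligned address. */
	if (img->dabr & 0x7)
		return -EINVAL;
	if (update)
		t->dabr = img->dabr;
	return 0;
}

static const restore_fn restore_fns[] = { restore_regs, restore_debug };

int restore_state(const struct image *img, struct task_state *t)
{
	size_t i;
	int pass;

	/* Reject images that use feature bits we do not know about. */
	if (img->features_used & ~FTRS_POSSIBLE)
		return -EINVAL;

	/* Pass 0 only checks; pass 1 commits. */
	for (pass = 0; pass < 2; pass++) {
		bool update = (pass == 1);

		for (i = 0; i < sizeof(restore_fns) / sizeof(restore_fns[0]); i++) {
			int rc = restore_fns[i](img, t, update);

			if (rc)
				return rc;
		}
	}
	return 0;
}
```

With this shape, an image whose debug register fails validation returns
-EINVAL and leaves the task state exactly as it was, even though its
register block on its own was valid.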

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 78/80] powerpc: wire up checkpoint and restart syscalls
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (76 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 77/80] powerpc: checkpoint/restart implementation Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:51   ` [PATCH v18 79/80] powerpc: enable checkpoint support in Kconfig Oren Laadan
                     ` (3 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Nathan Lynch,
	Ingo Molnar, Pavel Emelyanov

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 arch/powerpc/include/asm/systbl.h |    2 ++
 arch/powerpc/include/asm/unistd.h |    4 +++-
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 370600c..3d44cf3 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -326,3 +326,5 @@ SYSCALL_SPU(perf_counter_open)
 COMPAT_SYS_SPU(preadv)
 COMPAT_SYS_SPU(pwritev)
 COMPAT_SYS(rt_tgsigqueueinfo)
+SYSCALL(checkpoint)
+SYSCALL(restart)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index cef080b..ef41ebb 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -345,10 +345,12 @@
 #define __NR_preadv		320
 #define __NR_pwritev		321
 #define __NR_rt_tgsigqueueinfo	322
+#define __NR_checkpoint		323
+#define __NR_restart		324
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		323
+#define __NR_syscalls		325
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 79/80] powerpc: enable checkpoint support in Kconfig
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (77 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 78/80] powerpc: wire up checkpoint and restart syscalls Oren Laadan
@ 2009-09-23 23:51   ` Oren Laadan
  2009-09-23 23:52   ` [PATCH v18 80/80] powerpc: clone_with_pids implementation Oren Laadan
                     ` (2 subsequent siblings)
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Nathan Lynch,
	Ingo Molnar, Pavel Emelyanov

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 arch/powerpc/Kconfig |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d00131c..2ca160e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -26,6 +26,9 @@ config MMU
 	bool
 	default y
 
+config CHECKPOINT_SUPPORT
+	def_bool y
+
 config GENERIC_CMOS_UPDATE
 	def_bool y
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 80/80] powerpc: clone_with_pids implementation
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (78 preceding siblings ...)
  2009-09-23 23:51   ` [PATCH v18 79/80] powerpc: enable checkpoint support in Kconfig Oren Laadan
@ 2009-09-23 23:52   ` Oren Laadan
  2009-09-24 13:05   ` [PATCH 00/80] Kernel based checkpoint/restart [v18] Rishikesh
  2009-09-24 22:41   ` Andrew Morton
  81 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Nathan Lynch,
	Ingo Molnar, Pavel Emelyanov

From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

Mostly copy-paste from existing clone code.

(may want to hold off applying this until the ongoing clone_with_pids
discussion is resolved.)

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 arch/powerpc/include/asm/syscalls.h |    4 ++++
 arch/powerpc/include/asm/systbl.h   |    1 +
 arch/powerpc/include/asm/unistd.h   |    3 ++-
 arch/powerpc/kernel/entry_32.S      |    8 ++++++++
 arch/powerpc/kernel/entry_64.S      |    5 +++++
 arch/powerpc/kernel/process.c       |   20 +++++++++++++++++++-
 6 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index eb8eb40..25dfe8b 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -24,6 +24,10 @@ asmlinkage int sys_execve(unsigned long a0, unsigned long a1,
 asmlinkage int sys_clone(unsigned long clone_flags, unsigned long usp,
 		int __user *parent_tidp, void __user *child_threadptr,
 		int __user *child_tidp, int p6, struct pt_regs *regs);
+asmlinkage int sys_clone_with_pids(unsigned long clone_flags,
+		unsigned long usp, int __user *parent_tidp,
+		void __user *child_threadptr, int __user *child_tidp,
+		void __user *upid_setp, struct pt_regs *regs);
 asmlinkage int sys_fork(unsigned long p1, unsigned long p2,
 		unsigned long p3, unsigned long p4, unsigned long p5,
 		unsigned long p6, struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 3d44cf3..e3c0b96 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -328,3 +328,4 @@ COMPAT_SYS_SPU(pwritev)
 COMPAT_SYS(rt_tgsigqueueinfo)
 SYSCALL(checkpoint)
 SYSCALL(restart)
+PPC_SYS(clone_with_pids)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index ef41ebb..c54db8b 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -347,10 +347,11 @@
 #define __NR_rt_tgsigqueueinfo	322
 #define __NR_checkpoint		323
 #define __NR_restart		324
+#define __NR_clone_with_pids	325
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		325
+#define __NR_syscalls		326
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 3cadba6..081dd36 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -586,6 +586,14 @@ ppc_clone:
 	stw	r0,_TRAP(r1)		/* register set saved */
 	b	sys_clone
 
+	.globl	ppc_clone_with_pids
+ppc_clone_with_pids:
+	SAVE_NVGPRS(r1)
+	lwz	r0,_TRAP(r1)
+	rlwinm	r0,r0,0,0,30		/* clear LSB to indicate full */
+	stw	r0,_TRAP(r1)		/* register set saved */
+	b	sys_clone_with_pids
+
 	.globl	ppc_swapcontext
 ppc_swapcontext:
 	SAVE_NVGPRS(r1)
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 43e0734..39fa316 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -320,6 +320,11 @@ _GLOBAL(ppc_clone)
 	bl	.sys_clone
 	b	syscall_exit
 
+_GLOBAL(ppc_clone_with_pids)
+	bl	.save_nvgprs
+	bl	.sys_clone_with_pids
+	b	syscall_exit
+
 _GLOBAL(ppc32_swapcontext)
 	bl	.save_nvgprs
 	bl	.compat_sys_swapcontext
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 892a9f2..7bf21cc 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -895,7 +895,25 @@ int sys_clone(unsigned long clone_flags, unsigned long usp,
 		child_tidp = TRUNC_PTR(child_tidp);
 	}
 #endif
- 	return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp);
+	return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp);
+}
+
+int sys_clone_with_pids(unsigned long clone_flags, unsigned long usp,
+	      int __user *parent_tidp, void __user *child_threadptr,
+	      int __user *child_tidp, void __user *upid_setp,
+	      struct pt_regs *regs)
+{
+	CHECK_FULL_REGS(regs);
+	if (usp == 0)
+		usp = regs->gpr[1];
+#ifdef CONFIG_PPC64
+	if (test_thread_flag(TIF_32BIT)) {
+		parent_tidp = TRUNC_PTR(parent_tidp);
+		child_tidp = TRUNC_PTR(child_tidp);
+	}
+#endif
+	return do_fork_with_pids(clone_flags, usp, regs, 0,
+				 parent_tidp, child_tidp, upid_setp);
 }
 
 int sys_fork(unsigned long p1, unsigned long p2, unsigned long p3,
-- 
1.6.0.4
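
The TIF_32BIT branch in sys_clone_with_pids() above masks the user
pointers of a 32-bit task running on a 64-bit kernel. A minimal
userspace sketch of that masking follows. It assumes TRUNC_PTR keeps
only the low 32 bits of its argument; struct clone_args_sketch,
trunc_ptr() and fixup_clone_args() are hypothetical names, and pointers
are modelled as 64-bit integers so the sketch builds on any host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two pointers the patch truncates. */
struct clone_args_sketch {
	uint64_t parent_tidp;
	uint64_t child_tidp;
};

/* Keep only the low 32 bits (assumed behaviour of TRUNC_PTR). */
static uint64_t trunc_ptr(uint64_t reg)
{
	return reg & UINT64_C(0xffffffff);
}

/* Mask the pointers only when the calling task is 32-bit. */
void fixup_clone_args(struct clone_args_sketch *a, bool task_is_32bit)
{
	if (!task_is_32bit)
		return;
	a->parent_tidp = trunc_ptr(a->parent_tidp);
	a->child_tidp = trunc_ptr(a->child_tidp);
}
```

A 64-bit task keeps its full pointer values; only the 32-bit case is
masked, which matches the shape of the #ifdef CONFIG_PPC64 block.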

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 80/80] powerpc: clone_with_pids implementation
  2009-09-23 23:50 ` Oren Laadan
@ 2009-09-23 23:52   ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

Mostly copy-paste from existing clone code.

(may want to hold off applying this until the ongoing clone_with_pids
discussion is resolved.)

Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 arch/powerpc/include/asm/syscalls.h |    4 ++++
 arch/powerpc/include/asm/systbl.h   |    1 +
 arch/powerpc/include/asm/unistd.h   |    3 ++-
 arch/powerpc/kernel/entry_32.S      |    8 ++++++++
 arch/powerpc/kernel/entry_64.S      |    5 +++++
 arch/powerpc/kernel/process.c       |   20 +++++++++++++++++++-
 6 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index eb8eb40..25dfe8b 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -24,6 +24,10 @@ asmlinkage int sys_execve(unsigned long a0, unsigned long a1,
 asmlinkage int sys_clone(unsigned long clone_flags, unsigned long usp,
 		int __user *parent_tidp, void __user *child_threadptr,
 		int __user *child_tidp, int p6, struct pt_regs *regs);
+asmlinkage int sys_clone_with_pids(unsigned long clone_flags,
+		unsigned long usp, int __user *parent_tidp,
+		void __user *child_threadptr, int __user *child_tidp,
+		void __user *upid_setp, struct pt_regs *regs);
 asmlinkage int sys_fork(unsigned long p1, unsigned long p2,
 		unsigned long p3, unsigned long p4, unsigned long p5,
 		unsigned long p6, struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 3d44cf3..e3c0b96 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -328,3 +328,4 @@ COMPAT_SYS_SPU(pwritev)
 COMPAT_SYS(rt_tgsigqueueinfo)
 SYSCALL(checkpoint)
 SYSCALL(restart)
+PPC_SYS(clone_with_pids)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index ef41ebb..c54db8b 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -347,10 +347,11 @@
 #define __NR_rt_tgsigqueueinfo	322
 #define __NR_checkpoint		323
 #define __NR_restart		324
+#define __NR_clone_with_pids	325
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		325
+#define __NR_syscalls		326
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 3cadba6..081dd36 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -586,6 +586,14 @@ ppc_clone:
 	stw	r0,_TRAP(r1)		/* register set saved */
 	b	sys_clone
 
+	.globl	ppc_clone_with_pids
+ppc_clone_with_pids:
+	SAVE_NVGPRS(r1)
+	lwz	r0,_TRAP(r1)
+	rlwinm	r0,r0,0,0,30		/* clear LSB to indicate full */
+	stw	r0,_TRAP(r1)		/* register set saved */
+	b	sys_clone_with_pids
+
 	.globl	ppc_swapcontext
 ppc_swapcontext:
 	SAVE_NVGPRS(r1)
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 43e0734..39fa316 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -320,6 +320,11 @@ _GLOBAL(ppc_clone)
 	bl	.sys_clone
 	b	syscall_exit
 
+_GLOBAL(ppc_clone_with_pids)
+	bl	.save_nvgprs
+	bl	.sys_clone_with_pids
+	b	syscall_exit
+
 _GLOBAL(ppc32_swapcontext)
 	bl	.save_nvgprs
 	bl	.compat_sys_swapcontext
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 892a9f2..7bf21cc 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -895,7 +895,25 @@ int sys_clone(unsigned long clone_flags, unsigned long usp,
 		child_tidp = TRUNC_PTR(child_tidp);
 	}
 #endif
- 	return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp);
+	return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp);
+}
+
+int sys_clone_with_pids(unsigned long clone_flags, unsigned long usp,
+	      int __user *parent_tidp, void __user *child_threadptr,
+	      int __user *child_tidp, void __user *upid_setp,
+	      struct pt_regs *regs)
+{
+	CHECK_FULL_REGS(regs);
+	if (usp == 0)
+		usp = regs->gpr[1];
+#ifdef CONFIG_PPC64
+	if (test_thread_flag(TIF_32BIT)) {
+		parent_tidp = TRUNC_PTR(parent_tidp);
+		child_tidp = TRUNC_PTR(child_tidp);
+	}
+#endif
+	return do_fork_with_pids(clone_flags, usp, regs, 0,
+				 parent_tidp, child_tidp, upid_setp);
 }
 
 int sys_fork(unsigned long p1, unsigned long p2, unsigned long p3,
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH v18 80/80] powerpc: clone_with_pids implementation
@ 2009-09-23 23:52   ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Serge Hallyn, Ingo Molnar, Pavel Emelyanov, Nathan Lynch

From: Nathan Lynch <ntl@pobox.com>

Mostly copy-paste from existing clone code.

(may want to hold off applying this until the ongoing clone_with_pids
discussion is resolved.)

Signed-off-by: Nathan Lynch <ntl@pobox.com>
---
 arch/powerpc/include/asm/syscalls.h |    4 ++++
 arch/powerpc/include/asm/systbl.h   |    1 +
 arch/powerpc/include/asm/unistd.h   |    3 ++-
 arch/powerpc/kernel/entry_32.S      |    8 ++++++++
 arch/powerpc/kernel/entry_64.S      |    5 +++++
 arch/powerpc/kernel/process.c       |   20 +++++++++++++++++++-
 6 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index eb8eb40..25dfe8b 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -24,6 +24,10 @@ asmlinkage int sys_execve(unsigned long a0, unsigned long a1,
 asmlinkage int sys_clone(unsigned long clone_flags, unsigned long usp,
 		int __user *parent_tidp, void __user *child_threadptr,
 		int __user *child_tidp, int p6, struct pt_regs *regs);
+asmlinkage int sys_clone_with_pids(unsigned long clone_flags,
+		unsigned long usp, int __user *parent_tidp,
+		void __user *child_threadptr, int __user *child_tidp,
+		void __user *upid_setp, struct pt_regs *regs);
 asmlinkage int sys_fork(unsigned long p1, unsigned long p2,
 		unsigned long p3, unsigned long p4, unsigned long p5,
 		unsigned long p6, struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 3d44cf3..e3c0b96 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -328,3 +328,4 @@ COMPAT_SYS_SPU(pwritev)
 COMPAT_SYS(rt_tgsigqueueinfo)
 SYSCALL(checkpoint)
 SYSCALL(restart)
+PPC_SYS(clone_with_pids)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index ef41ebb..c54db8b 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -347,10 +347,11 @@
 #define __NR_rt_tgsigqueueinfo	322
 #define __NR_checkpoint		323
 #define __NR_restart		324
+#define __NR_clone_with_pids	325
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		325
+#define __NR_syscalls		326
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 3cadba6..081dd36 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -586,6 +586,14 @@ ppc_clone:
 	stw	r0,_TRAP(r1)		/* register set saved */
 	b	sys_clone
 
+	.globl	ppc_clone_with_pids
+ppc_clone_with_pids:
+	SAVE_NVGPRS(r1)
+	lwz	r0,_TRAP(r1)
+	rlwinm	r0,r0,0,0,30		/* clear LSB to indicate full */
+	stw	r0,_TRAP(r1)		/* register set saved */
+	b	sys_clone_with_pids
+
 	.globl	ppc_swapcontext
 ppc_swapcontext:
 	SAVE_NVGPRS(r1)
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 43e0734..39fa316 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -320,6 +320,11 @@ _GLOBAL(ppc_clone)
 	bl	.sys_clone
 	b	syscall_exit
 
+_GLOBAL(ppc_clone_with_pids)
+	bl	.save_nvgprs
+	bl	.sys_clone_with_pids
+	b	syscall_exit
+
 _GLOBAL(ppc32_swapcontext)
 	bl	.save_nvgprs
 	bl	.compat_sys_swapcontext
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 892a9f2..7bf21cc 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -895,7 +895,25 @@ int sys_clone(unsigned long clone_flags, unsigned long usp,
 		child_tidp = TRUNC_PTR(child_tidp);
 	}
 #endif
- 	return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp);
+	return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp);
+}
+
+int sys_clone_with_pids(unsigned long clone_flags, unsigned long usp,
+	      int __user *parent_tidp, void __user *child_threadptr,
+	      int __user *child_tidp, void __user *upid_setp,
+	      struct pt_regs *regs)
+{
+	CHECK_FULL_REGS(regs);
+	if (usp == 0)
+		usp = regs->gpr[1];
+#ifdef CONFIG_PPC64
+	if (test_thread_flag(TIF_32BIT)) {
+		parent_tidp = TRUNC_PTR(parent_tidp);
+		child_tidp = TRUNC_PTR(child_tidp);
+	}
+#endif
+	return do_fork_with_pids(clone_flags, usp, regs, 0,
+				 parent_tidp, child_tidp, upid_setp);
 }
 
 int sys_fork(unsigned long p1, unsigned long p2, unsigned long p3,
-- 
1.6.0.4

* Re: [PATCH v18 19/80] c/r: documentation
  2009-09-23 23:50   ` Oren Laadan
  (?)
@ 2009-09-24  9:10   ` Nikita V. Youshchenko
  -1 siblings, 0 replies; 327+ messages in thread
From: Nikita V. Youshchenko @ 2009-09-24  9:10 UTC (permalink / raw)
  To: linux-kernel

> +   - RESTART_FROEZN : processes remain frozen once restart completes

A small typo here.

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (79 preceding siblings ...)
  2009-09-23 23:52   ` [PATCH v18 80/80] powerpc: clone_with_pids implementation Oren Laadan
@ 2009-09-24 13:05   ` Rishikesh
  2009-09-24 22:41   ` Andrew Morton
  81 siblings, 0 replies; 327+ messages in thread
From: Rishikesh @ 2009-09-24 13:05 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi Oren,

I am getting the following build error while compiling the linux-cr kernel:

git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

...
net/unix/af_unix.c:528: error: ‘unix_collect’ undeclared here (not in a function)
LD [M] drivers/net/enic/enic.o
make[2]: *** [net/unix/af_unix.o] Error 1
make[1]: *** [net/unix] Error 2
make: *** [net] Error 2
make: *** Waiting for unfinished jobs....
...

Let me know if you need the config file.

-Rishi

Oren Laadan wrote:
> Hi Andrew,
>
> This is our recent round of checkpoint/restart patches. It can
> checkpoint and restart interactive sessions of 'screen' across 
> kernel reboot. Please consider applying to -mm.
>
> Patches 1-17 are clean-ups and preparations for c/r:
>  * 1,2,3,4 and 9,10: cleanups, also useful for c/r.
>  * 5,6: fix freezer control group
>  * 7,8: extend freezer control group for c/r.
>  * 11-17: clone_with_pid
>
> Patch 18 reserves the system call slots - please apply so we
> don't need to keep changing them.
>
> Patches 19-80 contain the actual c/r code; we've exhausted the
> reviewers for most of them.
>
> Patch 32 implements a deferqueue - a mechanism for a process to
> defer work to some later time (unlike a workqueue, the work is
> designed to execute in the context of the same/original process).
>
> Thanks,
>
> Oren.
>
> ----
>
> Application checkpoint/restart (c/r) is the ability to save the state
> of a running application so that it can later resume its execution
> from the time at which it was checkpointed, on the same or a different
> machine.
>
> This version brings many new features, including support for
> unix domain sockets, fifos, pseudo-terminals, and signals (see the
> detailed changelog below).
>
> With these in place, it can now checkpoint and restart not only batch
> jobs, but also interactive programs using 'screen'. For example, users
> can checkpoint a 'screen' session with multiple shells, upgrade their
> kernel, reboot, and restart their interactive 'screen' session from
> before !
>
> This patchset was compiled and tested against v2.6.31. For more
> information, check out Documentation/checkpoint/*.txt
>
> Q: How useful is this code as it stands in real-world usage?
> A: The application can consist of one or more processes and threads.
>    It handles open files (regular files/directories on most file
>    systems, pipes, fifos, af_unix sockets,
>    /dev/{null,zero,random,urandom} and pseudo-terminals). It supports
>    shared memory and sysv IPC (except undo of semaphores). It's
>    suitable for many types of batch jobs as well as some interactive
>    jobs. (Note: it is assumed that the fs view is available at
>    restart.)
>
> Q: What can it checkpoint and restart ?
> A: A (single threaded) process can checkpoint itself, aka "self"
>    checkpoint, if it calls the new system calls. Otherwise, for an
>    "external" checkpoint, the caller must first freeze the target
>    processes. One can either checkpoint an entire container (and
>    we make best effort to ensure that the result is self-contained),
>    or merely a subtree of a process hierarchy.
>
> Q: What about namespaces ?
> A: Currently, UTS and IPC namespaces are restored. They demonstrate
>    how namespaces are handled. More to come.
>
> Q: What additional work needs to be done to it?
> A: Fill in the gory details following the examples so far. Current WIP
>    includes inet sockets, event-poll, and early work on inotify, mount
>    namespace and mount-points, pseudo file systems, and x86_64 support.
>    
> Q: How can I try it ?
> A: Use it for simple batch jobs (pipes, too), or an interactive
>    'screen' session, in a whole container or just a subtree of
>    tasks:
>
>    create the freezer cgroup:
>      $ mount -t cgroup -ofreezer freezer /cgroup
>      $ mkdir /cgroup/0
>    
>    run the test, freeze it:  
>      $ test/multitask &
>      [1] 2754
>      $ for i in `pidof multitask`; do echo $i > /cgroup/0/tasks; done
>      $ echo FROZEN > /cgroup/0/freezer.state
>    
>    checkpoint:
>      $ ./ckpt 2754 > ckpt.out
>    
>    restart:
>      $ ./mktree < ckpt.out
>    
>    voila :)
>    
> To do all this, you'll need:
>
> The git tree tracking v18, branch 'ckpt-v18' (and past versions):
> 	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git
>
> The userspace tools are available through the matching branch [v18]:
> 	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
>
>
> Changelog:
>
> [2009-Sep-22] v18
>
>   (new features)
>   - [Nathan Lynch] Re-introduce powerpc support
>   - Save/restore pseudo-terminals
>   - Save/restore (pty) controlling terminals
>   - Save/restore PGIDs
>   - [Dan Smith] Save/restore unix domain sockets
>   - Save/restore FIFOs
>   - Save/restore pending signals
>   - Save/restore rlimits
>   - Save/restore itimers
>   - [Matt Helsley] Handle many non-pseudo file-systems
>
>   (other changes)
>   - Rename headerless struct ckpt_hdr_* to struct ckpt_*
>   - [Nathan Lynch] discard const from struct cred * where appropriate
>   - [Serge Hallyn][s390] Set return value for self-checkpoint 
>   - Handle kmalloc failure in restore_sem_array()
>   - [IPC] Collect files used by shm objects
>   - [IPC] Use file (not inode) as shared object on checkpoint of shm
>   - More ckpt_write_err()s to give information on checkpoint failure
>   - Adjust format of pipe buffer to include the mandatory pre-header
>   - [LEAKS] Mark the backing file as visited at checkpoint
>   - Tighten checks on supported vma to checkpoint or restart
>   - [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
>   - Introduce ckpt_collect_file() that also uses file->collect method
>   - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
>   - Fix leak-detection issue in collect_mm() (test for first-time obj)
>   - Invoke set_close_on_exec() unconditionally on restart
>   - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
>   - Interface to pass simple pointers as data with deferqueue
>   - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
>   - Replace EAGAIN with EBUSY where necessary
>   - Introduce CKPT_OBJ_VISITED in leak detection
>   - ckpt_obj_collect() returns objref for new objects, 0 otherwise
>   - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
>   - Introduce ckpt_obj_visit() to mark objects as visited
>   - Set the CHECKPOINTED flag on objects before calling checkpoint
>   - Introduce ckpt_obj_reserve()
>   - Change ref_drop() to accept a @lastref argument (for cleanup)
>   - Disallow multiple objects with same objref in restart
>   - Allow _ckpt_read_obj_type() to read header only (w/o payload)
>   - Fix leak of ckpt_ctx when restoring zombie tasks
>   - Fix race of prepare_descendant() with an ongoing fork()
>   - Track and report the first error if restart fails
>   - Tighten logic to protect against bogus pids in input
>   - [Matt Helsley] Improve debug output from ckpt_notify_error()
>   - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
>   - Detect error-headers in input data on restart, and abort.
>   - Standard format for checkpoint error strings (and documentation)
>   - [Dan Smith] Add an errno validation function
>   - Add ckpt_read_payload(): read a variable-length object (no header)
>   - Add ckpt_read_string(): same for strings (ensures null-terminated)
>   - Add ckpt_read_consume(): consumes next object without processing
>   - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
>
> [2009-Jul-21] v17
>   - Introduce syscall clone_with_pids() to restore original pids
>   - Support threads and zombies
>   - Save/restore task->files
>   - Save/restore task->sighand
>   - Save/restore futex
>   - Save/restore credentials
>   - Introduce PF_RESTARTING to skip notifications on task exit
>   - restart(2) allow caller to ask to freeze tasks after restart
>   - restart(2) isn't idempotent: return -EINTR if interrupted
>   - Improve debugging output handling 
>   - Make multi-process restart logic more robust and complete
>   - Correctly select return value for restarting tasks on success
>   - Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
>   - Use CHECKPOINTING state for frozen checkpointed tasks
>   - Fix compilation without CONFIG_CHECKPOINT
>   - Fix compilation with CONFIG_COMPAT
>   - Fix headers includes and exports
>   - Leak detection performed in two steps
>   - Detect "inverse" leaks of objects (dis)appearing unexpectedly
>   - Memory: save/restore mm->{flags,def_flags,saved_auxv}
>   - Memory: only collect sub-objects of mm once (leak detection)
>   - Files: validate f_mode after restore
>   - Namespaces: leak detection for nsproxy sub-components
>   - Namespaces: proper restart from namespace(s) without namespace(s)
>   - Save global constants in header instead of per-object
>   - IPC: replace sys_unshare() with create_ipc_ns()
>   - IPC: restore objects in suitable namespace
>   - IPC: correct behavior under !CONFIG_IPC_NS
>   - UTS: save/restore all fields
>   - UTS: replace sys_unshare() with create_uts_ns()
>   - X86_32: sanitize cpu, debug, and segment registers on restart
>   - cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
>   - cgroup_freezer: add interface to freeze a cgroup (given a task)
>
> [2009-May-27] v16
>   - Privilege checks for IPC checkpoint
>   - Fix error string generation during checkpoint
>   - Use kzalloc for header allocation
>   - Restart blocks are arch-independent
>   - Redo pipe c/r using splice
>   - Fixes to s390 arch
>   - Remove powerpc arch (temporary)
>   - Explicitly restore ->nsproxy
>   - All objects in image are preceded by 'struct ckpt_hdr'
>   - Fix leaks detection (and leaks)
>   - Reorder of patchset
>   - Misc bugs and compilation fixes
>
> [2009-Apr-12] v15
>   - Minor fixes
>
> [2009-Apr-28] v14
>   - Tested against kernel v2.6.30-rc3 on x86_32.
>   - Refactor files checkpoint to use f_ops (file operations)
>   - Refactor mm/vma to use vma_ops
>   - Explicitly handle VDSO vma (and require compat mode)
>   - Added code to c/r restart-blocks (restart timeout related syscalls)
>   - Added code to c/r namespaces: uts, ipc (with Dan Smith)
>   - Added code to c/r sysvipc (shm, msg, sem)
>   - Support for VM_CLONE shared memory
>   - Added resource leak detection for whole-container checkpoint
>   - Added sysctl gauge to allow unprivileged restart/checkpoint
>   - Improve and simplify the code and logic of shared objects
>   - Rework image format: shared objects appear prior to their use
>   - Merge checkpoint and restart functionality into same files
>   - Massive renaming of functions: prefix "ckpt_" for generics,
>     "checkpoint_" for checkpoint, and "restore_" for restart.
>   - Report checkpoint errors as a valid (string record) in the output
>   - Merged PPC architecture (by Nathan Lynch)
>   - Requires updates to userspace tools too.
>   - Misc nits and bug fixes
>
> [2009-Mar-31] v14-rc2
>   - Change along Dave's suggestion to use f_ops->checkpoint() for files
>   - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
>   - Merge support for PPC arch (Nathan Lynch)
>   - Misc cleanups and fixes in response to comments
>
> [2009-Mar-20] v14-rc1:
>   - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
>   - Check whether calls to cr_hbuf_get() succeed or fail.
>   - Fixes to pipe c/r code
>   - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
>   - Refuse non-self checkpoint if a task isn't frozen
>   - Use unsigned fields in checkpoint headers unless otherwise required
>   - Rename functions in files c/r to better reflect their role
>   - Add support for anonymous shared memory
>   - Merge support for s390 arch (Dan Smith, Serge Hallyn)
>     
> [2008-Dec-03] v13:
>   - Cleanups of 'struct cr_ctx' - remove unused fields
>   - Misc fixes for comments
>   
> [2008-Dec-17] v12:
>   - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
>     (empty pgarr are saved in a separate pool chain)
>   - Add a couple of missed calls to cr_hbuf_put()
>   - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
>   - Split cr_write/cr_read() to two parts: _cr_write/read() helper
>   - Befriend with sparse: explicit conversion to 'void __user *'
>   - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()
>
> [2008-Dec-05] v11:
>   - Use contents of 'init->fs->root' instead of pointing to it
>   - Ignore symlinks (there is no such thing as an open symlink)
>   - cr_scan_fds() retries from scratch if it hits size limits
>   - Add missing test for VM_MAYSHARE when dumping memory
>   - Improve documentation about: behavior when tasks aren't frozen,
>     life span of the object hash, references to objects in the hash
>
> [2008-Nov-26] v10:
>   - Grab vfs root of container init, rather than current process
>   - Acquire dcache_lock around call to __d_path() in cr_fill_name()
>   - Force end-of-string in cr_read_string() (fix possible DoS)
>   - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
>
> [2008-Nov-10] v9:
>   - Support multiple processes c/r
>   - Extend checkpoint header with architecture-dependent header
>   - Misc bug fixes (see individual changelogs)
>   - Rebase to v2.6.28-rc3.
>
> [2008-Oct-29] v8:
>   - Support "external" checkpoint
>   - Include Dave Hansen's 'deny-checkpoint' patch
>   - Split docs in Documentation/checkpoint/..., and improve contents
>
> [2008-Oct-17] v7:
>   - Fix save/restore state of FPU
>   - Fix argument given to kunmap_atomic() in memory dump/restore
>
> [2008-Oct-07] v6:
>   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
>     (even though it's not really needed)
>   - Add assumptions and what's-missing to documentation
>   - Misc fixes and cleanups
>
> [2008-Sep-11] v5:
>   - Config is now 'def_bool n' by default
>   - Improve memory dump/restore code (following Dave Hansen's comments)
>   - Change dump format (and code) to allow chunks of <vaddrs, pages>
>     instead of one long list of each
>   - Fix use of follow_page() to avoid faulting in non-present pages
>   - Memory restore now maps user pages explicitly to copy data into them,
>     instead of reading directly to user space; got rid of mprotect_fixup()
>   - Remove preempt_disable() when restoring debug registers
>   - Rename headers files s/ckpt/checkpoint/
>   - Fix misc bugs in files dump/restore
>   - Fixes and cleanups on some error paths
>   - Fix misc coding style
>
> [2008-Sep-09] v4:
>   - Various fixes and clean-ups
>   - Fix calculation of hash table size
>   - Fix header structure alignment
>   - Use standard list_... for cr_pgarr
>
> [2008-Aug-29] v3:
>   - Various fixes and clean-ups
>   - Use standard hlist_... for hash table
>   - Better use of standard kmalloc/kfree
>
> [2008-Aug-20] v2:
>   - Added Dump and restore of open files (regular and directories)
>   - Added basic handling of shared objects, and improve handling of
>     'parent tag' concept
>   - Added documentation
>   - Improved ABI, 64bit padding for image data
>   - Improved locking when saving/restoring memory
>   - Added UTS information to header (release, version, machine)
>   - Cleanup extraction of filename from a file pointer
>   - Refactor to allow easier reviewing
>   - Remove requirement for CAP_SYS_ADMIN until we come up with a
>     security policy (this means that file restore may fail)
>   - Other cleanup and response to comments for v1
>
> [2008-Jul-29] v1:
>   - Initial version: support a single task with address space of only
>     private anonymous or file-mapped VMAs; syscalls ignore pid/crid
>     argument and act on current process.
>
> --
> At the containers mini-conference before OLS, the consensus among
> all the stakeholders was that doing checkpoint/restart in the kernel
> as much as possible was the best approach.  With this approach, the
> kernel will export a relatively opaque 'blob' of data to userspace
> which can then be handed to the new kernel at restore time.
>
> This is different than what had been proposed before, which was
> that a userspace application would be responsible for collecting
> all of this data.  We were also planning on adding lots of new,
> little kernel interfaces for all of the things that needed
> checkpointing.  This unites those into a single, grand interface.
>
> The 'blob' will contain copies of select portions of kernel
> structures such as vmas and mm_structs.  It will also contain
> copies of the actual memory that the process uses.  Any changes
> in this blob's format between kernel revisions can be handled by
> an in-userspace conversion program.
>
> This is a similar approach to virtually all of the commercial
> checkpoint/restart products out there, as well as the research
> project Zap.
>
> These patches basically serialize internal kernel state and write
> it out to a file descriptor.  The checkpoint and restore are done
> with two new system calls: sys_checkpoint and sys_restart.
>
> In this incarnation, they can only checkpoint and restore a
> single task. The task's address space may consist of only private,
> simple vma's - anonymous or file-mapped. The open files may consist
> of only simple files and directories.

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
  2009-09-23 23:50 ` Oren Laadan
  (?)
@ 2009-09-24 13:05   ` Rishikesh
  -1 siblings, 0 replies; 327+ messages in thread
From: Rishikesh @ 2009-09-24 13:05 UTC (permalink / raw)
  To: Oren Laadan; +Cc: Andrew Morton, linux-api, containers, linux-kernel, linux-mm

Hi Oren,

I am getting the following build error while compiling the linux-cr kernel:

git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

...
net/unix/af_unix.c:528: error: ‘unix_collect’ undeclared here (not in a function)
LD [M] drivers/net/enic/enic.o
make[2]: *** [net/unix/af_unix.o] Error 1
make[1]: *** [net/unix] Error 2
make: *** [net] Error 2
make: *** Waiting for unfinished jobs....
...

Let me know if you need the config file.

-Rishi

Oren Laadan wrote:
> Hi Andrew,
>
> This is our recent round of checkpoint/restart patches. It can
> checkpoint and restart interactive sessions of 'screen' across 
> kernel reboot. Please consider applying to -mm.
>
> Patches 1-17 are clean-ups and preparations for c/r:
>  * 1,2,3,4 and 9,10: cleanups, also useful for c/r.
>  * 5,6: fix freezer control group
>  * 7,8: extend freezer control group for c/r.
>  * 11-17: clone_with_pid
>
> Patch 18 reserves the system call slots - please apply so we
> don't need to keep changing them.
>
> Patches 19-80 contain the actual c/r code; we've exhausted the
> reviewers for most of them.
>
> Patch 32 implements a deferqueue - a mechanism for a process to
> defer work to some later time (unlike a workqueue, the work is
> designed to execute in the context of the same/original process).
>
> Thanks,
>
> Oren.
>
> ----
>
> Application checkpoint/restart (c/r) is the ability to save the state
> of a running application so that it can later resume its execution
> from the time at which it was checkpointed, on the same or a different
> machine.
>
> This version brings many new features, including support for
> unix domain sockets, fifos, pseudo-terminals, and signals (see the
> detailed changelog below).
>
> With these in place, it can now checkpoint and restart not only batch
> jobs, but also interactive programs using 'screen'. For example, users
> can checkpoint a 'screen' session with multiple shells, upgrade their
> kernel, reboot, and restart their interactive 'screen' session from
> before !
>
> This patchset was compiled and tested against v2.6.31. For more
> information, check out Documentation/checkpoint/*.txt
>
> Q: How useful is this code as it stands in real-world usage?
> A: The application can consist of one or more processes and threads.
>    It handles open files (regular files/directories on most file
>    systems, pipes, fifos, af_unix sockets,
>    /dev/{null,zero,random,urandom} and pseudo-terminals). It supports
>    shared memory and sysv IPC (except undo of semaphores). It's
>    suitable for many types of batch jobs as well as some interactive
>    jobs. (Note: it is assumed that the fs view is available at
>    restart.)
>
> Q: What can it checkpoint and restart ?
> A: A (single threaded) process can checkpoint itself, aka "self"
>    checkpoint, if it calls the new system calls. Otherwise, for an
>    "external" checkpoint, the caller must first freeze the target
>    processes. One can either checkpoint an entire container (and
>    we make best effort to ensure that the result is self-contained),
>    or merely a subtree of a process hierarchy.
>
> Q: What about namespaces ?
> A: Currently, UTS and IPC namespaces are restored. They demonstrate
>    how namespaces are handled. More to come.
>
> Q: What additional work needs to be done to it?
> A: Fill in the gory details following the examples so far. Current WIP
>    includes inet sockets, event-poll, and early work on inotify, mount
>    namespace and mount-points, pseudo file systems, and x86_64 support.
>    
> Q: How can I try it ?
> A: Use it for simple batch jobs (pipes, too), or an interactive
>    'screen' session, in a whole container or just a subtree of
>    tasks:
>
>    create the freezer cgroup:
>      $ mount -t cgroup -ofreezer freezer /cgroup
>      $ mkdir /cgroup/0
>    
>    run the test, freeze it:  
>      $ test/multitask &
>      [1] 2754
>      $ for i in `pidof multitask`; do echo $i > /cgroup/0/tasks; done
>      $ echo FROZEN > /cgroup/0/freezer.state
>    
>    checkpoint:
>      $ ./ckpt 2754 > ckpt.out
>    
>    restart:
>      $ ./mktree < ckpt.out
>    
>    voila :)
>    
> To do all this, you'll need:
>
> The git tree tracking v18, branch 'ckpt-v18' (and past versions):
> 	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git
>
> The userspace tools are available through the matching branch [v18]:
> 	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
>
>
> Changelog:
>
> [2009-Sep-22] v18
>
>   (new features)
>   - [Nathan Lynch] Re-introduce powerpc support
>   - Save/restore pseudo-terminals
>   - Save/restore (pty) controlling terminals
>   - Save/restore PGIDs
>   - [Dan Smith] Save/restore unix domain sockets
>   - Save/restore FIFOs
>   - Save/restore pending signals
>   - Save/restore rlimits
>   - Save/restore itimers
>   - [Matt Helsley] Handle many non-pseudo file-systems
>
>   (other changes)
>   - Rename headerless struct ckpt_hdr_* to struct ckpt_*
>   - [Nathan Lynch] discard const from struct cred * where appropriate
>   - [Serge Hallyn][s390] Set return value for self-checkpoint 
>   - Handle kmalloc failure in restore_sem_array()
>   - [IPC] Collect files used by shm objects
>   - [IPC] Use file (not inode) as shared object on checkpoint of shm
>   - More ckpt_write_err()s to give information on checkpoint failure
>   - Adjust format of pipe buffer to include the mandatory pre-header
>   - [LEAKS] Mark the backing file as visited at checkpoint
>   - Tighten checks on supported vma to checkpoint or restart
>   - [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
>   - Introduce ckpt_collect_file() that also uses file->collect method
>   - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
>   - Fix leak-detection issue in collect_mm() (test for first-time obj)
>   - Invoke set_close_on_exec() unconditionally on restart
>   - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
>   - Interface to pass simple pointers as data with deferqueue
>   - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
>   - Replace EAGAIN with EBUSY where necessary
>   - Introduce CKPT_OBJ_VISITED in leak detection
>   - ckpt_obj_collect() returns objref for new objects, 0 otherwise
>   - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
>   - Introduce ckpt_obj_visit() to mark objects as visited
>   - Set the CHECKPOINTED flag on objects before calling checkpoint
>   - Introduce ckpt_obj_reserve()
>   - Change ref_drop() to accept a @lastref argument (for cleanup)
>   - Disallow multiple objects with same objref in restart
>   - Allow _ckpt_read_obj_type() to read header only (w/o payload)
>   - Fix leak of ckpt_ctx when restoring zombie tasks
>   - Fix race of prepare_descendant() with an ongoing fork()
>   - Track and report the first error if restart fails
>   - Tighten logic to protect against bogus pids in input
>   - [Matt Helsley] Improve debug output from ckpt_notify_error()
>   - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
>   - Detect error-headers in input data on restart, and abort.
>   - Standard format for checkpoint error strings (and documentation)
>   - [Dan Smith] Add an errno validation function
>   - Add ckpt_read_payload(): read a variable-length object (no header)
>   - Add ckpt_read_string(): same for strings (ensures null-terminated)
>   - Add ckpt_read_consume(): consumes next object without processing
>   - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
>
> [2009-Jul-21] v17
>   - Introduce syscall clone_with_pids() to restore original pids
>   - Support threads and zombies
>   - Save/restore task->files
>   - Save/restore task->sighand
>   - Save/restore futex
>   - Save/restore credentials
>   - Introduce PF_RESTARTING to skip notifications on task exit
>   - restart(2) allow caller to ask to freeze tasks after restart
>   - restart(2) isn't idempotent: return -EINTR if interrupted
>   - Improve debugging output handling 
>   - Make multi-process restart logic more robust and complete
>   - Correctly select return value for restarting tasks on success
>   - Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
>   - Use CHECKPOINTING state for frozen checkpointed tasks
>   - Fix compilation without CONFIG_CHECKPOINT
>   - Fix compilation with CONFIG_COMPAT
>   - Fix headers includes and exports
>   - Leak detection performed in two steps
>   - Detect "inverse" leaks of objects (dis)appearing unexpectedly
>   - Memory: save/restore mm->{flags,def_flags,saved_auxv}
>   - Memory: only collect sub-objects of mm once (leak detection)
>   - Files: validate f_mode after restore
>   - Namespaces: leak detection for nsproxy sub-components
>   - Namespaces: proper restart from namespace(s) without namespace(s)
>   - Save global constants in header instead of per-object
>   - IPC: replace sys_unshare() with create_ipc_ns()
>   - IPC: restore objects in suitable namespace
>   - IPC: correct behavior under !CONFIG_IPC_NS
>   - UTS: save/restore all fields
>   - UTS: replace sys_unshare() with create_uts_ns()
>   - X86_32: sanitize cpu, debug, and segment registers on restart
>   - cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
>   - cgroup_freezer: add interface to freeze a cgroup (given a task)
>
> [2009-May-27] v16
>   - Privilege checks for IPC checkpoint
>   - Fix error string generation during checkpoint
>   - Use kzalloc for header allocation
>   - Restart blocks are arch-independent
>   - Redo pipe c/r using splice
>   - Fixes to s390 arch
>   - Remove powerpc arch (temporary)
>   - Explicitly restore ->nsproxy
>   - All objects in image are preceded by 'struct ckpt_hdr'
>   - Fix leaks detection (and leaks)
>   - Reorder of patchset
>   - Misc bugs and compilation fixes
>
> [2009-Apr-12] v15
>   - Minor fixes
>
> [2009-Apr-28] v14
>   - Tested against kernel v2.6.30-rc3 on x86_32.
>   - Refactor files checkpoint to use f_ops (file operations)
>   - Refactor mm/vma to use vma_ops
>   - Explicitly handle VDSO vma (and require compat mode)
>   - Added code to c/r restart-blocks (restart timeout related syscalls)
>   - Added code to c/r namespaces: uts, ipc (with Dan Smith)
>   - Added code to c/r sysvipc (shm, msg, sem)
>   - Support for VM_CLONE shared memory
>   - Added resource leak detection for whole-container checkpoint
>   - Added sysctl gauge to allow unprivileged restart/checkpoint
>   - Improve and simplify the code and logic of shared objects
>   - Rework image format: shared objects appear prior to their use
>   - Merge checkpoint and restart functionality into same files
>   - Massive renaming of functions: prefix "ckpt_" for generics,
>     "checkpoint_" for checkpoint, and "restore_" for restart.
>   - Report checkpoint errors as a valid (string record) in the output
>   - Merged PPC architecture (by Nathan Lynch)
>   - Requires updates to userspace tools too.
>   - Misc nits and bug fixes
>
> [2009-Mar-31] v14-rc2
>   - Change along Dave's suggestion to use f_ops->checkpoint() for files
>   - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
>   - Merge support for PPC arch (Nathan Lynch)
>   - Misc cleanups and fixes in response to comments
>
> [2009-Mar-20] v14-rc1:
>   - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
>   - Check whether calls to cr_hbuf_get() succeed or fail.
>   - Fixes to pipe c/r code
>   - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
>   - Refuse non-self checkpoint if a task isn't frozen
>   - Use unsigned fields in checkpoint headers unless otherwise required
>   - Rename functions in files c/r to better reflect their role
>   - Add support for anonymous shared memory
>   - Merge support for s390 arch (Dan Smith, Serge Hallyn)
>     
> [2008-Dec-03] v13:
>   - Cleanups of 'struct cr_ctx' - remove unused fields
>   - Misc fixes for comments
>   
> [2008-Dec-17] v12:
>   - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
>     (empty pgarr are saved in a separate pool chain)
>   - Add a couple of missed calls to cr_hbuf_put()
>   - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
>   - Split cr_write/cr_read() to two parts: _cr_write/read() helper
>   - Befriend with sparse: explicit conversion to 'void __user *'
>   - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()
>
> [2008-Dec-05] v11:
>   - Use contents of 'init->fs->root' instead of pointing to it
>   - Ignore symlinks (there is no such thing as an open symlink)
>   - cr_scan_fds() retries from scratch if it hits size limits
>   - Add missing test for VM_MAYSHARE when dumping memory
>   - Improve documentation about: behavior when tasks aren't frozen,
>     life span of the object hash, references to objects in the hash
>
> [2008-Nov-26] v10:
>   - Grab vfs root of container init, rather than current process
>   - Acquire dcache_lock around call to __d_path() in cr_fill_name()
>   - Force end-of-string in cr_read_string() (fix possible DoS)
>   - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
>
> [2008-Nov-10] v9:
>   - Support multiple processes c/r
>   - Extend checkpoint header with architecture-dependent header
>   - Misc bug fixes (see individual changelogs)
>   - Rebase to v2.6.28-rc3.
>
> [2008-Oct-29] v8:
>   - Support "external" checkpoint
>   - Include Dave Hansen's 'deny-checkpoint' patch
>   - Split docs in Documentation/checkpoint/..., and improve contents
>
> [2008-Oct-17] v7:
>   - Fix save/restore state of FPU
>   - Fix argument given to kunmap_atomic() in memory dump/restore
>
> [2008-Oct-07] v6:
>   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
>     (even though it's not really needed)
>   - Add assumptions and what's-missing to documentation
>   - Misc fixes and cleanups
>
> [2008-Sep-11] v5:
>   - Config is now 'def_bool n' by default
>   - Improve memory dump/restore code (following Dave Hansen's comments)
>   - Change dump format (and code) to allow chunks of <vaddrs, pages>
>     instead of one long list of each
>   - Fix use of follow_page() to avoid faulting in non-present pages
>   - Memory restore now maps user pages explicitly to copy data into them,
>     instead of reading directly to user space; got rid of mprotect_fixup()
>   - Remove preempt_disable() when restoring debug registers
>   - Rename headers files s/ckpt/checkpoint/
>   - Fix misc bugs in files dump/restore
>   - Fixes and cleanups on some error paths
>   - Fix misc coding style
>
> [2008-Sep-09] v4:
>   - Various fixes and clean-ups
>   - Fix calculation of hash table size
>   - Fix header structure alignment
>   - Use standard list_... for cr_pgarr
>
> [2008-Aug-29] v3:
>   - Various fixes and clean-ups
>   - Use standard hlist_... for hash table
>   - Better use of standard kmalloc/kfree
>
> [2008-Aug-20] v2:
>   - Added Dump and restore of open files (regular and directories)
>   - Added basic handling of shared objects, and improve handling of
>     'parent tag' concept
>   - Added documentation
>   - Improved ABI, 64bit padding for image data
>   - Improved locking when saving/restoring memory
>   - Added UTS information to header (release, version, machine)
>   - Cleanup extraction of filename from a file pointer
>   - Refactor to allow easier reviewing
>   - Remove requirement for CAP_SYS_ADMIN until we come up with a
>     security policy (this means that file restore may fail)
>   - Other cleanup and response to comments for v1
>
> [2008-Jul-29] v1:
>   - Initial version: support a single task with address space of only
>     private anonymous or file-mapped VMAs; syscalls ignore pid/crid
>     argument and act on current process.
>
> --
> At the containers mini-conference before OLS, the consensus among
> all the stakeholders was that doing checkpoint/restart in the kernel
> as much as possible was the best approach.  With this approach, the
> kernel will export a relatively opaque 'blob' of data to userspace
> which can then be handed to the new kernel at restore time.
>
> This is different than what had been proposed before, which was
> that a userspace application would be responsible for collecting
> all of this data.  We were also planning on adding lots of new,
> little kernel interfaces for all of the things that needed
> checkpointing.  This unites those into a single, grand interface.
>
> The 'blob' will contain copies of select portions of kernel
> structures such as vmas and mm_structs.  It will also contain
> copies of the actual memory that the process uses.  Any changes
> in this blob's format between kernel revisions can be handled by
> an in-userspace conversion program.
>
> This is a similar approach to virtually all of the commercial
> checkpoint/restart products out there, as well as the research
> project Zap.
>
> These patches basically serialize internal kernel state and write
> it out to a file descriptor.  The checkpoint and restore are done
> with two new system calls: sys_checkpoint and sys_restart.
>
> In this incarnation, they can only checkpoint and restore a
> single task. The task's address space may consist of only private,
> simple vma's - anonymous or file-mapped. The open files may consist
> of only simple files and directories.
> --
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
>   


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
@ 2009-09-24 13:05   ` Rishikesh
  0 siblings, 0 replies; 327+ messages in thread
From: Rishikesh @ 2009-09-24 13:05 UTC (permalink / raw)
  To: Oren Laadan; +Cc: Andrew Morton, linux-api, containers, linux-kernel, linux-mm

Hi Oren,

I am getting the following build error while compiling the linux-cr kernel.

git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

...
76569 net/unix/af_unix.c:528: error: ‘unix_collect’ undeclared here (not in a function)
76570 LD [M] drivers/net/enic/enic.o
76571 make[2]: *** [net/unix/af_unix.o] Error 1
76572 make[1]: *** [net/unix] Error 2
76573 make: *** [net] Error 2
76574 make: *** Waiting for unfinished jobs....
...

Let me know if you need the config file.

-Rishi

Oren Laadan wrote:
> Hi Andrew,
>
> This is our recent round of checkpoint/restart patches. It can
> checkpoint and restart interactive sessions of 'screen' across 
> kernel reboot. Please consider applying to -mm.
>
> Patches 1-17 are clean-ups and preparations for c/r:
>  * 1,2,3,4 and 9,10: cleanups, also useful for c/r.
>  * 5,6: fix freezer control group
>  * 7,8: extend freezer control group for c/r.
>  * 11-17: clone_with_pid
>
> Patch 18 reserves the system calls slots - please apply so we
> don't need to keep changing them.
>
> Patches 19-80 contain the actual c/r code; we've exhausted the
> reviewers for most of them.
>
> Patch 32 implements a deferqueue - mechanism for a process to
> defer work for some later time (unlike workqueue, designed for
> the work to execute in the context of same/original process).
>
> Thanks,
>
> Oren.
>
> ----
>
> Application checkpoint/restart (c/r) is the ability to save the state
> of a running application so that it can later resume its execution
> from the time at which it was checkpointed, on the same or a different
> machine.
>
> This version adds many new features, including support for
> unix domain sockets, fifos, pseudo-terminals, and signals (see the
> detailed changelog below).
>
> With these in place, it can now checkpoint and restart not only batch
> jobs, but also interactive programs using 'screen'. For example, users
> can checkpoint a 'screen' session with multiple shells, upgrade their
> kernel, reboot, and restart their interactive 'screen' session from
> before !
>
> This patchset was compiled and tested against v2.6.31. For more
> information, check out Documentation/checkpoint/*.txt
>
> Q: How useful is this code as it stands in real-world usage?
> A: The application can consist of a single process or of multiple
>    processes and threads. It handles open files (regular
>    files/directories on most file systems, pipes, fifos, af_unix
>    sockets, /dev/{null,zero,random,urandom} and pseudo-terminals).
>    It supports shared memory and sysv IPC (except undo of
>    semaphores). It's suitable for many types of batch jobs as well
>    as some interactive jobs. (Note: it is assumed that the fs view is
>    available at restart.)
>
> Q: What can it checkpoint and restart ?
> A: A (single threaded) process can checkpoint itself, aka "self"
>    checkpoint, if it calls the new system calls. Otherwise, for an
>    "external" checkpoint, the caller must first freeze the target
>    processes. One can either checkpoint an entire container (and
>    we make best effort to ensure that the result is self-contained),
>    or merely a subtree of a process hierarchy.
>
> Q: What about namespaces ?
> A: Currently, UTS and IPC namespaces are restored. They demonstrate
>    how namespaces are handled. More to come.
>
> Q: What additional work needs to be done to it?
> A: Fill in the gory details following the examples so far. Current WIP
>    includes inet sockets, event-poll, and early work on inotify, mount
>    namespace and mount-points, pseudo file systems, and x86_64 support.
>    
> Q: How can I try it ?
> A: Use it for simple batch jobs (pipes, too), or an interactive
>    'screen' session, in a whole container or just a subtree of
>    tasks:
>
>    create the freezer cgroup:
>      $ mount -t cgroup -ofreezer freezer /cgroup
>      $ mkdir /cgroup/0
>    
>    run the test, freeze it:  
>      $ test/multitask &
>      [1] 2754
>      $ for i in `pidof multitask`; do echo $i > /cgroup/0/tasks; done
>      $ echo FROZEN > /cgroup/0/freezer.state
>    
>    checkpoint:
>      $ ./ckpt 2754 > ckpt.out
>    
>    restart:
>      $ ./mktree < ckpt.out
>    
>    voila :)
>    
> To do all this, you'll need:
>
> The git tree tracking v18, branch 'ckpt-v18' (and past versions):
> 	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git
>
> The userspace tools are available through the matching branch [v18]:
> 	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
>
>
> Changelog:
>
> [2009-Sep-22] v18
>
>   (new features)
>   - [Nathan Lynch] Re-introduce powerpc support
>   - Save/restore pseudo-terminals
>   - Save/restore (pty) controlling terminals
>   - Save/restore PGIDs
>   - [Dan Smith] Save/restore unix domain sockets
>   - Save/restore FIFOs
>   - Save/restore pending signals
>   - Save/restore rlimits
>   - Save/restore itimers
>   - [Matt Helsley] Handle many non-pseudo file-systems
>
>   (other changes)
>   - Rename headerless struct ckpt_hdr_* to struct ckpt_*
>   - [Nathan Lynch] discard const from struct cred * where appropriate
>   - [Serge Hallyn][s390] Set return value for self-checkpoint 
>   - Handle kmalloc failure in restore_sem_array()
>   - [IPC] Collect files used by shm objects
>   - [IPC] Use file (not inode) as shared object on checkpoint of shm
>   - More ckpt_write_err()s to give information on checkpoint failure
>   - Adjust format of pipe buffer to include the mandatory pre-header
>   - [LEAKS] Mark the backing file as visited at checkpoint
>   - Tighten checks on supported vma to checkpoint or restart
>   - [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
>   - Introduce ckpt_collect_file() that also uses file->collect method
>   - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
>   - Fix leak-detection issue in collect_mm() (test for first-time obj)
>   - Invoke set_close_on_exec() unconditionally on restart
>   - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
>   - Interface to pass simple pointers as data with deferqueue
>   - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
>   - Replace EAGAIN with EBUSY where necessary
>   - Introduce CKPT_OBJ_VISITED in leak detection
>   - ckpt_obj_collect() returns objref for new objects, 0 otherwise
>   - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
>   - Introduce ckpt_obj_visit() to mark objects as visited
>   - Set the CHECKPOINTED flag on objects before calling checkpoint
>   - Introduce ckpt_obj_reserve()
>   - Change ref_drop() to accept a @lastref argument (for cleanup)
>   - Disallow multiple objects with same objref in restart
>   - Allow _ckpt_read_obj_type() to read header only (w/o payload)
>   - Fix leak of ckpt_ctx when restoring zombie tasks
>   - Fix race of prepare_descendant() with an ongoing fork()
>   - Track and report the first error if restart fails
>   - Tighten logic to protect against bogus pids in input
>   - [Matt Helsley] Improve debug output from ckpt_notify_error()
>   - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
>   - Detect error-headers in input data on restart, and abort.
>   - Standard format for checkpoint error strings (and documentation)
>   - [Dan Smith] Add an errno validation function
>   - Add ckpt_read_payload(): read a variable-length object (no header)
>   - Add ckpt_read_string(): same for strings (ensures null-terminated)
>   - Add ckpt_read_consume(): consumes next object without processing
>   - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
>
> [2009-Jul-21] v17
>   - Introduce syscall clone_with_pids() to restore original pids
>   - Support threads and zombies
>   - Save/restore task->files
>   - Save/restore task->sighand
>   - Save/restore futex
>   - Save/restore credentials
>   - Introduce PF_RESTARTING to skip notifications on task exit
>   - restart(2) allow caller to ask to freeze tasks after restart
>   - restart(2) isn't idempotent: return -EINTR if interrupted
>   - Improve debugging output handling 
>   - Make multi-process restart logic more robust and complete
>   - Correctly select return value for restarting tasks on success
>   - Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
>   - Use CHECKPOINTING state for frozen checkpointed tasks
>   - Fix compilation without CONFIG_CHECKPOINT
>   - Fix compilation with CONFIG_COMPAT
>   - Fix headers includes and exports
>   - Leak detection performed in two steps
>   - Detect "inverse" leaks of objects (dis)appearing unexpectedly
>   - Memory: save/restore mm->{flags,def_flags,saved_auxv}
>   - Memory: only collect sub-objects of mm once (leak detection)
>   - Files: validate f_mode after restore
>   - Namespaces: leak detection for nsproxy sub-components
>   - Namespaces: proper restart from namespace(s) without namespace(s)
>   - Save global constants in header instead of per-object
>   - IPC: replace sys_unshare() with create_ipc_ns()
>   - IPC: restore objects in suitable namespace
>   - IPC: correct behavior under !CONFIG_IPC_NS
>   - UTS: save/restore all fields
>   - UTS: replace sys_unshare() with create_uts_ns()
>   - X86_32: sanitize cpu, debug, and segment registers on restart
>   - cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
>   - cgroup_freezer: add interface to freeze a cgroup (given a task)
>
> [2009-May-27] v16
>   - Privilege checks for IPC checkpoint
>   - Fix error string generation during checkpoint
>   - Use kzalloc for header allocation
>   - Restart blocks are arch-independent
>   - Redo pipe c/r using splice
>   - Fixes to s390 arch
>   - Remove powerpc arch (temporary)
>   - Explicitly restore ->nsproxy
>   - All objects in image are preceded by 'struct ckpt_hdr'
>   - Fix leaks detection (and leaks)
>   - Reorder of patchset
>   - Misc bugs and compilation fixes
>
> [2009-Apr-12] v15
>   - Minor fixes
>
> [2009-Apr-28] v14
>   - Tested against kernel v2.6.30-rc3 on x86_32.
>   - Refactor files checkpoint to use f_ops (file operations)
>   - Refactor mm/vma to use vma_ops
>   - Explicitly handle VDSO vma (and require compat mode)
>   - Added code to c/r restart-blocks (restart timeout related syscalls)
>   - Added code to c/r namespaces: uts, ipc (with Dan Smith)
>   - Added code to c/r sysvipc (shm, msg, sem)
>   - Support for VM_CLONE shared memory
>   - Added resource leak detection for whole-container checkpoint
>   - Added sysctl gauge to allow unprivileged restart/checkpoint
>   - Improve and simplify the code and logic of shared objects
>   - Rework image format: shared objects appear prior to their use
>   - Merge checkpoint and restart functionality into same files
>   - Massive renaming of functions: prefix "ckpt_" for generics,
>     "checkpoint_" for checkpoint, and "restore_" for restart.
>   - Report checkpoint errors as a valid (string record) in the output
>   - Merged PPC architecture (by Nathan Lynch)
>   - Requires updates to userspace tools too.
>   - Misc nits and bug fixes
>
> [2009-Mar-31] v14-rc2
>   - Change along Dave's suggestion to use f_ops->checkpoint() for files
>   - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
>   - Merge support for PPC arch (Nathan Lynch)
>   - Misc cleanups and fixes in response to comments
>
> [2009-Mar-20] v14-rc1:
>   - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
>   - Check whether calls to cr_hbuf_get() succeed or fail.
>   - Fixes for pipe c/r code
>   - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
>   - Refuse non-self checkpoint if a task isn't frozen
>   - Use unsigned fields in checkpoint headers unless otherwise required
>   - Rename functions in files c/r to better reflect their role
>   - Add support for anonymous shared memory
>   - Merge support for s390 arch (Dan Smith, Serge Hallyn)
>     
> [2008-Dec-03] v13:
>   - Cleanups of 'struct cr_ctx' - remove unused fields
>   - Misc fixes for comments
>   
> [2008-Dec-17] v12:
>   - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
>     (empty pgarr are saved in a separate pool chain)
>   - Add a couple of missed calls to cr_hbuf_put()
>   - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
>   - Split cr_write/cr_read() to two parts: _cr_write/read() helper
>   - Befriend with sparse: explicit conversion to 'void __user *'
>   - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()
>
> [2008-Dec-05] v11:
>   - Use contents of 'init->fs->root' instead of pointing to it
>   - Ignore symlinks (there is no such thing as an open symlink)
>   - cr_scan_fds() retries from scratch if it hits size limits
>   - Add missing test for VM_MAYSHARE when dumping memory
>   - Improve documentation about: behavior when tasks aren't frozen,
>     life span of the object hash, references to objects in the hash
>
> [2008-Nov-26] v10:
>   - Grab vfs root of container init, rather than current process
>   - Acquire dcache_lock around call to __d_path() in cr_fill_name()
>   - Force end-of-string in cr_read_string() (fix possible DoS)
>   - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
>
> [2008-Nov-10] v9:
>   - Support multiple processes c/r
>   - Extend checkpoint header with architecture-dependent header
>   - Misc bug fixes (see individual changelogs)
>   - Rebase to v2.6.28-rc3.
>
> [2008-Oct-29] v8:
>   - Support "external" checkpoint
>   - Include Dave Hansen's 'deny-checkpoint' patch
>   - Split docs in Documentation/checkpoint/..., and improve contents
>
> [2008-Oct-17] v7:
>   - Fix save/restore state of FPU
>   - Fix argument given to kunmap_atomic() in memory dump/restore
>
> [2008-Oct-07] v6:
>   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
>     (even though it's not really needed)
>   - Add assumptions and what's-missing to documentation
>   - Misc fixes and cleanups
>
> [2008-Sep-11] v5:
>   - Config is now 'def_bool n' by default
>   - Improve memory dump/restore code (following Dave Hansen's comments)
>   - Change dump format (and code) to allow chunks of <vaddrs, pages>
>     instead of one long list of each
>   - Fix use of follow_page() to avoid faulting in non-present pages
>   - Memory restore now maps user pages explicitly to copy data into them,
>     instead of reading directly to user space; got rid of mprotect_fixup()
>   - Remove preempt_disable() when restoring debug registers
>   - Rename headers files s/ckpt/checkpoint/
>   - Fix misc bugs in files dump/restore
>   - Fixes and cleanups on some error paths
>   - Fix misc coding style
>
> [2008-Sep-09] v4:
>   - Various fixes and clean-ups
>   - Fix calculation of hash table size
>   - Fix header structure alignment
>   - Use standard list_... for cr_pgarr
>
> [2008-Aug-29] v3:
>   - Various fixes and clean-ups
>   - Use standard hlist_... for hash table
>   - Better use of standard kmalloc/kfree
>
> [2008-Aug-20] v2:
>   - Added Dump and restore of open files (regular and directories)
>   - Added basic handling of shared objects, and improve handling of
>     'parent tag' concept
>   - Added documentation
>   - Improved ABI, 64bit padding for image data
>   - Improved locking when saving/restoring memory
>   - Added UTS information to header (release, version, machine)
>   - Cleanup extraction of filename from a file pointer
>   - Refactor to allow easier reviewing
>   - Remove requirement for CAP_SYS_ADMIN until we come up with a
>     security policy (this means that file restore may fail)
>   - Other cleanup and response to comments for v1
>
> [2008-Jul-29] v1:
>   - Initial version: support a single task with address space of only
>     private anonymous or file-mapped VMAs; syscalls ignore pid/crid
>     argument and act on current process.
>
> --
> At the containers mini-conference before OLS, the consensus among
> all the stakeholders was that doing checkpoint/restart in the kernel
> as much as possible was the best approach.  With this approach, the
> kernel will export a relatively opaque 'blob' of data to userspace
> which can then be handed to the new kernel at restore time.
>
> This is different than what had been proposed before, which was
> that a userspace application would be responsible for collecting
> all of this data.  We were also planning on adding lots of new,
> little kernel interfaces for all of the things that needed
> checkpointing.  This unites those into a single, grand interface.
>
> The 'blob' will contain copies of select portions of kernel
> structures such as vmas and mm_structs.  It will also contain
> copies of the actual memory that the process uses.  Any changes
> in this blob's format between kernel revisions can be handled by
> an in-userspace conversion program.
>
> This is a similar approach to virtually all of the commercial
> checkpoint/restart products out there, as well as the research
> project Zap.
>
> These patches basically serialize internal kernel state and write
> it out to a file descriptor.  The checkpoint and restore are done
> with two new system calls: sys_checkpoint and sys_restart.
>
> In this incarnation, they can only checkpoint and restore a
> single task. The task's address space may consist of only private,
> simple vma's - anonymous or file-mapped. The open files may consist
> of only simple files and directories.
> --
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
>   


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
@ 2009-09-24 13:05   ` Rishikesh
  0 siblings, 0 replies; 327+ messages in thread
From: Rishikesh @ 2009-09-24 13:05 UTC (permalink / raw)
  To: Oren Laadan; +Cc: Andrew Morton, linux-api, containers, linux-kernel, linux-mm

Hi Oren,

I am getting the following build error while compiling the linux-cr kernel.

git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

...
76569 net/unix/af_unix.c:528: error: ‘unix_collect’ undeclared here (not in a function)
76570 LD [M] drivers/net/enic/enic.o
76571 make[2]: *** [net/unix/af_unix.o] Error 1
76572 make[1]: *** [net/unix] Error 2
76573 make: *** [net] Error 2
76574 make: *** Waiting for unfinished jobs....
...

Let me know if you need the config file.

-Rishi

Oren Laadan wrote:
> Hi Andrew,
>
> This is our recent round of checkpoint/restart patches. It can
> checkpoint and restart interactive sessions of 'screen' across 
> kernel reboot. Please consider applying to -mm.
>
> Patches 1-17 are clean-ups and preparations for c/r:
>  * 1,2,3,4 and 9,10: cleanups, also useful for c/r.
>  * 5,6: fix freezer control group
>  * 7,8: extend freezer control group for c/r.
>  * 11-17: clone_with_pid
>
> Patch 18 reserves the system calls slots - please apply so we
> don't need to keep changing them.
>
> Patches 19-80 contain the actual c/r code; we've exhausted the
> reviewers for most of them.
>
> Patch 32 implements a deferqueue - mechanism for a process to
> defer work for some later time (unlike workqueue, designed for
> the work to execute in the context of same/original process).
>
> Thanks,
>
> Oren.
>
> ----
>
> Application checkpoint/restart (c/r) is the ability to save the state
> of a running application so that it can later resume its execution
> from the time at which it was checkpointed, on the same or a different
> machine.
>
> This version adds many new features, including support for
> unix domain sockets, fifos, pseudo-terminals, and signals (see the
> detailed changelog below).
>
> With these in place, it can now checkpoint and restart not only batch
> jobs, but also interactive programs using 'screen'. For example, users
> can checkpoint a 'screen' session with multiple shells, upgrade their
> kernel, reboot, and restart their interactive 'screen' session from
> before !
>
> This patchset was compiled and tested against v2.6.31. For more
> information, check out Documentation/checkpoint/*.txt
>
> Q: How useful is this code as it stands in real-world usage?
> A: The application can consist of a single process or of multiple
>    processes and threads. It handles open files (regular
>    files/directories on most file systems, pipes, fifos, af_unix
>    sockets, /dev/{null,zero,random,urandom} and pseudo-terminals).
>    It supports shared memory and sysv IPC (except undo of
>    semaphores). It's suitable for many types of batch jobs as well
>    as some interactive jobs. (Note: it is assumed that the fs view is
>    available at restart.)
>
> Q: What can it checkpoint and restart ?
> A: A (single threaded) process can checkpoint itself, aka "self"
>    checkpoint, if it calls the new system calls. Otherwise, for an
>    "external" checkpoint, the caller must first freeze the target
>    processes. One can either checkpoint an entire container (and
>    we make best effort to ensure that the result is self-contained),
>    or merely a subtree of a process hierarchy.
>
> Q: What about namespaces ?
> A: Currently, UTS and IPC namespaces are restored. They demonstrate
>    how namespaces are handled. More to come.
>
> Q: What additional work needs to be done to it?
> A: Fill in the gory details following the examples so far. Current WIP
>    includes inet sockets, event-poll, and early work on inotify, mount
>    namespace and mount-points, pseudo file systems, and x86_64 support.
>    
> Q: How can I try it ?
> A: Use it for simple batch jobs (pipes, too), or an interactive
>    'screen' session, in a whole container or just a subtree of
>    tasks:
>
>    create the freezer cgroup:
>      $ mount -t cgroup -ofreezer freezer /cgroup
>      $ mkdir /cgroup/0
>    
>    run the test, freeze it:  
>      $ test/multitask &
>      [1] 2754
>      $ for i in `pidof multitask`; do echo $i > /cgroup/0/tasks; done
>      $ echo FROZEN > /cgroup/0/freezer.state
>    
>    checkpoint:
>      $ ./ckpt 2754 > ckpt.out
>    
>    restart:
>      $ ./mktree < ckpt.out
>    
>    voila :)
>    
> To do all this, you'll need:
>
> The git tree tracking v18, branch 'ckpt-v18' (and past versions):
> 	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git
>
> The userspace tools are available through the matching branch [v18]:
> 	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
>
>
> Changelog:
>
> [2009-Sep-22] v18
>
>   (new features)
>   - [Nathan Lynch] Re-introduce powerpc support
>   - Save/restore pseudo-terminals
>   - Save/restore (pty) controlling terminals
>   - Save/restore PGIDs
>   - [Dan Smith] Save/restore unix domain sockets
>   - Save/restore FIFOs
>   - Save/restore pending signals
>   - Save/restore rlimits
>   - Save/restore itimers
>   - [Matt Helsley] Handle many non-pseudo file-systems
>
>   (other changes)
>   - Rename headerless struct ckpt_hdr_* to struct ckpt_*
>   - [Nathan Lynch] discard const from struct cred * where appropriate
>   - [Serge Hallyn][s390] Set return value for self-checkpoint 
>   - Handle kmalloc failure in restore_sem_array()
>   - [IPC] Collect files used by shm objects
>   - [IPC] Use file (not inode) as shared object on checkpoint of shm
>   - More ckpt_write_err()s to give information on checkpoint failure
>   - Adjust format of pipe buffer to include the mandatory pre-header
>   - [LEAKS] Mark the backing file as visited at checkpoint
>   - Tighten checks on supported vma to checkpoint or restart
>   - [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
>   - Introduce ckpt_collect_file() that also uses file->collect method
>   - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
>   - Fix leak-detection issue in collect_mm() (test for first-time obj)
>   - Invoke set_close_on_exec() unconditionally on restart
>   - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
>   - Interface to pass simple pointers as data with deferqueue
>   - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
>   - Replace EAGAIN with EBUSY where necessary
>   - Introduce CKPT_OBJ_VISITED in leak detection
>   - ckpt_obj_collect() returns objref for new objects, 0 otherwise
>   - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
>   - Introduce ckpt_obj_visit() to mark objects as visited
>   - Set the CHECKPOINTED flag on objects before calling checkpoint
>   - Introduce ckpt_obj_reserve()
>   - Change ref_drop() to accept a @lastref argument (for cleanup)
>   - Disallow multiple objects with same objref in restart
>   - Allow _ckpt_read_obj_type() to read header only (w/o payload)
>   - Fix leak of ckpt_ctx when restoring zombie tasks
>   - Fix race of prepare_descendant() with an ongoing fork()
>   - Track and report the first error if restart fails
>   - Tighten logic to protect against bogus pids in input
>   - [Matt Helsley] Improve debug output from ckpt_notify_error()
>   - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
>   - Detect error-headers in input data on restart, and abort.
>   - Standard format for checkpoint error strings (and documentation)
>   - [Dan Smith] Add an errno validation function
>   - Add ckpt_read_payload(): read a variable-length object (no header)
>   - Add ckpt_read_string(): same for strings (ensures null-terminated)
>   - Add ckpt_read_consume(): consumes next object without processing
>   - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
>
> [2009-Jul-21] v17
>   - Introduce syscall clone_with_pids() to restore original pids
>   - Support threads and zombies
>   - Save/restore task->files
>   - Save/restore task->sighand
>   - Save/restore futex
>   - Save/restore credentials
>   - Introduce PF_RESTARTING to skip notifications on task exit
>   - restart(2) allow caller to ask to freeze tasks after restart
>   - restart(2) isn't idempotent: return -EINTR if interrupted
>   - Improve debugging output handling 
>   - Make multi-process restart logic more robust and complete
>   - Correctly select return value for restarting tasks on success
>   - Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
>   - Use CHECKPOINTING state for frozen checkpointed tasks
>   - Fix compilation without CONFIG_CHECKPOINT
>   - Fix compilation with CONFIG_COMPAT
>   - Fix headers includes and exports
>   - Leak detection performed in two steps
>   - Detect "inverse" leaks of objects (dis)appearing unexpectedly
>   - Memory: save/restore mm->{flags,def_flags,saved_auxv}
>   - Memory: only collect sub-objects of mm once (leak detection)
>   - Files: validate f_mode after restore
>   - Namespaces: leak detection for nsproxy sub-components
>   - Namespaces: proper restart from namespace(s) without namespace(s)
>   - Save global constants in header instead of per-object
>   - IPC: replace sys_unshare() with create_ipc_ns()
>   - IPC: restore objects in suitable namespace
>   - IPC: correct behavior under !CONFIG_IPC_NS
>   - UTS: save/restore all fields
>   - UTS: replace sys_unshare() with create_uts_ns()
>   - X86_32: sanitize cpu, debug, and segment registers on restart
>   - cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
>   - cgroup_freezer: add interface to freeze a cgroup (given a task)
>
> [2009-May-27] v16
>   - Privilege checks for IPC checkpoint
>   - Fix error string generation during checkpoint
>   - Use kzalloc for header allocation
>   - Restart blocks are arch-independent
>   - Redo pipe c/r using splice
>   - Fixes to s390 arch
>   - Remove powerpc arch (temporary)
>   - Explicitly restore ->nsproxy
>   - All objects in image are preceded by 'struct ckpt_hdr'
>   - Fix leak detection (and leaks)
>   - Reorder patchset
>   - Misc bugs and compilation fixes
>
> [2009-Apr-12] v15
>   - Minor fixes
>
> [2009-Apr-28] v14
>   - Tested against kernel v2.6.30-rc3 on x86_32.
>   - Refactor files checkpoint to use f_ops (file operations)
>   - Refactor mm/vma to use vma_ops
>   - Explicitly handle VDSO vma (and require compat mode)
>   - Added code to c/r restart-blocks (restart timeout related syscalls)
>   - Added code to c/r namespaces: uts, ipc (with Dan Smith)
>   - Added code to c/r sysvipc (shm, msg, sem)
>   - Support for VM_CLONE shared memory
>   - Added resource leak detection for whole-container checkpoint
>   - Added sysctl gauge to allow unprivileged restart/checkpoint
>   - Improve and simplify the code and logic of shared objects
>   - Rework image format: shared objects appear prior to their use
>   - Merge checkpoint and restart functionality into same files
>   - Massive renaming of functions: prefix "ckpt_" for generics,
>     "checkpoint_" for checkpoint, and "restore_" for restart.
>   - Report checkpoint errors as a valid (string record) in the output
>   - Merged PPC architecture (by Nathan Lynch)
>   - Requires updates to userspace tools too.
>   - Misc nits and bug fixes
>
> [2009-Mar-31] v14-rc2
>   - Change along Dave's suggestion to use f_ops->checkpoint() for files
>   - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
>   - Merge support for PPC arch (Nathan Lynch)
>   - Misc cleanups and fixes in response to comments
>
> [2009-Mar-20] v14-rc1:
>   - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
>   - Check whether calls to cr_hbuf_get() succeed or fail.
>   - Fixes to pipe c/r code
>   - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
>   - Refuse non-self checkpoint if a task isn't frozen
>   - Use unsigned fields in checkpoint headers unless otherwise required
>   - Rename functions in files c/r to better reflect their role
>   - Add support for anonymous shared memory
>   - Merge support for s390 arch (Dan Smith, Serge Hallyn)
>     
> [2008-Dec-03] v13:
>   - Cleanups of 'struct cr_ctx' - remove unused fields
>   - Misc fixes for comments
>   
> [2008-Dec-17] v12:
>   - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
>     (empty pgarr are saved in a separate pool chain)
>   - Add a couple of missed calls to cr_hbuf_put()
>   - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
>   - Split cr_write/cr_read() to two parts: _cr_write/read() helper
>   - Befriend with sparse: explicit conversion to 'void __user *'
>   - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()
>
> [2008-Dec-05] v11:
>   - Use contents of 'init->fs->root' instead of pointing to it
>   - Ignore symlinks (there is no such thing as an open symlink)
>   - cr_scan_fds() retries from scratch if it hits size limits
>   - Add missing test for VM_MAYSHARE when dumping memory
>   - Improve documentation about: behavior when tasks aren't frozen,
>     life span of the object hash, references to objects in the hash
>
> [2008-Nov-26] v10:
>   - Grab vfs root of container init, rather than current process
>   - Acquire dcache_lock around call to __d_path() in cr_fill_name()
>   - Force end-of-string in cr_read_string() (fix possible DoS)
>   - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
>
> [2008-Nov-10] v9:
>   - Support multiple processes c/r
>   - Extend checkpoint header with architecture-dependent header
>   - Misc bug fixes (see individual changelogs)
>   - Rebase to v2.6.28-rc3.
>
> [2008-Oct-29] v8:
>   - Support "external" checkpoint
>   - Include Dave Hansen's 'deny-checkpoint' patch
>   - Split docs in Documentation/checkpoint/..., and improve contents
>
> [2008-Oct-17] v7:
>   - Fix save/restore state of FPU
>   - Fix argument given to kunmap_atomic() in memory dump/restore
>
> [2008-Oct-07] v6:
>   - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
>     (even though it's not really needed)
>   - Add assumptions and what's-missing to documentation
>   - Misc fixes and cleanups
>
> [2008-Sep-11] v5:
>   - Config is now 'def_bool n' by default
>   - Improve memory dump/restore code (following Dave Hansen's comments)
>   - Change dump format (and code) to allow chunks of <vaddrs, pages>
>     instead of one long list of each
>   - Fix use of follow_page() to avoid faulting in non-present pages
>   - Memory restore now maps user pages explicitly to copy data into them,
>     instead of reading directly to user space; got rid of mprotect_fixup()
>   - Remove preempt_disable() when restoring debug registers
>   - Rename headers files s/ckpt/checkpoint/
>   - Fix misc bugs in files dump/restore
>   - Fixes and cleanups on some error paths
>   - Fix misc coding style
>
> [2008-Sep-09] v4:
>   - Various fixes and clean-ups
>   - Fix calculation of hash table size
>   - Fix header structure alignment
>   - Use standard list_... for cr_pgarr
>
> [2008-Aug-29] v3:
>   - Various fixes and clean-ups
>   - Use standard hlist_... for hash table
>   - Better use of standard kmalloc/kfree
>
> [2008-Aug-20] v2:
>   - Added Dump and restore of open files (regular and directories)
>   - Added basic handling of shared objects, and improve handling of
>     'parent tag' concept
>   - Added documentation
>   - Improved ABI, 64bit padding for image data
>   - Improved locking when saving/restoring memory
>   - Added UTS information to header (release, version, machine)
>   - Cleanup extraction of filename from a file pointer
>   - Refactor to allow easier reviewing
>   - Remove requirement for CAP_SYS_ADMIN until we come up with a
>     security policy (this means that file restore may fail)
>   - Other cleanup and response to comments for v1
>
> [2008-Jul-29] v1:
>   - Initial version: support a single task with address space of only
>     private anonymous or file-mapped VMAs; syscalls ignore pid/crid
>     argument and act on current process.
>
> --
> At the containers mini-conference before OLS, the consensus among
> all the stakeholders was that doing checkpoint/restart in the kernel
> as much as possible was the best approach.  With this approach, the
> kernel will export a relatively opaque 'blob' of data to userspace
> which can then be handed to the new kernel at restore time.
>
> This is different than what had been proposed before, which was
> that a userspace application would be responsible for collecting
> all of this data.  We were also planning on adding lots of new,
> little kernel interfaces for all of the things that needed
> checkpointing.  This unites those into a single, grand interface.
>
> The 'blob' will contain copies of select portions of kernel
> structures such as vmas and mm_structs.  It will also contain
> copies of the actual memory that the process uses.  Any changes
> in this blob's format between kernel revisions can be handled by
> an in-userspace conversion program.
>
> This is a similar approach to virtually all of the commercial
> checkpoint/restart products out there, as well as the research
> project Zap.
>
> These patches basically serialize internal kernel state and write
> it out to a file descriptor.  The checkpoint and restore are done
> with two new system calls: sys_checkpoint and sys_restart.
>
> In this incarnation, they can only checkpoint and restore a
> single task. The task's address space may consist of only private,
> simple vma's - anonymous or file-mapped. The open files may consist
> of only simple files and directories.
> --
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
>   

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH v18 20/80] c/r: basic infrastructure for checkpoint/restart
       [not found]   ` <1253749920-18673-21-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-09-24 16:03     ` Daniel Walker
  0 siblings, 0 replies; 327+ messages in thread
From: Daniel Walker @ 2009-09-24 16:03 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Pavel Emelyanov

On Wed, 2009-09-23 at 19:51 -0400, Oren Laadan wrote:
> /
> +static char *__ckpt_generate_fmt(struct ckpt_ctx *ctx, char *prefmt, char *fmt)
> +{
> +	static int warn_notask = 0;
> +	static int warn_prefmt = 0;

Shouldn't need the initializer since it's static..
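
For reference, a minimal userspace sketch of the point above: C gives every
object with static storage duration an initial value of zero, so "= 0" adds
nothing. The variable names are borrowed from the quoted hunk;
bump_warn_prefmt() and demo_static_zero_init() are hypothetical helpers that
exist only for this illustration.

```c
#include <assert.h>

/* File-scope static with no initializer: starts at 0. */
static int warn_notask;

/* Returns the number of calls so far, including this one. */
static int bump_warn_prefmt(void)
{
	/* Function-scope static with no initializer: also starts at 0. */
	static int warn_prefmt;

	return ++warn_prefmt;
}

/* Self-check: both counters behave as if they had been written "= 0". */
void demo_static_zero_init(void)
{
	assert(warn_notask == 0);
	assert(bump_warn_prefmt() == 1);
	assert(bump_warn_prefmt() == 2);
}
```

Calling demo_static_zero_init() once from a test program is enough to
exercise both cases.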


> +/* read the checkpoint header */
> +static int restore_read_header(struct ckpt_ctx *ctx)
> +{
> +	struct ckpt_hdr_header *h;
> +	struct new_utsname *uts = NULL;
> +	int ret;
> +
> +	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
> +	if (IS_ERR(h))
> +		return PTR_ERR(h);
> +
> +	ret = -EINVAL;
> +	if (h->magic != CHECKPOINT_MAGIC_HEAD ||
> +	    h->rev != CHECKPOINT_VERSION ||
> +	    h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
> +	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
> +	    h->patch != ((LINUX_VERSION_CODE) & 0xff))
> +		goto out;

Do you still need this LINUX_VERSION_CODE stuff ? I would think once
it's in mainline you wouldn't need to track that..

These both got flagged by checkpatch .. Your series is marked in a
couple other places with checkpatch errors .. If you haven't already
reviewed those errors, it would be a good idea to review them.

Daniel

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
  2009-09-24 13:05   ` Rishikesh
  (?)
  (?)
@ 2009-09-24 16:27   ` Dan Smith
  -1 siblings, 0 replies; 327+ messages in thread
From: Dan Smith @ 2009-09-24 16:27 UTC (permalink / raw)
  To: Rishikesh
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton

R> I am getting following build error while compiling linux-cr kernel.

With CONFIG_CHECKPOINT=n, right?

R> 76569 net/unix/af_unix.c:528: error: ‘unix_collect’ undeclared here (not 
R> in a function)

Try the patch below.

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index e42a714..ee423d1 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -80,6 +80,7 @@ extern int unix_collect(struct ckpt_ctx *ctx, struct socket *sock);
 #else
 #define unix_checkpoint NULL
 #define unix_restore NULL
+#define unix_collect NULL
 #endif /* CONFIG_CHECKPOINT */
 
 #endif
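
A small standalone sketch of why the one-line stub is enough, assuming (as the
"undeclared here (not in a function)" error suggests) that the symbol is
referenced from a file-scope ops table. Every example_* name is hypothetical
and not taken from the patchset; the -1 return is a placeholder error value.

```c
#include <assert.h>
#include <stddef.h>

struct example_ctx;
struct example_sock;

/* EXAMPLE_CONFIG_CHECKPOINT is deliberately left undefined here, which
 * mirrors a CONFIG_CHECKPOINT=n build. */
#ifdef EXAMPLE_CONFIG_CHECKPOINT
int example_unix_collect(struct example_ctx *ctx, struct example_sock *sock);
#else
#define example_unix_collect NULL	/* stub keeps the table below compiling */
#endif

struct example_proto_ops {
	int (*collect)(struct example_ctx *ctx, struct example_sock *sock);
};

/* Without the stub, this initializer names an undeclared identifier. */
static const struct example_proto_ops example_unix_ops = {
	.collect = example_unix_collect,
};

/* Callers must tolerate a NULL hook when the feature is compiled out. */
static int example_do_collect(const struct example_proto_ops *ops,
			      struct example_ctx *ctx,
			      struct example_sock *sock)
{
	if (!ops->collect)
		return -1;	/* placeholder: feature not built in */
	return ops->collect(ctx, sock);
}

/* Self-check for the compiled-out configuration. */
void example_stub_self_check(void)
{
	assert(example_unix_ops.collect == NULL);
	assert(example_do_collect(&example_unix_ops, NULL, NULL) == -1);
}
```

The stub only helps if every caller checks the hook for NULL before calling
it, which is what example_do_collect() does in this sketch.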

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* Re: [PATCH v18 20/80] c/r: basic infrastructure for checkpoint/restart
  2009-09-24 16:03     ` Daniel Walker
                       ` (2 preceding siblings ...)
  (?)
@ 2009-09-24 17:33     ` Oren Laadan
  -1 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-24 17:33 UTC (permalink / raw)
  To: Daniel Walker
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Pavel Emelyanov



Daniel Walker wrote:
> On Wed, 2009-09-23 at 19:51 -0400, Oren Laadan wrote:
>> /
>> +static char *__ckpt_generate_fmt(struct ckpt_ctx *ctx, char *prefmt, char *fmt)
>> +{
>> +	static int warn_notask = 0;
>> +	static int warn_prefmt = 0;
> 
> Shouldn't need the initializer since it's static..
> 

Yup ...

> 
>> +/* read the checkpoint header */
>> +static int restore_read_header(struct ckpt_ctx *ctx)
>> +{
>> +	struct ckpt_hdr_header *h;
>> +	struct new_utsname *uts = NULL;
>> +	int ret;
>> +
>> +	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
>> +	if (IS_ERR(h))
>> +		return PTR_ERR(h);
>> +
>> +	ret = -EINVAL;
>> +	if (h->magic != CHECKPOINT_MAGIC_HEAD ||
>> +	    h->rev != CHECKPOINT_VERSION ||
>> +	    h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
>> +	    h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
>> +	    h->patch != ((LINUX_VERSION_CODE) & 0xff))
>> +		goto out;
> 
> Do you still need this LINUX_VERSION_CODE stuff ? I would think once
> it's in mainline you wouldn't need to track that..

In short: yes.

This is our poor man's way of recording the kernel version on which
a given checkpoint image was generated.

The image format is a blob that may change between kernel releases.
Conversion between version formats will be done by userspace tools.
Tagging the image with the version of the kernel serves two goals:
it indicates the image version to the kernel at restart time, and
it will be used by userspace conversion tools.

How the kernel and image versions are encoded is WIP and is likely
to become more comprehensive soon.
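
As a rough userspace illustration of the tagging described above (not the
code from the patch): the version code packs major, minor and patch into one
integer, the checkpoint side splits it into header fields, and the restart
side compares them against the running kernel. The macro follows the same
arithmetic as the kernel's KERNEL_VERSION(); struct example_hdr, its field
widths, and the example_* helpers are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Same packing as KERNEL_VERSION(a, b, c): one byte per component. */
#define EXAMPLE_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical image header carrying only the version fields. */
struct example_hdr {
	uint16_t major;
	uint16_t minor;
	uint16_t patch;
};

/* Checkpoint side: record the version of the kernel writing the image. */
static void example_fill_version(struct example_hdr *h, uint32_t code)
{
	h->major = (code >> 16) & 0xff;
	h->minor = (code >> 8) & 0xff;
	h->patch = code & 0xff;
}

/* Restart side: accept the image only if all three fields match. */
static int example_version_matches(const struct example_hdr *h, uint32_t code)
{
	return h->major == ((code >> 16) & 0xff) &&
	       h->minor == ((code >> 8) & 0xff) &&
	       h->patch == (code & 0xff);
}

/* Self-check using the v2.6.31 base named in the cover letter. */
void example_version_self_check(void)
{
	struct example_hdr h;

	example_fill_version(&h, EXAMPLE_KERNEL_VERSION(2, 6, 31));
	assert(example_version_matches(&h, EXAMPLE_KERNEL_VERSION(2, 6, 31)));
	assert(!example_version_matches(&h, EXAMPLE_KERNEL_VERSION(2, 6, 32)));
}
```

With an exact-match check like this, an image taken on 2.6.31 is refused by
any other kernel version unless a userspace tool converts it first.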

> 
> These both got flagged by checkpatch .. Your series is marked in a
> couple other places with checkpatch errors .. If you haven't already
> reviewed those errors, it would be a good idea to review them.
> 

Sure, will re-review to remove remaining errors that sneaked in.

Thanks,

Oren.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]     ` <878wg41f65.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2009-09-24 18:20       ` Rishikesh
  0 siblings, 0 replies; 327+ messages in thread
From: Rishikesh @ 2009-09-24 18:20 UTC (permalink / raw)
  To: Dan Smith
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton

Dan Smith wrote:
> R> I am getting following build error while compiling linux-cr kernel.
>
> With CONFIG_CHECKPOINT=n, right?
>
> R> 76569 net/unix/af_unix.c:528: error: ‘unix_collect’ undeclared here (not 
> R> in a function)
>
> Try the patch below.
>   

Yes, the attached patch solves the problem. Thanks, Dan.


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
                     ` (80 preceding siblings ...)
  2009-09-24 13:05   ` [PATCH 00/80] Kernel based checkpoint/restart [v18] Rishikesh
@ 2009-09-24 22:41   ` Andrew Morton
  81 siblings, 0 replies; 327+ messages in thread
From: Andrew Morton @ 2009-09-24 22:41 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	xemul-GEFAQzZX7r8dnm+yROfE0A

On Wed, 23 Sep 2009 19:50:40 -0400
Oren Laadan <orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org> wrote:

> Q: How useful is this code as it stands in real-world usage?
> A: The application can consist of one or more processes and threads.
>    It handles open files (regular files/directories on most file
>    systems, pipes, fifos, af_unix sockets,
>    /dev/{null,zero,random,urandom} and pseudo-terminals). It supports
>    shared memory and sysv IPC (except undo of semaphores). It's
>    suitable for many types of batch jobs as well as some interactive
>    jobs. (Note: it is assumed that the fs view is available at restart.)

That's encouraging.

> Q: What can it checkpoint and restart ?
> A: A (single threaded) process can checkpoint itself, aka "self"
>    checkpoint, if it calls the new system calls. Otherwise, for an
>    "external" checkpoint, the caller must first freeze the target
>    processes. One can either checkpoint an entire container (and
>    we make best effort to ensure that the result is self-contained),
>    or merely a subtree of a process hierarchy.

What is "best effort"?  Will the operation appear to have succeeded,
only it didn't?

IOW, how reliable and robust is the code at detecting that it was unable
to generate a restartable image?

> Q: What about namespaces ?
> A: Currently, UTS and IPC namespaces are restored. They demonstrate
>    how namespaces are handled. More to come.

Will this new code muck up the kernel?

> Q: What additional work needs to be done to it?
> A: Fill in the gory details following the examples so far. Current WIP
>    includes inet sockets, event-poll, and early work on inotify, mount
>    namespace and mount-points, pseudo file systems

Will this new code muck up the kernel, or will it be clean?

> and x86_64 support.

eh?  You mean the code doesn't work on x86_64 at present?


What is the story on migration?  Moving the process(es) to a different
machine?

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]   ` <20090924154139.2a7dd5ec.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2009-09-25 13:59     ` Dan Smith
  2009-09-28 16:37     ` Serge E. Hallyn
  2009-09-29 18:31     ` Oren Laadan
  2 siblings, 0 replies; 327+ messages in thread
From: Dan Smith @ 2009-09-25 13:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	xemul-GEFAQzZX7r8dnm+yROfE0A

>> Q: What additional work needs to be done to it?  A: Fill in the
>> gory details following the examples so far. Current WIP includes
>> inet sockets, event-poll, and early work on inotify, mount
>> namespace and mount-points, pseudo file systems

AM> Will this new code muck up the kernel, or will it be clean?

I have (and have previously posted) prototype code to do c/r of open
sockets, ignoring some things like updating timers and such.  It looks
rather similar to the existing UNIX bits, and is even easier in some
ways.

One particular use case is only migrating listening sockets and
allowing the connected ones to be reset upon restart.  That enables a
bunch of things like apache, postfix, vncserver, and even sshd.  I
will pull the listen-only bits out of my current patch, scrape off a
little bitrot, and post them in a few days.
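
The listen-only idea can be illustrated from userspace. The following is
only a sketch under stated assumptions, not the prototype patch: the
names listen_record, open_loopback_listener(), save_listener() and
restore_listener() are hypothetical, and the record layout has nothing
to do with the checkpoint image format. It records the bound address of
a listening socket, refuses sockets that are not listening (those would
simply be reset on restart), and later recreates the listener.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical record for one listening socket (not the c/r image format). */
struct listen_record {
	struct sockaddr_in addr;	/* address the socket was bound to */
	int backlog;			/* supplied by the caller; not readable from the fd */
};

/* Create a TCP socket, bind it to addr and put it into the listen state. */
static int bind_and_listen(const struct sockaddr_in *addr, int backlog)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	/* Allow the same address to be bound again right after a restart. */
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	if (bind(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0 ||
	    listen(fd, backlog) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Listen on 127.0.0.1:port; port 0 lets the kernel pick a free port. */
int open_loopback_listener(uint16_t port, int backlog)
{
	struct sockaddr_in addr;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(port);
	return bind_and_listen(&addr, backlog);
}

/* "Checkpoint": save the bound address, but only for listening sockets. */
int save_listener(int fd, int backlog, struct listen_record *rec)
{
	int listening = 0;
	socklen_t len = sizeof(listening);

	if (getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN, &listening, &len) < 0)
		return -1;
	if (!listening)
		return -1;	/* connected sockets are not saved in this sketch */
	len = sizeof(rec->addr);
	if (getsockname(fd, (struct sockaddr *)&rec->addr, &len) < 0)
		return -1;
	rec->backlog = backlog;
	return 0;
}

/* "Restart": recreate the listener from the saved record. */
int restore_listener(const struct listen_record *rec)
{
	return bind_and_listen(&rec->addr, rec->backlog);
}
```

A caller would run save_listener() on each listening descriptor before
the process is torn down and restore_listener() on each record
afterwards; clients of the old connections see a reset and reconnect.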

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]     ` <87ljk39lcl.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2009-09-26 14:56       ` Ingo Molnar
  0 siblings, 0 replies; 327+ messages in thread
From: Ingo Molnar @ 2009-09-26 14:56 UTC (permalink / raw)
  To: Dan Smith
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	xemul-GEFAQzZX7r8dnm+yROfE0A


* Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:

> >> Q: What additional work needs to be done to it?  A: Fill in the
> >> gory details following the examples so far. Current WIP includes
> >> inet sockets, event-poll, and early work on inotify, mount
> >> namespace and mount-points, pseudo file systems
> 
> AM> Will this new code muck up the kernel, or will it be clean?
> 
> I have (and have previously posted) prototype code to do c/r of open 
> sockets, ignoring some things like updating timers and such.  It looks 
> rather similar to the existing UNIX bits, and is even easier in some 
> ways.
> 
> One particular use case is only migrating listening sockets and 
> allowing the connected ones to be reset upon restart.  That enables a 
> bunch of things like apache, postfix, vncserver, and even sshd.  I 
> will pull the listen-only bits out of my current patch, scrape off a 
> little bitrot, and post them in a few days.

That looks useful. (Btw., the other four questions Andrew asked look 
relevant too.)

	Ingo

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]   ` <20090924154139.2a7dd5ec.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2009-09-25 13:59     ` Dan Smith
@ 2009-09-28 16:37     ` Serge E. Hallyn
  2009-09-29 18:31     ` Oren Laadan
  2 siblings, 0 replies; 327+ messages in thread
From: Serge E. Hallyn @ 2009-09-28 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	xemul-GEFAQzZX7r8dnm+yROfE0A

Quoting Andrew Morton (akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org):
> On Wed, 23 Sep 2009 19:50:40 -0400
> Oren Laadan <orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org> wrote:
> > Q: What about namespaces ?
> > A: Currently, UTS and IPC namespaces are restored. They demonstrate
> >    how namespaces are handled. More to come.
> 
> Will this new code muck up the kernel?

Actually user namespaces are handled as well.  Pid namespaces will
be named and recorded by the kernel at checkpoint, and re-created in
userspace using clone(CLONE_NEWPID).  This shouldn't muck up the
kernel at all.  The handling of network and mount namespaces is
at this point undecided.  Well, mount namespaces themselves are
pretty simple, but mountpoints are not.  There it's mainly
a question of how to predict what a user wants to have automatically
recreated.  All mounts which differ between the root checkpoint task
and its parent?  Do we do no mounts for the restarted init task at
all, and only recreate mounts in private child namespaces (i.e. if a
task did an unshare(CLONE_NEWNS); mount --make-private /var;
mount --bind /container2/var/run /var/run)?
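
For illustration only, this is roughly what re-creating a pid namespace
from userspace looks like. It is a sketch, not the restart tool's code:
spawn_in_new_pidns() is a hypothetical name, the raw clone syscall is
used fork-style (no separate child stack), and the argument order
assumes an x86 or arm64 style ABI. Creating the namespace normally needs
CAP_SYS_ADMIN, so an unprivileged caller should expect the -1 path.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_NEWPID
#define CLONE_NEWPID 0x20000000	/* same value as the kernel's definition */
#endif

/*
 * Start a child as the first task of a new pid namespace.
 * Returns 1 if the child saw itself as pid 1, 0 if it did not, and
 * -1 if the namespace could not be created (e.g. EPERM without
 * CAP_SYS_ADMIN).
 */
int spawn_in_new_pidns(void)
{
	int status;
	long pid;

	/* With a zero stack pointer the raw syscall behaves like fork(). */
	pid = syscall(SYS_clone, (long)(CLONE_NEWPID | SIGCHLD),
		      0L, 0L, 0L, 0L);
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: inside the new namespace it should be pid 1. */
		_exit(syscall(SYS_getpid) == 1 ? 0 : 1);
	}
	if (waitpid((pid_t)pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A restart tool would presumably do its real work in the child branch,
where the first task of the restored container has to end up as pid 1.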

I hear a decision was made at Plumbers about how to begin
handling them, so I'll let someone (Oren? Dave?) give that info.

For network namespaces I think it's clearer that a wrapper
program should set up the network for the restarted init task,
while the userspace code should recreate any private network
namespaces and veths which were created by the application.
But it still needs discussion.

> > Q: What additional work needs to be done to it?
> > A: Fill in the gory details following the examples so far. Current WIP
> >    includes inet sockets, event-poll, and early work on inotify, mount
> >    namespace and mount-points, pseudo file systems
> 
> Will this new code muck up the kernel, or will it be clean?
> 
> > and x86_64 support.
> 
> eh?  You mean the code doesn't work on x86_64 at present?

There have been patches for it, but I think the main problem is that
no one involved has hardware to test on.

> What is the story on migration?  Moving the process(es) to a different
> machine?

Migration is basically: checkpoint; recreate the container on the
remote machine; restart there.  So it will mainly be done by
userspace code built on the c/r kernel patches.

The main thing we may want to add is a way to initiate pre-dump
of large amounts of VM while the container is still running.
I suspect Oren and Dave can say a lot more about that than I can
right now.

thanks,
-serge

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]     ` <20090928163704.GA3327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-09-29 13:29       ` Daniel Lezcano
  0 siblings, 0 replies; 327+ messages in thread
From: Daniel Lezcano @ 2009-09-29 13:29 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-X9Un+BFzKDI,
	xemul-GEFAQzZX7r8dnm+yROfE0A

Serge E. Hallyn wrote:
> Quoting Andrew Morton (akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org):
>   
>> On Wed, 23 Sep 2009 19:50:40 -0400
>> Oren Laadan <orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org> wrote:
>>     
>>> Q: What about namespaces ?
>>> A: Currently, UTS and IPC namespaces are restored. They demonstrate
>>>    how namespaces are handled. More to come.
>>>       
>> Will this new code muck up the kernel?
>>     

[ cut ]
> For network namespaces I think it's clearer that a wrapper
> program should set up the network for the restarted init task,
> while the userspace code should recreate any private network
> namespaces and veths which were created by the application.
> But it still needs discussion.
>   
OK for the restart, but for the checkpoint, how do you access the
network setup from a process which belongs to another namespace context?

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH v18 49/80] c/r: support for UTS namespace
       [not found]   ` <1253749920-18673-50-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-09-29 18:13     ` Nikita V. Youshchenko
  0 siblings, 0 replies; 327+ messages in thread
From: Nikita V. Youshchenko @ 2009-09-29 18:13 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Dan Smith, Andrew Morton,
	Ingo Molnar, Pavel Emelyanov

> +static struct uts_namespace *do_restore_uts_ns(struct ckpt_ctx *ctx)
> ...
> +#ifdef CONFIG_UTS_NS
> +	uts_ns = create_uts_ns();
> +	if (!uts_ns) {
> +		uts_ns = ERR_PTR(-ENOMEM);
> +		goto out;
> +	}
> +	down_read(&uts_sem);
> +	name = &uts_ns->name;
> +	memcpy(name->sysname, h->sysname, sizeof(name->sysname));
> +	memcpy(name->nodename, h->nodename, sizeof(name->nodename));
> +	memcpy(name->release, h->release, sizeof(name->release));
> +	memcpy(name->version, h->version, sizeof(name->version));
> +	memcpy(name->machine, h->machine, sizeof(name->machine));
> +	memcpy(name->domainname, h->domainname, sizeof(name->domainname));
> +	up_read(&uts_sem);

Could you please explain what this down_read() / up_read() is for?
You operate only on local objects: 'name' points to the just-created
uts_ns, and 'h' is also local data.
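
To make the question concrete, here is a userspace analogy of the
pattern. It is not the kernel code: struct uts_name, restore_uts(),
publish_uts() and current_uts() are hypothetical names, and C11 atomics
stand in for whatever the kernel uses to make a namespace visible. It
only illustrates that an object no other thread can reach yet can be
filled in without a lock, with synchronization needed only at the point
of publication; it does not settle whether the locking in the patch is
required.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define UTS_FIELD_LEN 65

/* Hypothetical stand-in for a subset of the kernel's UTS name fields. */
struct uts_name {
	char nodename[UTS_FIELD_LEN];
	char domainname[UTS_FIELD_LEN];
};

/* The only pointer through which other threads could see a namespace. */
static _Atomic(struct uts_name *) visible_ns;

/*
 * Build a new object from the image data 'h'. Until it is published,
 * nothing else holds a pointer to 'ns', so the copies need no lock.
 */
struct uts_name *restore_uts(const struct uts_name *h)
{
	struct uts_name *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	memcpy(ns->nodename, h->nodename, sizeof(ns->nodename));
	memcpy(ns->domainname, h->domainname, sizeof(ns->domainname));
	return ns;
}

/* Publication is the step that needs ordering against readers. */
void publish_uts(struct uts_name *ns)
{
	atomic_store_explicit(&visible_ns, ns, memory_order_release);
}

/* Readers pair the acquire load with the release store above. */
struct uts_name *current_uts(void)
{
	return atomic_load_explicit(&visible_ns, memory_order_acquire);
}
```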

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]   ` <20090924154139.2a7dd5ec.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2009-09-25 13:59     ` Dan Smith
  2009-09-28 16:37     ` Serge E. Hallyn
@ 2009-09-29 18:31     ` Oren Laadan
  2 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-29 18:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	xemul-GEFAQzZX7r8dnm+yROfE0A


Hi,

Andrew Morton wrote:
> On Wed, 23 Sep 2009 19:50:40 -0400
> Oren Laadan <orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org> wrote:
> 
>> Q: How useful is this code as it stands in real-world usage?
>> A: The application can be single- or multi-processes and threads. It
>>    handles open files (regular files/directories on most file systems,
>>    pipes, fifos, af_unix sockets, /dev/{null,zero,random,urandom} and
>>    pseudo-terminals). It supports shared memory and sysv IPC (except undo
>>    of semaphores). It's suitable for many types of batch jobs as well
>>    as some interactive jobs. (Note: it is assumed that the fs view is
>>    available at restart).
> 
> That's encouraging.
> 
>> Q: What can it checkpoint and restart ?
>> A: A (single threaded) process can checkpoint itself, aka "self"
>>    checkpoint, if it calls the new system calls. Otherwise, for an
>>    "external" checkpoint, the caller must first freeze the target
>>    processes. One can either checkpoint an entire container (and
>>    we make best effort to ensure that the result is self-contained),
>>    or merely a subtree of a process hierarchy.
> 
> What is "best effort"?  Will the operation appear to have succeeded,
> only it didn't?

There are two modes of operation: a "full" (container) checkpoint and
a "subtree" checkpoint.

For container-checkpoint, the application runs within its own set
of namespaces (pid, uts, ipc ...). We have a leak-detection algorithm
in place to ensure that the application is entirely contained and
isolated in the container, or else the checkpoint fails. In this case,
restart is guaranteed to succeed (assuming external dependencies are
properly set - see below).
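
One way to picture such a leak check, assuming it works by comparing
reference counts (the algorithm is not spelled out here, so treat that
as an assumption): for every shared object reached from the container's
tasks, count the references found inside the container and compare them
with the object's total reference count. A larger total means someone
outside still holds the object. struct tracked_obj, find_leak() and
is_contained() are hypothetical names used only for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for one shared object (file, mm, namespace, ...). */
struct tracked_obj {
	const char *name;
	int total_refs;		/* references held system-wide */
	int collected_refs;	/* references found while walking the container */
};

/*
 * Return the index of the first object that is also referenced from
 * outside the container, or -1 if every object is fully contained.
 */
int find_leak(const struct tracked_obj *objs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (objs[i].total_refs > objs[i].collected_refs)
			return (int)i;
	}
	return -1;
}

/* A container checkpoint would proceed only when this returns true. */
bool is_contained(const struct tracked_obj *objs, size_t n)
{
	return find_leak(objs, n) < 0;
}
```

With a check of this kind, a file descriptor shared with a process
outside the container would make the checkpoint fail up front rather
than produce an image that cannot be restarted.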

A "subtree" checkpoint is less restrictive. It allows checkpointing
applications that aren't truly isolated. In this case, restart may
fail, or restart may succeed but the application may fail shortly
after because some dependency is missing. However, in practice
this works well for many applications transparently, and even more
so for c/r-aware applications.

In both modes of operation, some external dependencies are assumed
to exist. One example is the file system view that has to be the
same for the container (or subtree) as during checkpoint. Another
is the network setup. These will be reconstructed (at least in part)
by userspace prior to the actual restart.

(Some of these dependencies can be relaxed for many use-cases,
e.g. when user/application doesn't care about preserving original
network connection after restart).

> IOW, how reliable and robust is code at detecting that it was unable to
> successfully generate a restartable image?

To the best of my knowledge (and unless someone pokes a hole in the
algorithm), container-checkpoint is robust and reliable.

Two reasons for saying "best effort": first, because external
dependencies need to be suitably arranged. If they are not
arranged - e.g. by providing a snapshot of the filesystem from
the time of the checkpoint - restart, or execution thereafter, may
fail.

Second, because even when the application is isolated within a
container, the user from outside may still be able to affect it. I
can think of one (and only one) pathological example: an outside
process signals all tasks in a container during a checkpoint, but
the signal is recorded for only some tasks. I'd argue that this
is improper behavior that we need not support, rather than address
it in the kernel.

> 
>> Q: What about namespaces ?
>> A: Currently, UTS and IPC namespaces are restored. They demonstrate
>>    how namespaces are handled. More to come.
> 
> Will this new code muck up the kernel?

I forgot to mention user-ns is already included. We have a good
plan for mount-ns and mount-points, and intend to handle net-ns
in userspace (by that, I mean the net-ns setup, not the state of
connections etc).

I don't expect this additional code to be less solid than the
current one.

> 
>> Q: What additional work needs to be done to it?
>> A: Fill in the gory details following the examples so far. Current WIP
>>    includes inet sockets, event-poll, and early work on inotify, mount
>>    namespace and mount-points, pseudo file systems
> 
> Will this new code muck up the kernel, or will it be clean?

Should not.

I guess "gory details" was a bad choice of words, and I really meant
to say "less interesting". IOW, I think we have constructed a solid
framework. From the kernel side, I think we're left with
completeness - add support for remaining missing features (e.g.
inotify), and goodies - live migration and other optimizations.

> 
>> and x86_64 support.
> 
> eh?  You mean the code doesn't work on x86_64 at present?
> 
> 
> What is the story on migration?  Moving the process(es) to a different
> machine?
> 

It depends on the use-case in mind:

Moving processes between machines without preserving external network
connections (or IP) already works. This can be useful for servers
(e.g. apache, or vnc with user's session), and applications that know
how to recover from, or do not care about, lost connections.

Migrating processes with their external network connections is WIP.

In both cases, there needs to be some userspace glue to ensure that
the same filesystem view is available on the target machine as was on
the origin machine (e.g. remote file system, SAN, or even rsync of
changes).

Also, in both cases, we may want to pre-copy some of the application
state while it is still running, to reduce the application downtime.
This will be added down the road.

Oren.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
@ 2009-09-29 18:31     ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-29 18:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: torvalds, containers, linux-kernel, linux-mm, linux-api, serue,
	mingo, xemul


Hi,

Andrew Morton wrote:
> On Wed, 23 Sep 2009 19:50:40 -0400
> Oren Laadan <orenl@librato.com> wrote:
> 
>> Q: How useful is this code as it stands in real-world usage?
>> A: The application can be single- or multi-processes and threads. It
>>    handles open files (regular files/directories on most file systems,
>>    pipes, fifos, af_unix sockets, /dev/{null,zero,random,urandom} and
>>    pseudo-terminals). It supports shared memory and sysv IPC (except undo
>>    of semaphores). It's suitable for many types of batch jobs as well
>>    as some interactive jobs. (Note: it is assumed that the fs view is
>>    available at restart).
> 
> That's encouraging.
> 
>> Q: What can it checkpoint and restart ?
>> A: A (single threaded) process can checkpoint itself, aka "self"
>>    checkpoint, if it calls the new system calls. Otherwise, for an
>>    "external" checkpoint, the caller must first freeze the target
>>    processes. One can either checkpoint an entire container (and
>>    we make best effort to ensure that the result is self-contained),
>>    or merely a subtree of a process hierarchy.
> 
> What is "best effort"?  Will the operation appear to have succeeded,
> only it didn't?

There are two modes of operation: a "full" (container) checkpoint and
a "subtree" checkpoint.

For container-checkpoint, the application runs within its own set
of namespaces (pid, uts, ipc ...). We have a leak-detection algorithm
in place to ensure that the application is entirely contained and
isolated in the container, or else the checkpoint fails. In this case,
restart is guaranteed to succeed (assuming external dependencies are
properly set - see below).

A "subtree" checkpoint is less restrictive. It allows checkpointing
applications that aren't truly isolated. In this case, restart may
fail, or restart may succeed but the application may fail shortly
after because some dependency is missing. However, in practice
this works well for many applications transparently, and even more
so for c/r-aware applications.

In both modes of operation, some external dependencies are assumed
to exist. One example is the file system view that has to be the
same for the container (or subtree) as during checkpoint. Another
is the network setup. These will be reconstructed (at least in part)
by userspace prior to the actual restart.

(Some of these dependencies can be relaxed for many use-cases,
e.g. when the user/application doesn't care about preserving the
original network connections after restart).

> IOW, how reliable and robust is code at detecting that it was unable to
> successfully generate a restartable image?

To the best of my knowledge (and unless someone pokes a hole in the
algorithm), container-checkpoint is robust and reliable.

Two reasons for saying "best effort": first, because external
dependencies need to be suitably arranged. If they are not
arranged - e.g. by providing a snapshot of the filesystem from
the time of the checkpoint - restart, or execution thereafter, may
fail.

Second, because even when the application is isolated within a
container, the user from outside may still be able to affect it. I
can think of one (and only one) pathological example: an outside
process signals all tasks in a container during a checkpoint, but
the signal is recorded for only some tasks. I'd argue that this
is improper behavior that we need not support, rather than address
it in the kernel.

> 
>> Q: What about namespaces ?
>> A: Currently, UTS and IPC namespaces are restored. They demonstrate
>>    how namespaces are handled. More to come.
> 
> Will this new code muck up the kernel?

I forgot to mention user-ns is already included. We have a good
plan for mount-ns and mount-points, and intend to handle net-ns
in userspace (by that, I mean the net-ns setup, not the state of
connections etc).

I don't expect this additional code to be less solid than the
current one.

> 
>> Q: What additional work needs to be done to it?
>> A: Fill in the gory details following the examples so far. Current WIP
>>    includes inet sockets, event-poll, and early work on inotify, mount
>>    namespace and mount-points, pseudo file systems
> 
> Will this new code muck up the kernel, or will it be clean?

Should not.

I guess "gory details" was a bad choice of words, and I really meant
to say "less interesting". IOW, I think we have constructed a solid
framework. From the kernel side, I think we're left with
completeness - adding support for the remaining missing features (e.g.
inotify), and goodies - live migration and other optimizations.

> 
>> and x86_64 support.
> 
> eh?  You mean the code doesn't work on x86_64 at present?
> 
> 
> What is the story on migration?  Moving the process(es) to a different
> machine?
> 

It depends on the use-case in mind:

Moving processes between machines without preserving external network
connections (or IP) already works. This can be useful for servers
(e.g. apache, or vnc with user's session), and applications that know
how to recover from, or do not care about lost connections.

Migrating processes with their external network connections is WIP.

In both cases, there needs to be some userspace glue to ensure that
the same filesystem view is available on the target machine as was on
the origin machine (e.g. remote file system, SAN, or even rsync of
changes).

Also, in both cases, we may want to pre-copy some of the application
state while it is still running, to reduce the application downtime.
This will be added down the road.

Oren.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH v18 49/80] c/r: support for UTS namespace
       [not found]     ` <200909292213.21266-vlktAfmDCSZSq9BJjBFyUp/QNRX+jHPU@public.gmane.org>
@ 2009-09-29 18:51       ` Serge E. Hallyn
  2009-09-29 23:01       ` Oren Laadan
  1 sibling, 0 replies; 327+ messages in thread
From: Serge E. Hallyn @ 2009-09-29 18:51 UTC (permalink / raw)
  To: Nikita V. Youshchenko
  Cc: linux-api, containers, linux-kernel, linux-mm, Dan Smith,
	Andrew Morton, Ingo Molnar, Pavel Emelyanov

Quoting Nikita V. Youshchenko (yoush@cs.msu.su):
> > +static struct uts_namespace *do_restore_uts_ns(struct ckpt_ctx *ctx)
> > ...
> > +#ifdef CONFIG_UTS_NS
> > +	uts_ns = create_uts_ns();
> > +	if (!uts_ns) {
> > +		uts_ns = ERR_PTR(-ENOMEM);
> > +		goto out;
> > +	}
> > +	down_read(&uts_sem);
> > +	name = &uts_ns->name;
> > +	memcpy(name->sysname, h->sysname, sizeof(name->sysname));
> > +	memcpy(name->nodename, h->nodename, sizeof(name->nodename));
> > +	memcpy(name->release, h->release, sizeof(name->release));
> > +	memcpy(name->version, h->version, sizeof(name->version));
> > +	memcpy(name->machine, h->machine, sizeof(name->machine));
> > +	memcpy(name->domainname, h->domainname, sizeof(name->domainname));
> > +	up_read(&uts_sem);
> 
> Could you please explain what for is this down_read() / up_read() ?
> You operate only on local objects: 'name' points to just-created 
> uts_ns, 'h' is also local data.

Yup, good point, that looks unnecessary.

thanks,
-serge

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH v18 49/80] c/r: support for UTS namespace
       [not found]     ` <200909292213.21266-vlktAfmDCSZSq9BJjBFyUp/QNRX+jHPU@public.gmane.org>
  2009-09-29 18:51       ` Serge E. Hallyn
@ 2009-09-29 23:01       ` Oren Laadan
  1 sibling, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-29 23:01 UTC (permalink / raw)
  To: Nikita V. Youshchenko
  Cc: linux-api, containers, linux-kernel, linux-mm, Dan Smith,
	Andrew Morton, Ingo Molnar, Pavel Emelyanov



Nikita V. Youshchenko wrote:
>> +static struct uts_namespace *do_restore_uts_ns(struct ckpt_ctx *ctx)
>> ...
>> +#ifdef CONFIG_UTS_NS
>> +	uts_ns = create_uts_ns();
>> +	if (!uts_ns) {
>> +		uts_ns = ERR_PTR(-ENOMEM);
>> +		goto out;
>> +	}
>> +	down_read(&uts_sem);
>> +	name = &uts_ns->name;
>> +	memcpy(name->sysname, h->sysname, sizeof(name->sysname));
>> +	memcpy(name->nodename, h->nodename, sizeof(name->nodename));
>> +	memcpy(name->release, h->release, sizeof(name->release));
>> +	memcpy(name->version, h->version, sizeof(name->version));
>> +	memcpy(name->machine, h->machine, sizeof(name->machine));
>> +	memcpy(name->domainname, h->domainname, sizeof(name->domainname));
>> +	up_read(&uts_sem);
> 
> Could you please explain what for is this down_read() / up_read() ?
> You operate only on local objects: 'name' points to just-created 
> uts_ns, 'h' is also local data.

Nothing more than symmetry with checkpoint code, and a pedagogical
aspect...

Can be replaced by a suitable comment.
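
For illustration only, a userspace sketch of the copy with the lock
dropped and a comment in its place. The struct, its field size and the
function name are stand-ins assumed for this sketch, not the kernel
definitions or the code from the patch.

```c
#include <assert.h>
#include <string.h>

#define UTS_LEN_SKETCH 65	/* stand-in field size, assumed for this sketch */

/* Stand-in for the six name fields being restored. */
struct uts_name_sketch {
	char sysname[UTS_LEN_SKETCH];
	char nodename[UTS_LEN_SKETCH];
	char release[UTS_LEN_SKETCH];
	char version[UTS_LEN_SKETCH];
	char machine[UTS_LEN_SKETCH];
	char domainname[UTS_LEN_SKETCH];
};

/*
 * Copy the saved names from the checkpoint header 'h' into a freshly
 * created namespace 'name'.
 *
 * No uts_sem here: the lock presumably matters on the checkpoint side,
 * which reads a live namespace. On restore, 'name' is not yet visible
 * to any other task and 'h' is a private buffer, so nothing can race.
 */
static void restore_uts_names(struct uts_name_sketch *name,
			      const struct uts_name_sketch *h)
{
	assert(name != h);
	memcpy(name->sysname, h->sysname, sizeof(name->sysname));
	memcpy(name->nodename, h->nodename, sizeof(name->nodename));
	memcpy(name->release, h->release, sizeof(name->release));
	memcpy(name->version, h->version, sizeof(name->version));
	memcpy(name->machine, h->machine, sizeof(name->machine));
	memcpy(name->domainname, h->domainname, sizeof(name->domainname));
}
```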

Oren.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]       ` <4AC20BB8.4070509-GANU6spQydw@public.gmane.org>
@ 2009-09-30 14:39         ` Dan Smith
  0 siblings, 0 replies; 327+ messages in thread
From: Dan Smith @ 2009-09-30 14:39 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-api, containers, linux-kernel, linux-mm, Andrew Morton,
	torvalds, mingo, xemul

DL> Ok for the restart, but for the checkpoint, how do you access the
DL> network setup from a process which belongs to another namespace
DL> context ?

So far the discussion has led to the kernel dumping all of that
information on checkpoint, and then splitting it up into what can be
done by userspace on restart and what still needs to be in the kernel.

Similarly, the task structure is currently exported by the kernel on
checkpoint, but recreated in userspace on restart.
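
As a rough illustration of the restart half (a sketch, not taken from
user-cr): userspace could read a table of (pid, parent pid) pairs from
the image and work out how deep each task sits in the tree, so that
parents are created before their children. The record layout and the
function names here are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task record as userspace might read it from the image. */
struct task_rec_sketch {
	int pid;
	int ppid;	/* 0 marks the root of the checkpointed tree */
};

/* Index of the record with the given pid, or -1 if absent. */
static int find_task(const struct task_rec_sketch *t, size_t n, int pid)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (t[i].pid == pid)
			return (int)i;
	return -1;
}

/*
 * Depth of task 'idx' below the root (root = 0), or -1 if its ancestry
 * is broken (missing parent or a cycle). Creating tasks in order of
 * increasing depth guarantees each parent exists before its children.
 */
static int task_depth(const struct task_rec_sketch *t, size_t n, int idx)
{
	int depth = 0;

	assert(idx >= 0 && (size_t)idx < n);
	while (t[idx].ppid != 0) {
		idx = find_task(t, n, t[idx].ppid);
		if (idx < 0)
			return -1;	/* parent not in the image */
		if ((size_t)++depth >= n)
			return -1;	/* more hops than tasks: a cycle */
	}
	return depth;
}
```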

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]         ` <87iqf0o5sf.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2009-09-30 16:16           ` Daniel Lezcano
  0 siblings, 0 replies; 327+ messages in thread
From: Daniel Lezcano @ 2009-09-30 16:16 UTC (permalink / raw)
  To: Dan Smith
  Cc: linux-api, containers, linux-kernel, linux-mm, Andrew Morton,
	torvalds, mingo, xemul

Dan Smith wrote:
> DL> Ok for the restart, but for the checkpoint, how do you access the
> DL> network setup from a process which belongs to another namespace
> DL> context ?
>
> So far the discussion has led to the kernel dumping all of that
> information on checkpoint, and then splitting it up into what can be
> done by userspace on restart and what still needs to be in the kernel.
>

Ah, this was a discussion on the containers@ mailing list?
Sorry I missed it, I will look at the containers@ archives.

If the checkpoint is done from the kernel, why wouldn't the restart be
in the kernel too?
Do you have a list of what is restartable from userspace and what from
the kernel?

Is there any documentation about the statefile format that I can use if
I want to implement a userspace CR solution myself based on this kernel
patchset?

> Similarly, the task structure is currently exported by the kernel on
> checkpoint, but recreated in userspace on restart.

(I guess you meant the task hierarchy/tree.) Well, I understand why this
is done from userspace, but I don't like the idea of digging in the
statefile. Still, there's no accounting for taste :)

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]           ` <4AC38477.4070007-GANU6spQydw@public.gmane.org>
@ 2009-09-30 16:29             ` Dan Smith
  0 siblings, 0 replies; 327+ messages in thread
From: Dan Smith @ 2009-09-30 16:29 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-api, containers, linux-kernel, linux-mm, Andrew Morton,
	torvalds, mingo, xemul

DL> If the checkpoint is done from the kernel, why wouldn't the
DL> restart be in the kernel too?

I think thus far we have taken the approach of "if it can be done
reasonably in userspace, then do it there", right?  Setup of the
network devices is easy to do in userspace, allows more flexibility
from a policy standpoint, and ensures that all existing security
checks are performed.  Also, migration may be easier if the userspace
bits can call custom hooks allowing for routing changes and other
infrastructure-specific operations.

DL> Is there any documentation about the statefile format that I can
DL> use if I want to implement a userspace CR solution myself based on
DL> this kernel patchset?

See linux-cr/include/linux/checkpoint_hdr.h and user-cr/restart.c.

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]             ` <87eipoo0po.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2009-09-30 18:01               ` Daniel Lezcano
  0 siblings, 0 replies; 327+ messages in thread
From: Daniel Lezcano @ 2009-09-30 18:01 UTC (permalink / raw)
  To: Dan Smith
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-X9Un+BFzKDI,
	xemul-GEFAQzZX7r8dnm+yROfE0A

Dan Smith wrote:
> DL> If the checkpoint is done from the kernel, why the restart
> DL> wouldn't be in the kernel too ?
>
> I think thus far we have taken the approach of "if it can be done
> reasonably in userspace, then do it there" right?  
Well I am a little lost :)
The tty CR can be "reasonably" done in userspace I think. But it was 
done in the kernel, no ?

> Setup of the
> network devices is easy to do in userspace, allows more flexibility
> from a policy standpoint, and ensures that all existing security
> checks are performed. 
Yep, I agree. But you didn't answer the question: what are the network
resources you plan to checkpoint / restart ?
E.g. if you let the container set up the network, will you restore the netdev
statistics ? the mac address ? ipv4 ? ipv6 ?

Is it possible to give a detailed list of the network resources you plan to CR,
showing which items you will handle from userspace and which from kernel space ?

> Also, migration may be easier if the userspace
> bits can call custom hooks allowing for routing changes and other
> infrastructure-specific operations.
>   
You may have some problems with the connected sockets you will restore 
in this case.

> DL> Is there any documentation about the statefile format I can use if
> DL> I want to implement myself an userspace CR solution based on this
> DL> kernel patchset ?
>
> See linux-cr/include/linux/checkpoint_hdr.h and user-cr/restart.c.
>   
Argh ! I was hoping there was something else than the source code :)

Thanks
  -- Daniel

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]               ` <4AC39CE5.9080908-GANU6spQydw@public.gmane.org>
@ 2009-09-30 18:28                 ` Dan Smith
  0 siblings, 0 replies; 327+ messages in thread
From: Dan Smith @ 2009-09-30 18:28 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-X9Un+BFzKDI,
	xemul-GEFAQzZX7r8dnm+yROfE0A

DL> Yep, I agree. But you didn't answer the question, what are the
DL> network resources you plan to checkpoint / restart ?  eg. you let
DL> the container to setup your network, will you restore netdev
DL> statistics ? the mac address ? ipv4 ? ipv6 ?

Yes, Yes, Yes, and Yes.  I'm making the assumption that the common
case will be with a veth device in the container and that all of the
aforementioned attributes should be copied over.  In the future case
where we could potentially have a real device in the container, it
probably doesn't make sense to copy the mac address.

DL> Is it possible to do a detailed list of network resources you plan
DL> to CR with the different items you will address from userspace and
DL> kernel space ?

I'm sure it's possible, but no, I haven't planned out everything for
the next year.  If you have strong feelings about what should be done
in user and kernel space, feel free to share :)

DL> Argh ! I was hoping there was something else than the source code

The header file makes it pretty clear what is going on, but maybe the
Documentation/checkpoint/readme.txt will help.  Putting all the
details in such a documentation file would be rather silly at the
moment, given that new things are being added at a rapid rate and it
would duplicate the only description that matters, which is the
header file.

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]                 ` <877hvgnv6z.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2009-09-30 22:08                   ` Daniel Lezcano
  0 siblings, 0 replies; 327+ messages in thread
From: Daniel Lezcano @ 2009-09-30 22:08 UTC (permalink / raw)
  To: Dan Smith
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-X9Un+BFzKDI,
	xemul-GEFAQzZX7r8dnm+yROfE0A

Dan Smith wrote:
> DL> Yep, I agree. But you didn't answer the question, what are the
> DL> network resources you plan to checkpoint / restart ?  eg. you let
> DL> the container to setup your network, will you restore netdev
> DL> statistics ? the mac address ? ipv4 ? ipv6 ?
>
> Yes, Yes, Yes, and Yes.  I'm making the assumption that the common
> case will be with a veth device in the container and that all of the
> aforementioned attributes should be copied over.  In the future case
> where we could potentially have a real device in the container, it
> probably doesn't make sense to copy the mac address.
>   

Be careful with the assumptions ;)
> DL> Is it possible to do a detailed list of network resources you plan
> DL> to CR with the different items you will address from userspace and
> DL> kernel space ?
>
> I'm sure it's possible, but no, I haven't planned out everything for
> the next year.  If you have strong feelings about what should be done
> in user and kernel space, feel free to share :)
>   

Dan,

I just want to understand what your plan is. If you say "yes, I will
checkpoint / restart ipv4, ipv6, netdev statistics, etc.", you should
be able to give at least a small list of the network resources you will
checkpoint and how you will restart them, no ?

> DL> Argh ! I was hoping there was something else than the source code
>
> The header file makes it pretty clear what is going on, 
Certainly for you.
We are a little far away from the simple sys_checkpoint / sys_restart
syscalls we talked about at the cr-minisummit in 2008.

Regards,
     -- Daniel

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]                   ` <4AC3D6DC.9010500-GANU6spQydw@public.gmane.org>
@ 2009-09-30 22:24                     ` Serge E. Hallyn
  2009-09-30 22:42                     ` Serge E. Hallyn
  1 sibling, 0 replies; 327+ messages in thread
From: Serge E. Hallyn @ 2009-09-30 22:24 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Dan Smith, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
> Dan Smith wrote:
>> DL> Argh ! I was hoping there was something else than the source code
>>
>> The header file makes it pretty clear what is going on, 
> Certainly for you.
> We are a little far away of the sys_checkpoint / sys_restart simple  
> syscalls we talked about at the cr-minisummit in 2008.

Yes.  We did those, and were asked to provide a fuller
implementation.  Here is what we have so far...  We're
open, in fact eager, to see other people's implementations
based on different designs, but here is what *we*'ve got.

Oh, and it was NEVER said at the cr-minisummit that we would have
simple syscalls doing a full, complicated c/r.  Only that
we would start with simple syscalls to do a single task with
no open resources.  And you can still do that with v18, with
one simple syscall.

-serge

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]                   ` <4AC3D6DC.9010500-GANU6spQydw@public.gmane.org>
  2009-09-30 22:24                     ` Serge E. Hallyn
@ 2009-09-30 22:42                     ` Serge E. Hallyn
       [not found]                       ` <20090930224249.GA22977-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 327+ messages in thread
From: Serge E. Hallyn @ 2009-09-30 22:42 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Dan Smith, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
> Dan Smith wrote:
>> The header file makes it pretty clear what is going on, 
> Certainly for you.

If you're worried about hooking lxc-restart up and that
being a mess, I have said that as soon as something hits -mm,
I will hook up lxc-restart.  I do agree, the userspace code
would be much simpler if we didn't need to do all of the
process tree creation in userspace :)  I *would* prefer if
we didn't have to parse the image in userspace.  But the
moment it was decided that portability across kernel versions
would be done by having userspace process the image, we lost
that fight.

-serge

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]                       ` <20090930224249.GA22977-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-01  8:07                         ` Daniel Lezcano
       [not found]                           ` <4AC4634F.6070901-GANU6spQydw@public.gmane.org>
  0 siblings, 1 reply; 327+ messages in thread
From: Daniel Lezcano @ 2009-10-01  8:07 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dan Smith, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Serge E. Hallyn wrote:
> Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
>   
>> Dan Smith wrote:
>>     
>>> The header file makes it pretty clear what is going on, 
>>>       
>> Certainly for you.
>>     
>
> If you're worried about hooking lxc-restart up and that
> being a mess, 
Yep, I am worried about that too :)

> I have said that as soon as something hits -mm,
> I will hook up lxc-restart.  I do agree, the userspace code
> would be much simpler if we didn't need to do all of the
> process tree creation in userspace :)
Yes, and I know this point was discussed several times for the
proctree; I won't argue about kernel vs user proctree creation.
But what I understood is that you will continue to parse the statefile to
recreate some other resources, like a subset of the network, and here I am
lost.
Who in the linux community will understand what is checkpointed, and what
is restored from the kernel and what from userspace ?

Does this imply that someone has to use a specific tool like "restart.c"
within their own tools, assuming this tool is installed on the system, or
should they copy-paste the code of the GPL-licensed restart.c into their
LGPL-licensed tools ?

I am not arguing against Checkpoint / Restart; IMO the general
approach is good.
But I am worried about who will be able to implement a CR solution
using what is provided by the kernel, apart from the few people who
implemented it.

>   I *would* prefer if
> we didn't have to parse the image in userspace.  But the
> moment it was decided that portability across kernel versions
> woudl be done by having userspace process the image, we lost
> that fight.
>   
Yeah, a big deal.

Thanks
    -- Daniel


* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]                           ` <4AC4634F.6070901-GANU6spQydw@public.gmane.org>
@ 2009-10-01 14:55                             ` Serge E. Hallyn
       [not found]                               ` <20091001145512.GA17916-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 327+ messages in thread
From: Serge E. Hallyn @ 2009-10-01 14:55 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Dan Smith, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
> Serge E. Hallyn wrote:
>> Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
>>   
>>> Dan Smith wrote:
>>>     
>>>> The header file makes it pretty clear what is going on,       
>>> Certainly for you.
>>>     
>>
>> If you're worried about hooking lxc-restart up and that
>> being a mess, 
> Yep, I am worried about that too :)
>
>> i have said that as soon as something hits -mm,
>> I will hook up lxc-restart.  I do agree, the userspace code
>> would be much simpler if we didn't need to do all of the
>> process tree creation in userspace :)
> Yes and I know there were discussions about this point several times for  
> the proctree, I won't argue with kernel vs user proctree creation.
> But what I understood is you will continue to parse the statefile to  
> recreate some other resources like a subset of the network and here I am  
> lost.

If network devices end up being recreated in userspace - either the
ones for the root restarted container, or all the devices including
for any child network namespaces - then I believe they will be
considered container objects.  All the container information is at the
top of the checkpoint file, so the program coordinating the restart
will see all of the information before the task hierarchy.  Actually, I
thought linux-2.6/Documentation/checkpoint/readme.txt used to
explicitly show a 'container information' section between the
image header and task hierarchy.  Oren?

> Who in the linux community will understand what is checkpointed and what  
> is restored from the kernel or from the userspace ?

It should all be documented in linux-2.6/Documentation/checkpoint/.  But
right now it's not even settled whether process creation in userspace
is going to be the final acceptable way, so documenting speculation
about how we're going to do network devices just seems too certain to
not end up matching reality.

> Does this imply someone has to use a specific tool like "restart.c"  
> within its own tools, assuming this tool is installed in the system or  
> shall he copy-paste the code of the GPL licensed restart.c to its LGPL  
> licensed tools ?

Hmm, I think a tiny little lgpl library, maybe even shipping under the
kernel tree, implementing a generic, whole-container and sub-tree
checkpoint and restart, makes very good sense.

It certainly does NOT make sense to require multiple projects to track
all changes to the checkpoint image format as the kernel changes...

-serge


* Re: [PATCH 00/80] Kernel based checkpoint/restart [v18]
       [not found]                               ` <20091001145512.GA17916-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-01 19:02                                 ` Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-10-01 19:02 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dan Smith



Serge E. Hallyn wrote:
> Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
>> Serge E. Hallyn wrote:
>>> Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
>>>   
>>>> Dan Smith wrote:
>>>>     
>>>>> The header file makes it pretty clear what is going on,       
>>>> Certainly for you.
>>>>     
>>> If you're worried about hooking lxc-restart up and that
>>> being a mess, 
>> Yep, I am worried about that too :)
>>
>>> i have said that as soon as something hits -mm,
>>> I will hook up lxc-restart.  I do agree, the userspace code
>>> would be much simpler if we didn't need to do all of the
>>> process tree creation in userspace :)
>> Yes and I know there were discussions about this point several times for  
>> the proctree, I won't argue with kernel vs user proctree creation.
>> But what I understood is you will continue to parse the statefile to  
>> recreate some other resources like a subset of the network and here I am  
>> lost.
> 
> If network devices end up being recreated in userspace - either the
> ones for the root restarted container, or all the devices including
> for any child network namespaces - then I believe they will be
> considered container objects.  All the container information is at the
> top of the checkpoint file, so the program coordinating the restart
> will see all of the information before the task hierarchy.  Actually, I
> thought linux-2.6/Documentation/checkpoint/readme.txt used to
> explicitly show a 'container information' section between the
> image header and task hierarchy.  Oren?

It used to. I'll put it back there, now that somebody cares :)

But yes, the idea is that the part in which userspace is involved
is at the beginning of the image file, to avoid kernel-userspace
bouncing.
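
As a rough illustration of that ordering (not the actual image format),
the sketch below walks a stream of length-prefixed records and stops at
the first record that userspace does not own. The record header, the
SEC_* values and userspace_prefix_len() are made up for this example;
the real layout is whatever the headers and Documentation/checkpoint/
define.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header; NOT the real struct ckpt_hdr layout. */
struct rec_hdr {
	uint32_t type;	/* one of the SEC_* values below */
	uint32_t len;	/* payload length in bytes, header excluded */
};

/* Hypothetical section types; values below SEC_KERNEL belong to userspace. */
enum {
	SEC_IMAGE_HDR = 1,
	SEC_CONTAINER = 2,
	SEC_TASK_TREE = 3,
	SEC_KERNEL    = 100
};

/*
 * Walk the records at the front of the image and return the offset of
 * the first record that userspace does not own. Everything from that
 * offset on would be handed to the kernel untouched.
 * Returns (size_t)-1 if a record is truncated.
 */
static size_t userspace_prefix_len(const unsigned char *img, size_t size)
{
	size_t off = 0;

	assert(img != NULL);
	while (off < size) {
		struct rec_hdr h;

		if (size - off < sizeof(h))
			return (size_t)-1;
		memcpy(&h, img + off, sizeof(h));	/* avoid unaligned reads */
		if (h.len > size - off - sizeof(h))
			return (size_t)-1;
		if (h.type >= SEC_KERNEL)
			break;	/* first kernel-owned record: stop here */
		/* a real coordinator would act on the payload here */
		off += sizeof(h) + h.len;
	}
	return off;
}
```

With this shape the coordinator never has to seek back: it reads forward
once, and the file position it ends at is where the kernel would take
over.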

> 
>> Who in the linux community will understand what is checkpointed and what  
>> is restored from the kernel or from the userspace ?

Interested developers, and they can look at the code and read the
documentation. Assuming suitable tools are available for the users,
who else would care ?

> 
> It should all be documented linux-2.6/Documentation/checkpoint/.  But
> right now it's not even settled whether process creation in userspace
> is going to be the final acceptable way, so documenting speculation
> about how we're going to do network devices just seems too certain to
> not end up matching reality.

Here is my view on the subject (I'll add it to the documentation).

* By design, we do everything in the kernel, unless there is a
strong reason to move something to userspace.

* Keep a clear distinction for what we move to userspace. Avoid
doing something partly in kernel and partly in userspace.

* Streamline logic, so the execution flow doesn't bounce in and
out of the kernel. What's done in userspace appears _first_ in
the image file.

* My rules of thumb to take something to userspace are:

1. If it can be done _easily_ and _uniformly_ in userspace, _and_

2. If doing it in userspace gives us _substantial_ flexibility,
 or coverage, or  portability that is difficult in the kernel.

Let me demonstrate this thinking with 3 examples: open files (in
kernel), tasks tree (userspace), and network namespaces (userspace).

Why not restore open files in userspace ?  because some file types
are hard to do in userspace, and we want uniform handling. Also, it
becomes very tricky to do in userspace in the presence of mount
points, chroots, and mount namespaces.

Why restore the tasks tree in userspace ?  because it's easy and
portable (using the new clone); because it is unrelated to where/how
we do other resources; because it gives the restarting tasks the
flexibility to do useful things after they are created but
before they call sys_restart.
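
To show how little the userspace side needs, here is a sketch that forks
a small process tree from a table of (id, parent) pairs. It uses plain
fork(), so it does not restore the original pids (that is what
clone_with_pids is for), and struct task_desc and spawn_children() are
hypothetical names. Each child waits for its own subtree and exits, at
the point where a real restarting task would go on to call sys_restart().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_DEPTH 16	/* guards against cycles in a malformed table */

/* One entry of a hypothetical task table read from the image. */
struct task_desc {
	int id;		/* task identifier within the table */
	int parent;	/* id of the parent, -1 for the root */
};

/*
 * Fork every task whose parent is `self`, recursively. The caller plays
 * the role of task `self`. Returns the number of direct children that
 * were created and exited cleanly, or -1 on any failure.
 */
static int spawn_children(const struct task_desc *t, size_t n, int self,
			  int depth)
{
	size_t i;
	int created = 0, failed = 0;

	assert(t != NULL);
	if (depth > MAX_DEPTH)
		return -1;

	for (i = 0; i < n; i++) {
		pid_t pid;

		if (t[i].parent != self || t[i].id == self)
			continue;
		pid = fork();
		if (pid < 0) {
			failed = 1;
			break;
		}
		if (pid == 0) {
			/* child: build our own subtree first */
			int ret = spawn_children(t, n, t[i].id, depth + 1);

			/* a restarting task would call sys_restart() here */
			_exit(ret < 0 ? 1 : 0);
		}
		created++;
	}

	/* reap our direct children and check that each one succeeded */
	for (i = 0; i < (size_t)created; i++) {
		int status;

		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			failed = 1;
	}
	return failed ? -1 : created;
}
```

The gap between fork() returning in the child and the final call is
where the restarting task is free to do other useful work.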

Why restore network namespaces (their setup) in userspace ? because
it's easy and portable, and network setup tools are well developed
and understood; because we will want some policy to it (e.g. restart
doesn't care which actual device is used, we don't want such a decision
to be in the kernel); because we don't want to replicate the rich
high level tools from userspace inside the kernel; because you may
want to change the configuration compared to checkpoint (e.g. add
a firewall).

(That said, this is not written in stone, and if you have strong
arguments to convince otherwise, this is a good time).

> 
>> Does this imply someone has to use a specific tool like "restart.c"  
>> within its own tools, assuming this tool is installed in the system or  
>> shall he copy-paste the code of the GPL licensed restart.c to its LGPL  
>> licensed tools ?
> 
> Hmm, I think a tiny little lgpl library, maybe even shipping under the
> kernel tree, implementing a generic, whole-container and sub-tree
> checkpoint and restart, makes very good sense.
> 
> It certainly does NOT make sense to require multiple projects to track
> all changes to the checkpoint image format as the kernel changes...
> 

The idea is to add a plugin architecture to restart to allow users
to execute any useful work before and after the tasks tree is created,
but prior to calling sys_restart().

And yes, eventually making this a library, too.
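
A minimal sketch of what such a plugin interface could look like,
assuming two hook points (before and after the tasks tree is created).
All names here are hypothetical and are not taken from user-cr; the
sketch only shows the intended order of operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook points around task-tree creation. */
enum restart_stage { STAGE_PRE_TREE, STAGE_POST_TREE, STAGE_MAX };

typedef int (*restart_hook_fn)(void *arg);

struct restart_hook {
	restart_hook_fn fn;
	void *arg;
};

#define MAX_HOOKS 8

struct restart_plugins {
	struct restart_hook hooks[STAGE_MAX][MAX_HOOKS];
	size_t count[STAGE_MAX];
};

/* Register a hook for one stage; returns 0 on success, -1 if full. */
static int plugin_register(struct restart_plugins *p, enum restart_stage st,
			   restart_hook_fn fn, void *arg)
{
	assert(p != NULL && fn != NULL && st < STAGE_MAX);
	if (p->count[st] == MAX_HOOKS)
		return -1;
	p->hooks[st][p->count[st]].fn = fn;
	p->hooks[st][p->count[st]].arg = arg;
	p->count[st]++;
	return 0;
}

/* Run the hooks of one stage in registration order; stop at first error. */
static int plugin_run(const struct restart_plugins *p, enum restart_stage st)
{
	size_t i;

	for (i = 0; i < p->count[st]; i++) {
		int ret = p->hooks[st][i].fn(p->hooks[st][i].arg);

		if (ret)
			return ret;
	}
	return 0;
}

/*
 * Order of operations: pre-tree hooks, create the tasks tree, post-tree
 * hooks, and only then the restart call. The two callbacks stand in for
 * the real work and are supplied by the caller.
 */
static int coordinate_restart(const struct restart_plugins *p,
			      const struct restart_hook *create_tree,
			      const struct restart_hook *do_restart)
{
	int ret;

	if ((ret = plugin_run(p, STAGE_PRE_TREE)))
		return ret;
	if ((ret = create_tree->fn(create_tree->arg)))
		return ret;
	if ((ret = plugin_run(p, STAGE_POST_TREE)))
		return ret;
	return do_restart->fn(do_restart->arg);
}

/* Example hook used to observe ordering: appends one char to a trace. */
struct trace { char buf[16]; size_t len; };
struct trace_step { struct trace *t; char c; };

static int trace_hook(void *arg)
{
	struct trace_step *s = arg;

	if (s->t->len + 1 >= sizeof(s->t->buf))
		return -1;
	s->t->buf[s->t->len++] = s->c;
	s->t->buf[s->t->len] = '\0';
	return 0;
}
```

A network-setup plugin, for instance, would register on the pre-tree
stage, so that devices exist before any task is created.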

Oren.


* [PATCH 00/80] Kernel based checkpoint/restart [v18]
@ 2009-09-23 23:50 Oren Laadan
  0 siblings, 0 replies; 327+ messages in thread
From: Oren Laadan @ 2009-09-23 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Ingo Molnar,
	Pavel Emelyanov

Hi Andrew,

This is our recent round of checkpoint/restart patches. It can
checkpoint and restart interactive sessions of 'screen' across 
kernel reboot. Please consider applying to -mm.

Patches 1-17 are clean-ups and preparations for c/r:
 * 1,2,3,4 and 9,10: cleanups, also useful for c/r.
 * 5,6: fix freezer control group
 * 7,8: extend freezer control group for c/r.
 * 11-17: clone_with_pid

Patch 18 reserves the system call slots - please apply so we
don't need to keep changing them.

Patches 19-80 contain the actual c/r code; we've exhausted the
reviewers for most of them.

Patch 32 implements a deferqueue - a mechanism for a process to
defer work for some later time (unlike workqueue, designed for
the work to execute in the context of the same/original process).
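
For readers unfamiliar with the idea, here is a userspace analogue of a
deferqueue in plain C. It is not the kernel code from patch 32 and does
not use its API; dq_init(), dq_add() and dq_run() are names invented for
this sketch, and running entries in FIFO order is a choice of the sketch.
It only shows the core property: queued work runs later, in the context
of whoever calls dq_run().

```c
#include <assert.h>
#include <stdlib.h>

typedef int (*dq_func_t)(void *data);

struct dq_entry {
	struct dq_entry *next;
	dq_func_t func;
	void *data;
};

struct dq_head {
	struct dq_entry *first;
	struct dq_entry **tail;	/* where the next entry gets linked */
};

static void dq_init(struct dq_head *h)
{
	h->first = NULL;
	h->tail = &h->first;
}

/* Queue func(data) for later; returns 0 on success, -1 if out of memory. */
static int dq_add(struct dq_head *h, dq_func_t func, void *data)
{
	struct dq_entry *e = malloc(sizeof(*e));

	assert(func != NULL);
	if (!e)
		return -1;
	e->next = NULL;
	e->func = func;
	e->data = data;
	*h->tail = e;
	h->tail = &e->next;
	return 0;
}

/*
 * Run all queued work in FIFO order, in the context of the caller.
 * After the first failing entry no further work is called, but every
 * entry is still freed. Returns the number of entries that ran
 * successfully, or the negative value returned by the failing entry.
 */
static int dq_run(struct dq_head *h)
{
	struct dq_entry *e = h->first;
	int ran = 0, err = 0;

	while (e) {
		struct dq_entry *next;

		if (!err) {
			int ret = e->func(e->data);

			if (ret < 0)
				err = ret;
			else
				ran++;
		}
		next = e->next;	/* read after the call: work may queue more */
		free(e);
		e = next;
	}
	dq_init(h);
	return err ? err : ran;
}

/* Example deferred work: add 1 to the int that data points to. */
static int bump(void *data)
{
	++*(int *)data;
	return 0;
}
```

No thread or worker is involved, which is the difference from a
workqueue that the text above points out.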

Thanks,

Oren.

----

Application checkpoint/restart (c/r) is the ability to save the state
of a running application so that it can later resume its execution
from the time at which it was checkpointed, on the same or a different
machine.

This version brings support for many new features, including
unix domain sockets, fifos, pseudo-terminals, and signals (see the
detailed changelog below).

With these in place, it can now checkpoint and restart not only batch
jobs, but also interactive programs using 'screen'. For example, users
can checkpoint a 'screen' session with multiple shells, upgrade their
kernel, reboot, and restart their interactive 'screen' session from
before !

This patchset was compiled and tested against v2.6.31. For more
information, check out Documentation/checkpoint/*.txt

Q: How useful is this code as it stands in real-world usage?
A: The application can consist of one or more processes and threads. It
   handles open files (regular files/directories on most file systems,
   pipes, fifos, af_unix sockets, /dev/{null,zero,random,urandom} and
   pseudo-terminals). It supports shared memory and sysv IPC (except
   undo of semaphores). It's suitable for many types of batch jobs as
   well as some interactive jobs. (Note: it is assumed that the fs view
   is available at restart).

Q: What can it checkpoint and restart ?
A: A (single threaded) process can checkpoint itself, aka "self"
   checkpoint, if it calls the new system calls. Otherwise, for an
   "external" checkpoint, the caller must first freeze the target
   processes. One can either checkpoint an entire container (and
   we make best effort to ensure that the result is self-contained),
   or merely a subtree of a process hierarchy.

Q: What about namespaces ?
A: Currently, UTS and IPC namespaces are restored. They demonstrate
   how namespaces are handled. More to come.

Q: What additional work needs to be done to it?
A: Fill in the gory details following the examples so far. Current WIP
   includes inet sockets, event-poll, and early work on inotify, mount
   namespace and mount-points, pseudo file systems, and x86_64 support.
   
Q: How can I try it ?
A: Use it for simple batch jobs (pipes, too), or an interactive
   'screen' session, in a whole container or just a subtree of
   tasks:

   create the freezer cgroup:
     $ mount -t cgroup -ofreezer freezer /cgroup
     $ mkdir /cgroup/0
   
   run the test, freeze it:  
     $ test/multitask &
     [1] 2754
     $ for i in `pidof multitask`; do echo $i > /cgroup/0/tasks; done
     $ echo FROZEN > /cgroup/0/freezer.state
   
   checkpoint:
     $ ./ckpt 2754 > ckpt.out
   
   restart:
     $ ./mktree < ckpt.out
   
   voila :)
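
The freeze step can also be done from a program instead of the shell.
A minimal C sketch, assuming the freezer cgroup is mounted as in the
example above; freezer_set_state() is a helper name made up here, and
"THAWED" is the value the freezer accepts to resume the tasks.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write `state` (e.g. "FROZEN" or "THAWED") to <cgroup_dir>/freezer.state.
 * Returns 0 on success, -1 on failure.
 */
static int freezer_set_state(const char *cgroup_dir, const char *state)
{
	char path[4096];
	FILE *f;
	int n, ok;

	assert(cgroup_dir != NULL && state != NULL);
	n = snprintf(path, sizeof(path), "%s/freezer.state", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path did not fit */
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(state, f) >= 0;
	if (fclose(f) != 0)	/* the write may only fail at flush time */
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would use freezer_set_state("/cgroup/0", "FROZEN") before
running ./ckpt and freezer_set_state("/cgroup/0", "THAWED") afterwards.
Freezing is not necessarily instantaneous, so a careful caller would read
freezer.state back and wait until it no longer reports FREEZING.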
   
To do all this, you'll need:

The git tree tracking v18, branch 'ckpt-v18' (and past versions):
	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

The userspace tools are available through the matching branch [v18]:
	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git


Changelog:

[2009-Sep-22] v18

  (new features)
  - [Nathan Lynch] Re-introduce powerpc support
  - Save/restore pseudo-terminals
  - Save/restore (pty) controlling terminals
  - Save/restore PGIDs
  - [Dan Smith] Save/restore unix domain sockets
  - Save/restore FIFOs
  - Save/restore pending signals
  - Save/restore rlimits
  - Save/restore itimers
  - [Matt Helsley] Handle many non-pseudo file-systems

  (other changes)
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - [Nathan Lynch] discard const from struct cred * where appropriate
  - [Serge Hallyn][s390] Set return value for self-checkpoint 
  - Handle kmalloc failure in restore_sem_array()
  - [IPC] Collect files used by shm objects
  - [IPC] Use file (not inode) as shared object on checkpoint of shm
  - More ckpt_write_err()s to give information on checkpoint failure
  - Adjust format of pipe buffer to include the mandatory pre-header
  - [LEAKS] Mark the backing file as visited at checkpoint
  - Tighten checks on supported vma to checkpoint or restart
  - [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
  - Introduce ckpt_collect_file() that also uses file->collect method
  - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
  - Fix leak-detection issue in collect_mm() (test for first-time obj)
  - Invoke set_close_on_exec() unconditionally on restart
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Interface to pass simple pointers as data with deferqueue
  - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
  - Replace EAGAIN with EBUSY where necessary
  - Introduce CKPT_OBJ_VISITED in leak detection
  - ckpt_obj_collect() returns objref for new objects, 0 otherwise
  - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
  - Introduce ckpt_obj_visit() to mark objects as visited
  - Set the CHECKPOINTED flag on objects before calling checkpoint
  - Introduce ckpt_obj_reserve()
  - Change ref_drop() to accept a @lastref argument (for cleanup)
  - Disallow multiple objects with same objref in restart
  - Allow _ckpt_read_obj_type() to read header only (w/o payload)
  - Fix leak of ckpt_ctx when restoring zombie tasks
  - Fix race of prepare_descendant() with an ongoing fork()
  - Track and report the first error if restart fails
  - Tighten logic to protect against bogus pids in input
  - [Matt Helsley] Improve debug output from ckpt_notify_error()
  - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
  - Detect error-headers in input data on restart, and abort.
  - Standard format for checkpoint error strings (and documentation)
  - [Dan Smith] Add an errno validation function
  - Add ckpt_read_payload(): read a variable-length object (no header)
  - Add ckpt_read_string(): same for strings (ensures null-terminated)
  - Add ckpt_read_consume(): consumes next object without processing
  - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile

[2009-Jul-21] v17
  - Introduce syscall clone_with_pids() to restore original pids
  - Support threads and zombies
  - Save/restore task->files
  - Save/restore task->sighand
  - Save/restore futex
  - Save/restore credentials
  - Introduce PF_RESTARTING to skip notifications on task exit
  - restart(2) allow caller to ask to freeze tasks after restart
  - restart(2) isn't idempotent: return -EINTR if interrupted
  - Improve debugging output handling 
  - Make multi-process restart logic more robust and complete
  - Correctly select return value for restarting tasks on success
  - Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
  - Use CHECKPOINTING state for frozen checkpointed tasks
  - Fix compilation without CONFIG_CHECKPOINT
  - Fix compilation with CONFIG_COMPAT
  - Fix headers includes and exports
  - Leak detection performed in two steps
  - Detect "inverse" leaks of objects (dis)appearing unexpectedly
  - Memory: save/restore mm->{flags,def_flags,saved_auxv}
  - Memory: only collect sub-objects of mm once (leak detection)
  - Files: validate f_mode after restore
  - Namespaces: leak detection for nsproxy sub-components
  - Namespaces: proper restart from namespace(s) without namespace(s)
  - Save global constants in header instead of per-object
  - IPC: replace sys_unshare() with create_ipc_ns()
  - IPC: restore objects in suitable namespace
  - IPC: correct behavior under !CONFIG_IPC_NS
  - UTS: save/restore all fields
  - UTS: replace sys_unshare() with create_uts_ns()
  - X86_32: sanitize cpu, debug, and segment registers on restart
  - cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
  - cgroup_freezer: add interface to freeze a cgroup (given a task)

[2009-May-27] v16
  - Privilege checks for IPC checkpoint
  - Fix error string generation during checkpoint
  - Use kzalloc for header allocation
  - Restart blocks are arch-independent
  - Redo pipe c/r using splice
  - Fixes to s390 arch
  - Remove powerpc arch (temporary)
  - Explicitly restore ->nsproxy
  - All objects in image are preceded by 'struct ckpt_hdr'
  - Fix leaks detection (and leaks)
  - Reorder of patchset
  - Misc bugs and compilation fixes

[2009-Apr-12] v15
  - Minor fixes

[2009-Apr-28] v14
  - Tested against kernel v2.6.30-rc3 on x86_32.
  - Refactor files checkpoint to use f_ops (file operations)
  - Refactor mm/vma to use vma_ops
  - Explicitly handle VDSO vma (and require compat mode)
  - Added code to c/r restart-blocks (restart timeout related syscalls)
  - Added code to c/r namespaces: uts, ipc (with Dan Smith)
  - Added code to c/r sysvipc (shm, msg, sem)
  - Support for VM_CLONE shared memory
  - Added resource leak detection for whole-container checkpoint
  - Added sysctl gauge to allow unprivileged restart/checkpoint
  - Improve and simplify the code and logic of shared objects
  - Rework image format: shared objects appear prior to their use
  - Merge checkpoint and restart functionality into same files
  - Massive renaming of functions: prefix "ckpt_" for generics,
    "checkpoint_" for checkpoint, and "restore_" for restart.
  - Report checkpoint errors as a valid (string record) in the output
  - Merged PPC architecture (by Nathan Lynch)
  - Requires updates to userspace tools too.
  - Misc nits and bug fixes

[2009-Mar-31] v14-rc2
  - Change along Dave's suggestion to use f_ops->checkpoint() for files
  - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
  - Merge support for PPC arch (Nathan Lynch)
  - Misc cleanups and fixes in response to comments

[2009-Mar-20] v14-rc1:
  - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
  - Check whether calls to cr_hbuf_get() succeed or fail.
  - Fixes to pipe c/r code
  - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
  - Refuse non-self checkpoint if a task isn't frozen
  - Use unsigned fields in checkpoint headers unless otherwise required
  - Rename functions in files c/r to better reflect their role
  - Add support for anonymous shared memory
  - Merge support for s390 arch (Dan Smith, Serge Hallyn)
    
[2008-Dec-03] v13:
  - Cleanups of 'struct cr_ctx' - remove unused fields
  - Misc fixes for comments
  
[2008-Dec-17] v12:
  - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
    (empty pgarr are saved in a separate pool chain)
  - Add a couple of missed calls to cr_hbuf_put()
  - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
  - Split cr_write/cr_read() to two parts: _cr_write/read() helper
  - Befriend with sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen,
    life span of the object hash, references to objects in the hash
 
[2008-Nov-26] v10:
  - Grab vfs root of container init, rather than current process
  - Acquire dcache_lock around call to __d_path() in cr_fill_name()
  - Force end-of-string in cr_read_string() (fix possible DoS)
  - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()

[2008-Nov-10] v9:
  - Support multiple processes c/r
  - Extend checkpoint header with architecture dependent header
  - Misc bug fixes (see individual changelogs)
  - Rebase to v2.6.28-rc3.

[2008-Oct-29] v8:
  - Support "external" checkpoint
  - Include Dave Hansen's 'deny-checkpoint' patch
  - Split docs in Documentation/checkpoint/..., and improve contents

[2008-Oct-17] v7:
  - Fix save/restore state of FPU
  - Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
  - Add assumptions and what's-missing to documentation
  - Misc fixes and cleanups

[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename headers files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use standard list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.

--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.
--


Thread overview: 327+ messages
2009-09-23 23:50 [PATCH 00/80] Kernel based checkpoint/restart [v18] Oren Laadan
2009-09-23 23:50 ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 01/80] c/r: extend arch_setup_additional_pages() Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 02/80] x86: ptrace debugreg checks rewrite Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 03/80] c/r: break out new_user_ns() Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 04/80] c/r: split core function out of some set*{u,g}id functions Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 05/80] cgroup freezer: Fix buggy resume test for tasks frozen with cgroup freezer Oren Laadan
2009-09-23 23:50 ` Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 06/80] cgroup freezer: Update stale locking comments Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 07/80] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 08/80] cgroup freezer: interface to freeze a cgroup from within the kernel Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 09/80] Namespaces submenu Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 10/80] c/r: make file_pos_read/write() public Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 11/80] pids 1/7: Factor out code to allocate pidmap page Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 12/80] pids 2/7: Have alloc_pidmap() return actual error code Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 13/80] pids 3/7: Add target_pid parameter to alloc_pidmap() Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 14/80] pids 4/7: Add target_pids parameter to alloc_pid() Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 15/80] pids 5/7: Add target_pids parameter to copy_process() Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 16/80] pids 6/7: Define do_fork_with_pids() Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 17/80] pids 7/7: Define clone_with_pids syscall Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 18/80] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-23 23:50 ` [PATCH v18 19/80] c/r: documentation Oren Laadan
2009-09-23 23:50   ` Oren Laadan
2009-09-24  9:10   ` Nikita V. Youshchenko
2009-09-23 23:51 ` [PATCH v18 20/80] c/r: basic infrastructure for checkpoint/restart Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-24 16:03   ` Daniel Walker
2009-09-24 16:03     ` Daniel Walker
2009-09-24 16:03     ` Daniel Walker
2009-09-24 17:33     ` Oren Laadan
2009-09-24 17:33       ` Oren Laadan
2009-09-24 17:33     ` Oren Laadan
     [not found]   ` <1253749920-18673-21-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-09-24 16:03     ` Daniel Walker
2009-09-23 23:51 ` [PATCH v18 21/80] c/r: x86_32 support " Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 22/80] c/r: external checkpoint of a task other than ourself Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 23/80] c/r: export functionality used in next patch for restart-blocks Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 24/80] c/r: restart-blocks Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 25/80] c/r: checkpoint multiple processes Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 26/80] c/r: restart " Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 27/80] c/r: introduce PF_RESTARTING, and skip notification on exit Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 28/80] c/r: support for zombie processes Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 29/80] c/r: Save and restore the [compat_]robust_list member of the task struct Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 30/80] c/r: infrastructure for shared objects Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 31/80] c/r: detect resource leaks for whole-container checkpoint Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 32/80] deferqueue: generic queue to defer work Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 33/80] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 34/80] c/r: dump open file descriptors Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 35/80] c/r: restore " Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 36/80] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 37/80] c/r: dump memory address space (private memory) Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 38/80] c/r: restore " Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 39/80] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 40/80] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 41/80] Add the checkpoint operation for opened files of generic filesystems Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 42/80] c/r: export shmem_getpage() to support shared memory Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 43/80] c/r: dump anonymous- and file-mapped- " Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 44/80] c/r: restore " Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 45/80] splice: export pipe/file-to-pipe/file functionality Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 46/80] c/r: support for open pipes Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 47/80] c/r: checkpoint and restore FIFOs Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 48/80] c/r: make ckpt_may_checkpoint_task() check each namespace individually Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 49/80] c/r: support for UTS namespace Oren Laadan
2009-09-23 23:51   ` Oren Laadan
     [not found]   ` <1253749920-18673-50-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-09-29 18:13     ` Nikita V. Youshchenko
2009-09-29 18:13   ` Nikita V. Youshchenko
2009-09-29 18:13     ` Nikita V. Youshchenko
     [not found]     ` <200909292213.21266-vlktAfmDCSZSq9BJjBFyUp/QNRX+jHPU@public.gmane.org>
2009-09-29 18:51       ` Serge E. Hallyn
2009-09-29 23:01       ` Oren Laadan
2009-09-29 18:51     ` Serge E. Hallyn
2009-09-29 18:51       ` Serge E. Hallyn
2009-09-29 23:01     ` Oren Laadan
2009-09-29 23:01       ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 50/80] c/r (ipc): allow allocation of a desired ipc identifier Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 51/80] c/r: save and restore sysvipc namespace basics Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 52/80] c/r: support share-memory sysv-ipc Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 53/80] c/r: support message-queues sysv-ipc Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 54/80] c/r: support semaphore sysv-ipc Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 55/80] c/r: (s390): expose a constant for the number of words (CRs) Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 56/80] c/r: add CKPT_COPY() macro Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 57/80] c/r: define s390-specific checkpoint-restart code Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 58/80] c/r: clone_with_pids: define the s390 syscall Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 59/80] c/r: capabilities: define checkpoint and restore fns Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 60/80] c/r: checkpoint and restore task credentials Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 61/80] c/r: restore file->f_cred Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 62/80] c/r: checkpoint and restore (shared) task's sighand_struct Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 63/80] c/r: [signal 1/4] blocked and template for shared signals Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 64/80] c/r: [signal 2/4] checkpoint/restart of rlimit Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 65/80] c/r: [signal 3/4] pending signals (private, shared) Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 66/80] c/r: [signal 4/4] support for real/virt/prof itimers Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 67/80] Expose may_setuid() in user.h and add may_setgid() (v2) Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 68/80] Add common socket helpers to unify the security hooks Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 69/80] c/r: introduce checkpoint/restore methods to struct proto_ops Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 70/80] c/r: Add AF_UNIX support (v12) Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 71/80] c/r: [pty 1/2] allow allocation of desired pty slave Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 72/80] c/r: [pty 2/2] support for pseudo terminals Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 73/80] c/r: correctly restore pgid Oren Laadan
2009-09-23 23:51   ` Oren Laadan
     [not found] ` <1253749920-18673-1-git-send-email-orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-09-23 23:50   ` [PATCH v18 01/80] c/r: extend arch_setup_additional_pages() Oren Laadan
2009-09-23 23:50   ` [PATCH v18 02/80] x86: ptrace debugreg checks rewrite Oren Laadan
2009-09-23 23:50   ` [PATCH v18 03/80] c/r: break out new_user_ns() Oren Laadan
2009-09-23 23:50   ` [PATCH v18 04/80] c/r: split core function out of some set*{u, g}id functions Oren Laadan
2009-09-23 23:50   ` [PATCH v18 05/80] cgroup freezer: Fix buggy resume test for tasks frozen with cgroup freezer Oren Laadan
2009-09-23 23:50   ` [PATCH v18 06/80] cgroup freezer: Update stale locking comments Oren Laadan
2009-09-23 23:50   ` [PATCH v18 07/80] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint Oren Laadan
2009-09-23 23:50   ` [PATCH v18 08/80] cgroup freezer: interface to freeze a cgroup from within the kernel Oren Laadan
2009-09-23 23:50   ` [PATCH v18 09/80] Namespaces submenu Oren Laadan
2009-09-23 23:50   ` [PATCH v18 10/80] c/r: make file_pos_read/write() public Oren Laadan
2009-09-23 23:50   ` [PATCH v18 11/80] pids 1/7: Factor out code to allocate pidmap page Oren Laadan
2009-09-23 23:50   ` [PATCH v18 12/80] pids 2/7: Have alloc_pidmap() return actual error code Oren Laadan
2009-09-23 23:50   ` [PATCH v18 13/80] pids 3/7: Add target_pid parameter to alloc_pidmap() Oren Laadan
2009-09-23 23:50   ` [PATCH v18 14/80] pids 4/7: Add target_pids parameter to alloc_pid() Oren Laadan
2009-09-23 23:50   ` [PATCH v18 15/80] pids 5/7: Add target_pids parameter to copy_process() Oren Laadan
2009-09-23 23:50   ` [PATCH v18 16/80] pids 6/7: Define do_fork_with_pids() Oren Laadan
2009-09-23 23:50   ` [PATCH v18 17/80] pids 7/7: Define clone_with_pids syscall Oren Laadan
2009-09-23 23:50   ` [PATCH v18 18/80] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan
2009-09-23 23:50   ` [PATCH v18 19/80] c/r: documentation Oren Laadan
2009-09-23 23:51   ` [PATCH v18 20/80] c/r: basic infrastructure for checkpoint/restart Oren Laadan
2009-09-23 23:51   ` [PATCH v18 21/80] c/r: x86_32 support " Oren Laadan
2009-09-23 23:51   ` [PATCH v18 22/80] c/r: external checkpoint of a task other than ourself Oren Laadan
2009-09-23 23:51   ` [PATCH v18 23/80] c/r: export functionality used in next patch for restart-blocks Oren Laadan
2009-09-23 23:51   ` [PATCH v18 24/80] c/r: restart-blocks Oren Laadan
2009-09-23 23:51   ` [PATCH v18 25/80] c/r: checkpoint multiple processes Oren Laadan
2009-09-23 23:51   ` [PATCH v18 26/80] c/r: restart " Oren Laadan
2009-09-23 23:51   ` [PATCH v18 27/80] c/r: introduce PF_RESTARTING, and skip notification on exit Oren Laadan
2009-09-23 23:51   ` [PATCH v18 28/80] c/r: support for zombie processes Oren Laadan
2009-09-23 23:51   ` [PATCH v18 29/80] c/r: Save and restore the [compat_]robust_list member of the task struct Oren Laadan
2009-09-23 23:51   ` [PATCH v18 30/80] c/r: infrastructure for shared objects Oren Laadan
2009-09-23 23:51   ` [PATCH v18 31/80] c/r: detect resource leaks for whole-container checkpoint Oren Laadan
2009-09-23 23:51   ` [PATCH v18 32/80] deferqueue: generic queue to defer work Oren Laadan
2009-09-23 23:51   ` [PATCH v18 33/80] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
2009-09-23 23:51   ` [PATCH v18 34/80] c/r: dump open file descriptors Oren Laadan
2009-09-23 23:51   ` [PATCH v18 35/80] c/r: restore " Oren Laadan
2009-09-23 23:51   ` [PATCH v18 36/80] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
2009-09-23 23:51   ` [PATCH v18 37/80] c/r: dump memory address space (private memory) Oren Laadan
2009-09-23 23:51   ` [PATCH v18 38/80] c/r: restore " Oren Laadan
2009-09-23 23:51   ` [PATCH v18 39/80] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
2009-09-23 23:51   ` [PATCH v18 40/80] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
2009-09-23 23:51   ` [PATCH v18 41/80] Add the checkpoint operation for opened files of generic filesystems Oren Laadan
2009-09-23 23:51   ` [PATCH v18 42/80] c/r: export shmem_getpage() to support shared memory Oren Laadan
2009-09-23 23:51   ` [PATCH v18 43/80] c/r: dump anonymous- and file-mapped- " Oren Laadan
2009-09-23 23:51   ` [PATCH v18 44/80] c/r: restore " Oren Laadan
2009-09-23 23:51   ` [PATCH v18 45/80] splice: export pipe/file-to-pipe/file functionality Oren Laadan
2009-09-23 23:51   ` [PATCH v18 46/80] c/r: support for open pipes Oren Laadan
2009-09-23 23:51   ` [PATCH v18 47/80] c/r: checkpoint and restore FIFOs Oren Laadan
2009-09-23 23:51   ` [PATCH v18 48/80] c/r: make ckpt_may_checkpoint_task() check each namespace individually Oren Laadan
2009-09-23 23:51   ` [PATCH v18 49/80] c/r: support for UTS namespace Oren Laadan
2009-09-23 23:51   ` [PATCH v18 50/80] c/r (ipc): allow allocation of a desired ipc identifier Oren Laadan
2009-09-23 23:51   ` [PATCH v18 51/80] c/r: save and restore sysvipc namespace basics Oren Laadan
2009-09-23 23:51   ` [PATCH v18 52/80] c/r: support share-memory sysv-ipc Oren Laadan
2009-09-23 23:51   ` [PATCH v18 53/80] c/r: support message-queues sysv-ipc Oren Laadan
2009-09-23 23:51   ` [PATCH v18 54/80] c/r: support semaphore sysv-ipc Oren Laadan
2009-09-23 23:51   ` [PATCH v18 55/80] c/r: (s390): expose a constant for the number of words (CRs) Oren Laadan
2009-09-23 23:51   ` [PATCH v18 56/80] c/r: add CKPT_COPY() macro Oren Laadan
2009-09-23 23:51   ` [PATCH v18 57/80] c/r: define s390-specific checkpoint-restart code Oren Laadan
2009-09-23 23:51   ` [PATCH v18 58/80] c/r: clone_with_pids: define the s390 syscall Oren Laadan
2009-09-23 23:51   ` [PATCH v18 59/80] c/r: capabilities: define checkpoint and restore fns Oren Laadan
2009-09-23 23:51   ` [PATCH v18 60/80] c/r: checkpoint and restore task credentials Oren Laadan
2009-09-23 23:51   ` [PATCH v18 61/80] c/r: restore file->f_cred Oren Laadan
2009-09-23 23:51   ` [PATCH v18 62/80] c/r: checkpoint and restore (shared) task's sighand_struct Oren Laadan
2009-09-23 23:51   ` [PATCH v18 63/80] c/r: [signal 1/4] blocked and template for shared signals Oren Laadan
2009-09-23 23:51   ` [PATCH v18 64/80] c/r: [signal 2/4] checkpoint/restart of rlimit Oren Laadan
2009-09-23 23:51   ` [PATCH v18 65/80] c/r: [signal 3/4] pending signals (private, shared) Oren Laadan
2009-09-23 23:51   ` [PATCH v18 66/80] c/r: [signal 4/4] support for real/virt/prof itimers Oren Laadan
2009-09-23 23:51   ` [PATCH v18 67/80] Expose may_setuid() in user.h and add may_setgid() (v2) Oren Laadan
2009-09-23 23:51   ` [PATCH v18 68/80] Add common socket helpers to unify the security hooks Oren Laadan
2009-09-23 23:51   ` [PATCH v18 69/80] c/r: introduce checkpoint/restore methods to struct proto_ops Oren Laadan
2009-09-23 23:51   ` [PATCH v18 70/80] c/r: Add AF_UNIX support (v12) Oren Laadan
2009-09-23 23:51   ` [PATCH v18 71/80] c/r: [pty 1/2] allow allocation of desired pty slave Oren Laadan
2009-09-23 23:51   ` [PATCH v18 72/80] c/r: [pty 2/2] support for pseudo terminals Oren Laadan
2009-09-23 23:51   ` [PATCH v18 73/80] c/r: correctly restore pgid Oren Laadan
2009-09-23 23:51   ` [PATCH v18 74/80] c/r: support for controlling terminal and job control Oren Laadan
2009-09-23 23:51   ` [PATCH v18 75/80] powerpc: reserve checkpoint arch identifiers Oren Laadan
2009-09-23 23:51   ` [PATCH v18 76/80] powerpc: provide APIs for validating and updating DABR Oren Laadan
2009-09-23 23:51   ` [PATCH v18 77/80] powerpc: checkpoint/restart implementation Oren Laadan
2009-09-23 23:51   ` [PATCH v18 78/80] powerpc: wire up checkpoint and restart syscalls Oren Laadan
2009-09-23 23:51   ` [PATCH v18 79/80] powerpc: enable checkpoint support in Kconfig Oren Laadan
2009-09-23 23:52   ` [PATCH v18 80/80] powerpc: clone_with_pids implementation Oren Laadan
2009-09-24 13:05   ` [PATCH 00/80] Kernel based checkpoint/restart [v18] Rishikesh
2009-09-24 22:41   ` Andrew Morton
2009-09-23 23:51 ` [PATCH v18 74/80] c/r: support for controlling terminal and job control Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 75/80] powerpc: reserve checkpoint arch identifiers Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 76/80] powerpc: provide APIs for validating and updating DABR Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 77/80] powerpc: checkpoint/restart implementation Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 78/80] powerpc: wire up checkpoint and restart syscalls Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:51 ` [PATCH v18 79/80] powerpc: enable checkpoint support in Kconfig Oren Laadan
2009-09-23 23:51   ` Oren Laadan
2009-09-23 23:52 ` [PATCH v18 80/80] powerpc: clone_with_pids implementation Oren Laadan
2009-09-23 23:52   ` Oren Laadan
2009-09-24 13:05 ` [PATCH 00/80] Kernel based checkpoint/restart [v18] Rishikesh
2009-09-24 13:05   ` Rishikesh
2009-09-24 16:27   ` Dan Smith
2009-09-24 16:27     ` Dan Smith
2009-09-24 18:20     ` Rishikesh
2009-09-24 18:20       ` Rishikesh
     [not found]     ` <878wg41f65.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2009-09-24 18:20       ` Rishikesh
2009-09-24 22:41 ` Andrew Morton
2009-09-24 22:41   ` Andrew Morton
     [not found]   ` <20090924154139.2a7dd5ec.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2009-09-25 13:59     ` Dan Smith
2009-09-28 16:37     ` Serge E. Hallyn
2009-09-29 18:31     ` Oren Laadan
2009-09-25 13:59   ` Dan Smith
2009-09-25 13:59     ` Dan Smith
     [not found]     ` <87ljk39lcl.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2009-09-26 14:56       ` Ingo Molnar
2009-09-26 14:56     ` Ingo Molnar
2009-09-26 14:56       ` Ingo Molnar
2009-09-28 16:37   ` Serge E. Hallyn
2009-09-28 16:37     ` Serge E. Hallyn
2009-09-29 13:29     ` Daniel Lezcano
2009-09-29 13:29       ` Daniel Lezcano
2009-09-30 14:39       ` Dan Smith
2009-09-30 14:39         ` Dan Smith
     [not found]         ` <87iqf0o5sf.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2009-09-30 16:16           ` Daniel Lezcano
2009-09-30 16:16         ` Daniel Lezcano
2009-09-30 16:16           ` Daniel Lezcano
     [not found]           ` <4AC38477.4070007-GANU6spQydw@public.gmane.org>
2009-09-30 16:29             ` Dan Smith
2009-09-30 16:29           ` Dan Smith
2009-09-30 16:29             ` Dan Smith
     [not found]             ` <87eipoo0po.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2009-09-30 18:01               ` Daniel Lezcano
2009-09-30 18:01             ` Daniel Lezcano
2009-09-30 18:01               ` Daniel Lezcano
     [not found]               ` <4AC39CE5.9080908-GANU6spQydw@public.gmane.org>
2009-09-30 18:28                 ` Dan Smith
2009-09-30 18:28               ` Dan Smith
2009-09-30 18:28                 ` Dan Smith
     [not found]                 ` <877hvgnv6z.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2009-09-30 22:08                   ` Daniel Lezcano
2009-09-30 22:08                 ` Daniel Lezcano
2009-09-30 22:08                   ` Daniel Lezcano
     [not found]                   ` <4AC3D6DC.9010500-GANU6spQydw@public.gmane.org>
2009-09-30 22:24                     ` Serge E. Hallyn
2009-09-30 22:42                     ` Serge E. Hallyn
     [not found]                       ` <20090930224249.GA22977-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-01  8:07                         ` Daniel Lezcano
     [not found]                           ` <4AC4634F.6070901-GANU6spQydw@public.gmane.org>
2009-10-01 14:55                             ` Serge E. Hallyn
     [not found]                               ` <20091001145512.GA17916-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-01 19:02                                 ` Oren Laadan
     [not found]       ` <4AC20BB8.4070509-GANU6spQydw@public.gmane.org>
2009-09-30 14:39         ` Dan Smith
     [not found]     ` <20090928163704.GA3327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-09-29 13:29       ` Daniel Lezcano
2009-09-29 18:31   ` Oren Laadan
2009-09-29 18:31     ` Oren Laadan
2009-09-23 23:50 Oren Laadan
