* [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20
@ 2010-03-19  0:59 Oren Laadan
  2010-03-19  0:59 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
                   ` (17 more replies)
  0 siblings, 18 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

Hi,

Following Andreas Dilger's reply (http://lkml.org/lkml/2010/3/17/410)
I'm (re)posting the subset of checkpoint-restart patch-set that is
related to linux-fsdevel. (I'm unsure why those weren't sent before).
Altogether there are 17 patches here (out of the 96 total).

For the original post/thread see: http://lkml.org/lkml/2010/3/17/232.

As Matt Helsley put briefly, checkpoint-restart mainly saves the
critical pieces of kernel information from the struct file needed to
restart the open file descriptors. It does not save the file (system)
contents in the checkpoint image. That's left for proper filesystem
freezing, snapshotting, or rsync (for example) depending on the tools
and/or filesystems userspace has chosen.
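
As a rough user-space analogy of the point above (a sketch only, not the
kernel code in this series): what gets recorded is the state of the open
descriptor, never the data behind it. The names fd_state, save_fd_state()
and restore_fd_state() are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record; this is NOT the checkpoint image format. */
struct fd_state {
	int flags;	/* access mode and status flags from F_GETFL */
	off_t pos;	/* current file offset */
};

/* Capture the descriptor's state; the file contents are not read. */
static int save_fd_state(int fd, struct fd_state *st)
{
	assert(st != NULL);
	st->flags = fcntl(fd, F_GETFL);
	if (st->flags < 0)
		return -1;
	st->pos = lseek(fd, 0, SEEK_CUR);
	return st->pos == (off_t)-1 ? -1 : 0;
}

/*
 * Reopen the same path and re-apply the saved state. The file itself
 * must already be in place (snapshot, rsync, ...): nothing here
 * recreates its contents.
 */
static int restore_fd_state(const char *path, const struct fd_state *st)
{
	int fd;

	assert(path != NULL && st != NULL);
	/* Mask creation-time flags so reopening never creates or truncates. */
	fd = open(path, st->flags & ~(O_CREAT | O_EXCL | O_TRUNC));
	if (fd < 0)
		return -1;
	if (lseek(fd, st->pos, SEEK_SET) == (off_t)-1) {
		close(fd);
		return -1;
	}
	return fd;
}
```

If the file is missing or different at restart time, the reopen fails or
reads different data, which is why the file system state has to be
handled separately.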

Oren.

---

Here is the introduction to the original post:

---

Following up on the thread on the checkpoint-restart patch set
(http://lkml.org/lkml/2010/3/1/422), the following series is the
latest checkpoint/restart, based on 2.6.33.

The first 20 patches are cleanups and preparation for c/r; they
are followed by the actual c/r code.

Please apply to -mm, and let us know if there is any way we can
help.

---

Linux Checkpoint-Restart:
 web, wiki:	http://www.linux-cr.org
 bug track:	https://www.linux-cr.org/redmine

The repositories for the project are in:
 kernel:	http://www.linux-cr.org/git/?p=linux-cr.git;a=summary
 user tools:	http://www.linux-cr.org/git/?p=user-cr.git;a=summary
 test suite:	http://www.linux-cr.org/git/?p=tests-cr.git;a=summary

---

CHANGELOG:

[2010-Mar-16] v20
 BUG FIXES (only)
  - [Serge Hallyn] Fix unlabeled restore case
  - [Serge Hallyn] Always restore msg_msg label
  - [Serge Hallyn] Selinux prevents msgrcv on restore message queues?
  - [Serge Hallyn] save_access_regs for self-checkpoint
  - [Serge Hallyn] send uses_interp=1 to arch_setup_additional_pages
  - Fix "scheduling in atomic" while restoring ipc (sem, shm, msg)
  - Cleanup: no need to restore perm->{id,key,seq}
  - Fix sysvipc=n compile
  - Make uts_ns=n compile
  - Only use arch_setup_additional_pages() if supported by arch
  - Export key symbols to enable c/r from kernel modules
  - Avoid crash if incoming object doesn't have .restore
  - Replace error_sem with an event completion
  - [Serge Hallyn] Change sysctl and default for unprivileged use
  - [Nathan Lynch] Use syscall_get_error
  - Add entry for checkpoint/restart in MAINTAINERS 

[2010-Feb-19] v19
 NEW FEATURES
  - Support for x86-64 architecture
  - Support for c/r of LSM (smack, selinux)
  - Support for c/r of task fs_root and pwd
  - Support for c/r of epoll
  - Support for c/r of eventfd
  - Enable C/R while executing over NFS
  - Preliminary c/r of mounts namespace
  - Add @logfd argument to sys_{checkpoint,restart} prototypes
  - Define new api for error and debug logging
  - Restart to handle checkpoint images lacking {uts,ipc}-ns
  - Refuse to checkpoint if monitoring directories with dnotify
  - Refuse to checkpoint if file locks and leases are held
  - Refuse to checkpoint files with f_owner 
 OTHER CHANGES
  - Rebase to kernel 2.6.33-rc8
  - Settled version of new sys_eclone()
  - [Serge Hallyn] Fix potential use-before-set return (vdso)
  - Update documentation and examples for new syscalls API (doc)
  - [Liu Alexander] Fix typos (doc)
  - [Serge Hallyn] Update checkpoint image format (doc)
  - [Serge Hallyn] Use ckpt_err() for bad header values
  - sys_{checkpoint,restart} to use ptregs prototype
  - Set ctx->errno in do_ckpt_msg() if needed
  - Fix up headers so we can munge them for use by userspace
  - Multiple fixes to _ckpt_write_err() and friends
  - [Matt Helsley] Add cpp definitions for enums
  - [Serge Hallyn] Add global section container to image format
  - [Matt Helsley] Fix total byte read/write count for large images
  - ckpt_read_buf_type() to accept max payload (excludes ckpt_hdr)
  - [Serge Hallyn] Use ckpt_err() for arch incompatibilities
  - Introduce walk_task_subtree() to iterate through descendants
  - Call restore_notify_error for restart (not checkpoint !)
  - Make kread/kwrite() abort if CKPT_CTX_ERROR is set
  - [Serge Hallyn] Move init_completion(&ctx->complete) to ctx_alloc
  - Simplify logic of tracking restarting tasks (->ctx)
  - Coordinator kills descendants on failure for proper cleanup
  - Prepare descendants needs PTRACE_MODE_ATTACH permissions
  - Threads wait for entire thread group before restoring
  - Add debug process-tree status during restart
  - Fix handling of bogus pid arg to sys_restart
  - In reparent_thread() test for PF_RESTARTING on parent
  - Keep __u32s in even groups for 32-64 bit compatibility
  - Define ckpt_obj_try_fetch
  - Disallow zero or negative objref during restart
  - Check for valid destructor before calling it (deferqueue)
  - Fix false negative of test for unlinked files at checkpoint
  - [Serge Hallyn] Rename fs_mnt to root_fs_path
  - Restore thread/cpu state early
  - Ensure null-termination of file names read from image
  - Fix compile warning in restore_open_fname()
  - Introduce FOLL_DIRTY to follow_page() for "dirty" pages
  - [Serge Hallyn] Checkpoint saved_auxv as u64s
  - Export filemap_checkpoint()
  - [Serge Hallyn] Disallow checkpoint of tasks with aio requests
  - Fix compilation failure when !CONFIG_CHECKPOINT (regression)
  - Expose page write functions
  - Do not hold mmap_sem while checkpointing vma's
  - Do not hold mmap_sem when reading memory pages on restart
  - Move consider_private_page() to mm/memory.c:__get_dirty_page()
  - [Serge Hallyn] move destroy_mm into mmap.c and remove size check
  - [Serge Hallyn] fill vdso (syscall32_setup_pages) for TIF_IA32/x86_64
  - [Serge Hallyn] Fix return value of read_pages_contents()
  - [Serge Hallyn] Change m_type to long, not int (ipc)
  - Don't free sma if it's an error on restore
  - Use task->saved_sigmask and drop task->checkpoint_data
  - [Serge Hallyn] Handle saved_sigmask at checkpoint
  - Defer restore of blocked signals mask during restart
  - Self-restart to tolerate missing PGIDs
  - [Serge Hallyn] skb->tail can be offset
  - Export and leverage sock_alloc_file()
  - [Nathan Lynch] Fix net/checkpoint.c for 64-bit
  - [Dan Smith] Unify skb read/write functions and handle fragmented buffers
  - [Dan Smith] Update buffer restore code to match the new format
  - [Dan Smith] Fix compile issue with CONFIG_CHECKPOINT=n
  - [Dan Smith] Remove an unnecessary check on socket restart
  - [Dan Smith] Pass the stored sock->protocol into sock_create() on restore
  - Relax tcp.window_clamp value in INET restore
  - Restore gso_type fields on sockets and buffers for proper operation
  - Fix broken compilation for no-c/r architectures
  - Return -EBUSY (not BUG_ON) if fd is gone on restart
  - Fix the chunk size instead of auto-tune (epoll) 
 ARCH: x86 (32,64)
  - Use PTREGSCALL4 for sys_{checkpoint,restart}
  - Remove debug-reg support (need to redo with perf_events)
  - [Serge Hallyn] Support for ia32 (checkpoint, restart)
  - Split arch/x86/checkpoint.c to generic and 32bit specific parts
  - sys_{checkpoint,restart} to use ptregs
  - Allow X86_EFLAGS_RF on restart
  - [Serge Hallyn] Only allow 'restart' with same bit-ness as image.
  - Move checkpoint.c from arch/x86/mm->arch/x86/kernel 
 ARCH: s390 [Serge Hallyn]
  - Define s390x sys_restart wrapper
  - Fixes to restart-blocks logic and signal path
  - Fix checkpoint and restart compat wrappers
  - sys_{checkpoint,restart} to use ptregs
  - Use simpler test_task_thread to test current ti flags
  - Fix 31-bit s390 checkpoint/restart wrappers
  - Update sys_checkpoint (do_sys_checkpoint on all archs)
  - [Oren Laadan] Move checkpoint.c from arch/s390/mm->arch/s390/kernel 
 ARCH: powerpc [Nathan Lynch]
  - [Serge Hallyn] Add hook task_has_saved_sigmask()
  - Warn if full register state unavailable
  - Fix up checkpoint syscall, tidy restart
  - [Oren Laadan] Move checkpoint.c from arch/powerpc/{mm->kernel} 

[2009-Sep-22] v18
 NEW FEATURES
  - [Nathan Lynch] Re-introduce powerpc support
  - Save/restore pseudo-terminals
  - Save/restore (pty) controlling terminals
  - Save/restore PGIDs
  - [Dan Smith] Save/restore unix domain sockets
  - Save/restore FIFOs
  - Save/restore pending signals
  - Save/restore rlimits
  - Save/restore itimers
  - [Matt Helsley] Handle many non-pseudo file-systems
 OTHER CHANGES
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - [Nathan Lynch] discard const from struct cred * where appropriate
  - [Serge Hallyn][s390] Set return value for self-checkpoint 
  - Handle kmalloc failure in restore_sem_array()
  - [IPC] Collect files used by shm objects
  - [IPC] Use file (not inode) as shared object on checkpoint of shm
  - More ckpt_write_err()s to give information on checkpoint failure
  - Adjust format of pipe buffer to include the mandatory pre-header
  - [LEAKS] Mark the backing file as visited at checkpoint
  - Tighten checks on supported vma to checkpoint or restart
  - [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
  - Introduce ckpt_collect_file() that also uses file->collect method
  - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
  - Fix leak-detection issue in collect_mm() (test for first-time obj)
  - Invoke set_close_on_exec() unconditionally on restart
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Interface to pass simple pointers as data with deferqueue
  - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
  - Replace EAGAIN with EBUSY where necessary
  - Introduce CKPT_OBJ_VISITED in leak detection
  - ckpt_obj_collect() returns objref for new objects, 0 otherwise
  - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
  - Introduce ckpt_obj_visit() to mark objects as visited
  - Set the CHECKPOINTED flag on objects before calling checkpoint
  - Introduce ckpt_obj_reserve()
  - Change ref_drop() to accept a @lastref argument (for cleanup)
  - Disallow multiple objects with same objref in restart
  - Allow _ckpt_read_obj_type() to read header only (w/o payload)
  - Fix leak of ckpt_ctx when restoring zombie tasks
  - Fix race of prepare_descendant() with an ongoing fork()
  - Track and report the first error if restart fails
  - Tighten logic to protect against bogus pids in input
  - [Matt Helsley] Improve debug output from ckpt_notify_error()
  - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
  - Detect error-headers in input data on restart, and abort.
  - Standard format for checkpoint error strings (and documentation)
  - [Dan Smith] Add an errno validation function
  - Add ckpt_read_payload(): read a variable-length object (no header)
  - Add ckpt_read_string(): same for strings (ensures null-terminated)
  - Add ckpt_read_consume(): consumes next object without processing
  - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile

[2009-Jul-21] v17
  - Introduce syscall clone_with_pids() to restore original pids
  - Support threads and zombies
  - Save/restore task->files
  - Save/restore task->sighand
  - Save/restore futex
  - Save/restore credentials
  - Introduce PF_RESTARTING to skip notifications on task exit
  - restart(2) allow caller to ask to freeze tasks after restart
  - restart(2) isn't idempotent: return -EINTR if interrupted
  - Improve debugging output handling 
  - Make multi-process restart logic more robust and complete
  - Correctly select return value for restarting tasks on success
  - Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
  - Use CHECKPOINTING state for frozen checkpointed tasks
  - Fix compilation without CONFIG_CHECKPOINT
  - Fix compilation with CONFIG_COMPAT
  - Fix headers includes and exports
  - Leak detection performed in two steps
  - Detect "inverse" leaks of objects (dis)appearing unexpectedly
  - Memory: save/restore mm->{flags,def_flags,saved_auxv}
  - Memory: only collect sub-objects of mm once (leak detection)
  - Files: validate f_mode after restore
  - Namespaces: leak detection for nsproxy sub-components
  - Namespaces: proper restart from namespace(s) without namespace(s)
  - Save global constants in header instead of per-object
  - IPC: replace sys_unshare() with create_ipc_ns()
  - IPC: restore objects in suitable namespace
  - IPC: correct behavior under !CONFIG_IPC_NS
  - UTS: save/restore all fields
  - UTS: replace sys_unshare() with create_uts_ns()
  - X86_32: sanitize cpu, debug, and segment registers on restart
  - cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
  - cgroup_freezer: add interface to freeze a cgroup (given a task)

[2009-May-27] v16
  - Privilege checks for IPC checkpoint
  - Fix error string generation during checkpoint
  - Use kzalloc for header allocation
  - Restart blocks are arch-independent
  - Redo pipe c/r using splice
  - Fixes to s390 arch
  - Remove powerpc arch (temporary)
  - Explicitly restore ->nsproxy
  - All objects in image are preceded by 'struct ckpt_hdr'
  - Fix leaks detection (and leaks)
  - Reorder of patchset
  - Misc bugs and compilation fixes

[2009-Apr-12] v15
  - Minor fixes

[2009-Apr-28] v14
  - Tested against kernel v2.6.30-rc3 on x86_32.
  - Refactor files checkpoint to use f_ops (file operations)
  - Refactor mm/vma to use vma_ops
  - Explicitly handle VDSO vma (and require compat mode)
  - Added code to c/r restart-blocks (restart timeout related syscalls)
  - Added code to c/r namespaces: uts, ipc (with Dan Smith)
  - Added code to c/r sysvipc (shm, msg, sem)
  - Support for VM_CLONE shared memory
  - Added resource leak detection for whole-container checkpoint
  - Added sysctl gauge to allow unprivileged restart/checkpoint
  - Improve and simplify the code and logic of shared objects
  - Rework image format: shared objects appear prior to their use
  - Merge checkpoint and restart functionality into same files
  - Massive renaming of functions: prefix "ckpt_" for generics,
    "checkpoint_" for checkpoint, and "restore_" for restart.
  - Report checkpoint errors as a valid (string record) in the output
  - Merged PPC architecture (by Nathan Lynch)
  - Requires updates to userspace tools too.
  - Misc nits and bug fixes

[2009-Mar-31] v14-rc2
  - Change along Dave's suggestion to use f_ops->checkpoint() for files
  - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
  - Merge support for PPC arch (Nathan Lynch)
  - Misc cleanups and fixes in response to comments

[2009-Mar-20] v14-rc1:
  - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
  - Check whether calls to cr_hbuf_get() succeed or fail.
  - Fixes to pipe c/r code
  - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
  - Refuse non-self checkpoint if a task isn't frozen
  - Use unsigned fields in checkpoint headers unless otherwise required
  - Rename functions in files c/r to better reflect their role
  - Add support for anonymous shared memory
  - Merge support for s390 arch (Dan Smith, Serge Hallyn)
    
[2008-Dec-03] v13:
  - Cleanups of 'struct cr_ctx' - remove unused fields
  - Misc fixes for comments
  
[2008-Dec-17] v12:
  - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
    (empty pgarr are saved in a separate pool chain)
  - Add a couple of missed calls to cr_hbuf_put()
  - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
  - Split cr_write/cr_read() to two parts: _cr_write/read() helper
  - Befriend with sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen,
    life span of the object hash, references to objects in the hash
 
[2008-Nov-26] v10:
  - Grab vfs root of container init, rather than current process
  - Acquire dcache_lock around call to __d_path() in cr_fill_name()
  - Force end-of-string in cr_read_string() (fix possible DoS)
  - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()

[2008-Nov-10] v9:
  - Support multiple processes c/r
  - Extend checkpoint header with architecture-dependent header
  - Misc bug fixes (see individual changelogs)
  - Rebase to v2.6.28-rc3.

[2008-Oct-29] v8:
  - Support "external" checkpoint
  - Include Dave Hansen's 'deny-checkpoint' patch
  - Split docs in Documentation/checkpoint/..., and improve contents

[2008-Oct-17] v7:
  - Fix save/restore state of FPU
  - Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
  - Add assumptions and what's-missing to documentation
  - Misc fixes and cleanups

[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename headers files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use standard list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.


* [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
                     ` (15 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: containers, Andreas Dilger

These two are used in the next patch when calling vfs_read/write()
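
For readers unfamiliar with the calling pattern these helpers support,
here is a self-contained user-space sketch. struct mock_file and
mock_vfs_read() are stand-ins invented for the illustration (the real
vfs_read() takes a struct file and a loff_t pointer); only the
read-position/call/write-position sequence is the point.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Stand-in for struct file: only the fields this sketch needs. */
struct mock_file {
	off_t f_pos;
	const char *data;
	size_t size;
};

static inline off_t file_pos_read(struct mock_file *file)
{
	return file->f_pos;
}

static inline void file_pos_write(struct mock_file *file, off_t pos)
{
	file->f_pos = pos;
}

/* Stand-in for vfs_read(): advances *pos, never touches f_pos itself. */
static ssize_t mock_vfs_read(struct mock_file *file, char *buf,
			     size_t count, off_t *pos)
{
	size_t left;

	assert(file && buf && pos);
	if (*pos < 0 || (size_t)*pos >= file->size)
		return 0;	/* at or past end of data */
	left = file->size - (size_t)*pos;
	if (count > left)
		count = left;
	memcpy(buf, file->data + *pos, count);
	*pos += (off_t)count;
	return (ssize_t)count;
}

/* Fetch the position, let the read advance a local copy, store it back. */
static ssize_t read_at_current_pos(struct mock_file *file, char *buf,
				   size_t count)
{
	off_t pos = file_pos_read(file);
	ssize_t ret = mock_vfs_read(file, buf, count, &pos);

	file_pos_write(file, pos);
	return ret;
}
```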

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
---
 fs/read_write.c    |   10 ----------
 include/linux/fs.h |   10 ++++++++++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index b7f4a1f..e258301 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 
 EXPORT_SYMBOL(vfs_write);
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
-	file->f_pos = pos;
-}
-
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
 	struct file *file;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ebb1cd5..6c08df2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1543,6 +1543,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 				struct iovec *fast_pointer,
 				struct iovec **ret_pointer);
 
+static inline loff_t file_pos_read(struct file *file)
+{
+	return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+	file->f_pos = pos;
+}
+
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-- 
1.6.3.3


* [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect()
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2010-03-19  0:59   ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
                     ` (14 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: containers, Andreas Dilger

While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.

This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As you can see in the ext[234] patches down the road, all that we have
to do to make something simple be supported is add a single "generic"
f_op entry.
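
A minimal user-space sketch of how such a hook could be dispatched
follows. The struct layouts are reduced mock-ups, checkpoint_file() and
the -EBADF return for a missing hook are assumptions made for the
illustration, and generic_file_checkpoint is given a trivial body here
even though this patch only defines it as NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ckpt_ctx { int files_saved; };	/* hypothetical minimal context */
struct file;

struct file_operations {
	int (*checkpoint)(struct ckpt_ctx *, struct file *);
};

struct file { const struct file_operations *f_op; };

/* Illustrative body only; in this patch the name is a NULL placeholder. */
static int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
{
	(void)file;
	ctx->files_saved++;
	return 0;
}

/* A file type opts in with a single f_op entry ... */
static const struct file_operations simple_fops = {
	.checkpoint = generic_file_checkpoint,
};

/* ... and one without the entry is treated as uncheckpointable. */
static const struct file_operations opaque_fops = {
	.checkpoint = NULL,
};

/* Hypothetical dispatcher: refuse files whose f_op lacks the hook. */
static int checkpoint_file(struct ckpt_ctx *ctx, struct file *file)
{
	assert(ctx && file && file->f_op);
	if (!file->f_op->checkpoint)
		return -EBADF;	/* assumed error code */
	return file->f_op->checkpoint(ctx, file);
}
```

Keeping the decision in the f_op table means each file type declares its
own support, rather than the checkpoint code carrying a list that can go
stale.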

Also adds a new 'file_operations' function for 'collecting' a file for
leak-detection during full-container checkpoint. This is useful for
those files that hold references to other "collectable" objects. Two
examples are pty files that point to corresponding tty objects, and
eventpoll files that refer to the files they are monitoring.
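
One way to picture the leak check is as a comparison between an object's
reference count and the number of references found while collecting. The
sketch below is a user-space model under that assumption; struct obj,
collect_file() and is_leaked() are hypothetical names, not the interfaces
added by this series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared object: real refcount vs. references collected. */
struct obj {
	int refcount;
	int collected;
};

struct mock_file {
	struct obj self;
	struct obj *peer;	/* e.g. the tty behind a pty file; may be NULL */
};

static void collect_ref(struct obj *o)
{
	assert(o != NULL);
	o->collected++;
}

/* Plays the role of a ->collect hook: also count what the file points to. */
static void collect_file(struct mock_file *f)
{
	collect_ref(&f->self);
	if (f->peer)
		collect_ref(f->peer);
}

/* A reference the collection pass never saw is held from somewhere else. */
static int is_leaked(const struct obj *o)
{
	return o->refcount > o->collected;
}
```

Without the hook, the object behind the file would never be counted, and
the check could not tell a reference held inside the container from one
held outside it.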

Finally, this patch introduces vfs_fcntl() so that it can be called
from restart (see patch adding restart of files).

Changelog[v17]
  - Introduce 'collect' method
Changelog[v17]
  - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 fs/fcntl.c         |   21 +++++++++++++--------
 include/linux/fs.h |    7 +++++++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 97e01dc..e1f02ca 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -418,6 +418,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	return err;
 }
 
+int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
+{
+	int err;
+
+	err = security_file_fcntl(filp, cmd, arg);
+	if (err)
+		goto out;
+	err = do_fcntl(fd, cmd, arg, filp);
+ out:
+	return err;
+}
+
 SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 {	
 	struct file *filp;
@@ -427,14 +439,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 	if (!filp)
 		goto out;
 
-	err = security_file_fcntl(filp, cmd, arg);
-	if (err) {
-		fput(filp);
-		return err;
-	}
-
-	err = do_fcntl(fd, cmd, arg, filp);
-
+	err = vfs_fcntl(fd, cmd, arg, filp);
  	fput(filp);
 out:
 	return err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6c08df2..65ebec5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -394,6 +394,7 @@ struct kstatfs;
 struct vm_area_struct;
 struct vfsmount;
 struct cred;
+struct ckpt_ctx;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1093,6 +1094,8 @@ struct file_lock {
 
 #include <linux/fcntl.h>
 
+extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
+
 extern void send_sigio(struct fown_struct *fown, int fd, int band);
 
 #ifdef CONFIG_FILE_LOCKING
@@ -1504,6 +1507,8 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **);
+	int (*checkpoint)(struct ckpt_ctx *, struct file *);
+	int (*collect)(struct ckpt_ctx *, struct file *);
 };
 
 struct inode_operations {
@@ -2313,6 +2318,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#define generic_file_checkpoint NULL
+
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
 extern int vfs_stat(char __user *, struct kstat *);
-- 
1.6.3.3


* [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect()
  2010-03-19  0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
  2010-03-19  0:59 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
@ 2010-03-19  0:59 ` Oren Laadan
  2010-03-22  6:34   ` Nick Piggin
       [not found]   ` <1268960401-16680-3-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2010-03-19  0:59 ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.

This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As you can see in the ext[234] patches down the road, all that we have
to do to make something simple be supported is add a single "generic"
f_op entry.

Also adds a new 'file_operations' function for 'collecting' a file for
leak-detection during full-container checkpoint. This is useful for
those files that hold references to other "collectable" objects. Two
examples are pty files that point to corresponding tty objects, and
eventpoll files that refer to the files they are monitoring.

Finally, this patch introduces vfs_fcntl() so that it can be called
from restart (see patch adding restart of files).

Changelog[v17]
  - Introduce 'collect' method
Changelog[v17]
  - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 fs/fcntl.c         |   21 +++++++++++++--------
 include/linux/fs.h |    7 +++++++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 97e01dc..e1f02ca 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -418,6 +418,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	return err;
 }
 
+int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
+{
+	int err;
+
+	err = security_file_fcntl(filp, cmd, arg);
+	if (err)
+		goto out;
+	err = do_fcntl(fd, cmd, arg, filp);
+ out:
+	return err;
+}
+
 SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 {	
 	struct file *filp;
@@ -427,14 +439,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 	if (!filp)
 		goto out;
 
-	err = security_file_fcntl(filp, cmd, arg);
-	if (err) {
-		fput(filp);
-		return err;
-	}
-
-	err = do_fcntl(fd, cmd, arg, filp);
-
+	err = vfs_fcntl(fd, cmd, arg, filp);
  	fput(filp);
 out:
 	return err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6c08df2..65ebec5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -394,6 +394,7 @@ struct kstatfs;
 struct vm_area_struct;
 struct vfsmount;
 struct cred;
+struct ckpt_ctx;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1093,6 +1094,8 @@ struct file_lock {
 
 #include <linux/fcntl.h>
 
+extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
+
 extern void send_sigio(struct fown_struct *fown, int fd, int band);
 
 #ifdef CONFIG_FILE_LOCKING
@@ -1504,6 +1507,8 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **);
+	int (*checkpoint)(struct ckpt_ctx *, struct file *);
+	int (*collect)(struct ckpt_ctx *, struct file *);
 };
 
 struct inode_operations {
@@ -2313,6 +2318,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#define generic_file_checkpoint NULL
+
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
 extern int vfs_stat(char __user *, struct kstat *);
-- 
1.6.3.3



* [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2010-03-19  0:59   ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 39/96] c/r: restore " Oren Laadan
                     ` (13 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).
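
The objref idea can be illustrated outside the kernel. The following is
only a userspace sketch, not the objhash code from this series: the
names obj_table and obj_lookup_add() are hypothetical, and a linear
scan over a small fixed array stands in for the real hash. It shows the
one property that matters here: a shared pointer gets a single objref,
and only the first sighting asks the caller to dump the object's state.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_TABLE_SIZE 64	/* hypothetical fixed capacity */

struct obj_table {
	const void *ptr[OBJ_TABLE_SIZE];	/* registered object pointers */
	int count;				/* number of registered objects */
};

/*
 * Return the (1-based) objref for @ptr, registering it if needed.
 * *first is set to 1 when the object is seen for the first time,
 * i.e. when its state still has to be written to the image, and
 * to 0 when only the reference needs to be recorded.
 * Returns -1 if the table is full.
 */
static int obj_lookup_add(struct obj_table *t, const void *ptr, int *first)
{
	int i;

	assert(t->count >= 0 && t->count <= OBJ_TABLE_SIZE);

	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (t->count == OBJ_TABLE_SIZE)
		return -1;

	t->ptr[t->count++] = ptr;
	*first = 1;
	return t->count;
}
```

Two descriptors that refer to the same open file (after dup() or
fork(), say) would then carry the same objref in their descriptor
records, while the file's state appears only once in the image.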

Also provide generic_file_checkpoint() and generic_restore_file(),
which are suitable for regular files and directories. They do not yet
support unlinked files or directories.
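
As a rough userspace analogy of what is recorded for a generic file
(flags, position, and the pathname, but not the contents), the sketch
below saves the state of a descriptor and reapplies it to a freshly
opened one. It is an illustration under stated assumptions only:
struct fd_state, fd_state_save() and fd_state_restore() are
hypothetical names, and the kernel code in this patch reads the fields
of 'struct file' directly rather than going through fcntl()/lseek().

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

struct fd_state {
	int flags;	/* file status flags, as returned by F_GETFL */
	int cloexec;	/* non-zero if FD_CLOEXEC is set */
	off_t pos;	/* current file offset */
};

/* Record flags, close-on-exec and offset of @fd; 0 on success, -1 on error. */
static int fd_state_save(int fd, struct fd_state *st)
{
	int fl = fcntl(fd, F_GETFL);
	int fdfl = fcntl(fd, F_GETFD);
	off_t pos = lseek(fd, 0, SEEK_CUR);

	if (fl < 0 || fdfl < 0 || pos == (off_t)-1)
		return -1;
	st->flags = fl;
	st->cloexec = !!(fdfl & FD_CLOEXEC);
	st->pos = pos;
	return 0;
}

/* Reopen @path and reapply the saved state; returns the new fd or -1. */
static int fd_state_restore(const char *path, const struct fd_state *st)
{
	int fd;

	assert(path && st);

	/* creation flags make no sense on reopen; mask them defensively */
	fd = open(path, st->flags & ~(O_CREAT | O_EXCL | O_TRUNC));
	if (fd < 0)
		return -1;
	if (st->cloexec && fcntl(fd, F_SETFD, FD_CLOEXEC) < 0)
		goto fail;
	if (lseek(fd, st->pos, SEEK_SET) == (off_t)-1)
		goto fail;
	return fd;
fail:
	close(fd);
	return -1;
}
```

Like the generic methods above, this only works while the pathname
still resolves to the same file, which is why unlinked files need
separate handling.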

Changelog[v19]:
  - Fix false negative of test for unlinked files at checkpoint
Changelog[v19-rc3]:
  - [Serge Hallyn] Rename fs_mnt to root_fs_path
  - [Dave Hansen] Error out on file locks and leases
  - [Serge Hallyn] Refuse checkpoint of file with f_owner
Changelog[v19-rc1]:
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
  - Add a few more ckpt_write_err()s
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Introduce ckpt_collect_file() that also uses file->collect method
  - In collect_file_table() use retval from ckpt_obj_collect() to
    test for first-time-object
Changelog[v17]:
  - Only collect sub-objects of files_struct once
  - Better file error debugging
  - Use (new) d_unlinked()
Changelog[v16]:
  - Fix compile warning in checkpoint_bad()
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  ckpt_write_files() => checkpoint_fd_table()
  - Rename:  ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   11 +
 checkpoint/files.c               |  444 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   52 +++++
 checkpoint/process.c             |   33 +++-
 checkpoint/sys.c                 |    8 +
 fs/locks.c                       |   35 +++
 include/linux/checkpoint.h       |   19 ++
 include/linux/checkpoint_hdr.h   |   59 +++++
 include/linux/checkpoint_types.h |    5 +
 include/linux/fs.h               |   10 +
 11 files changed, 677 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/files.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	objhash.o \
 	checkpoint.o \
 	restart.o \
-	process.o
+	process.o \
+	files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c016a2d..2bc2495 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -18,6 +18,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fs_struct.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -490,6 +491,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
+	struct fs_struct *fs;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
+	/* root vfs (FIX: WILL CHANGE with mnt-ns etc.) */
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
+	read_lock(&fs->lock);
+	ctx->root_fs_path = fs->root;
+	path_get(&ctx->root_fs_path);
+	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
+
 	return 0;
 }
 
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..7a57b24
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,444 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * ckpt_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *len);
+	spin_unlock(&dcache_lock);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = ckpt_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		ret = ckpt_write_obj_type(ctx, fname, flen,
+					  CKPT_HDR_FILE_NAME);
+	} else {
+		ret = PTR_ERR(fname);
+		ckpt_err(ctx, ret, "%(T)%(S)Obtain filename\n",
+			 path->dentry->d_name.name);
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	ckpt_debug("file %s credref %d", file->f_dentry->d_name.name,
+		h->f_credref);
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	if (d_unlinked(file->f_dentry)) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
+			 file);
+		return -EBADF;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+	int ret;
+
+	if (!file->f_op || !file->f_op->checkpoint) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
+			       file, file->f_op);
+		return -EBADF;
+	}
+
+	ret = file->f_op->checkpoint(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
+	return ret;
+}
+
+/**
+ * checkpoint_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls checkpoint_file() to dump the file pointer too.
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+	pid_t pid;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	ret = -EBADF;
+	if (!file) {
+		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
+		goto out;
+	}
+
+	/* only now is it safe to dereference 'file' in the lock lookup */
+	ret = find_locks_with_owner(file, files);
+	/*
+	 * find_locks_with_owner() returns an error when there
+	 * are no locks found, so we *want* it to return an error
+	 * code.  Its success means we have to fail the checkpoint.
+	 */
+	if (!ret) {
+		ret = -EBADF;
+		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
+		goto out;
+	}
+
+	/*
+	 * TODO: Implement c/r of fowner and f_sigio.  Should be
+	 * trivial, but for now we just refuse its checkpoint
+	 */
+	pid = f_getown(file);
+	if (pid) {
+		ret = -EBUSY;
+		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+out:
+	ckpt_hdr_put(ctx, h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+static int do_checkpoint_file_table(struct ckpt_ctx *ctx,
+				    struct files_struct *files)
+{
+	struct ckpt_hdr_file_table *h;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (!h)
+		return -ENOMEM;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ckpt_hdr_put(ctx, h);	/* balance ckpt_hdr_get_type() */
+		return nfds;		/* fdtable is still NULL here */
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	kfree(fdtable);
+	return ret;
+}
+
+/* checkpoint callback for file table */
+int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_file_table(ctx, (struct files_struct *) ptr);
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int objref;
+
+	files = get_files_struct(t);
+	if (!files)
+		return -EBUSY;
+	objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+	put_files_struct(files);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file)
+{
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+	if (ret <= 0)
+		return ret;
+	/* if first time for this file (ret > 0), invoke ->collect() */
+	if (file->f_op->collect)
+		ret = file->f_op->collect(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)File collect\n", file);
+	return ret;
+}
+
+static int collect_file_desc(struct ckpt_ctx *ctx,
+			     struct files_struct *files, int fd)
+{
+	struct fdtable *fdt;
+	struct file *file;
+	int ret;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file)
+		get_file(file);
+	rcu_read_unlock();
+
+	if (!file) {
+		ckpt_err(ctx, -EBUSY, "%(T)%(P)File removed\n", file);
+		return -EBUSY;
+	}
+
+	ret = ckpt_collect_file(ctx, file);
+	fput(file);
+
+	return ret;
+}
+
+static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files)
+{
+	int *fdtable;
+	int nfds, n;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this file table (ret > 0), proceed inside */
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+
+	for (n = 0; n < nfds; n++) {
+		ret = collect_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+	kfree(fdtable);
+	return ret;
+}
+
+int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int ret;
+
+	files = get_files_struct(t);
+	if (!files) {
+		ckpt_err(ctx, -EBUSY, "%(T)files_struct missing\n");
+		return -EBUSY;
+	}
+	ret = collect_file_table(ctx, files);
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 22b1601..f25d130 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,8 @@
 
 #include <linux/kernel.h>
 #include <linux/hash.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_file_table_grab(void *ptr)
+{
+	atomic_inc(&((struct files_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_file_table_drop(void *ptr, int lastref)
+{
+	put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_table_users(void *ptr)
+{
+	return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr, int lastref)
+{
+	fput((struct file *) ptr);
+}
+
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* files_struct object */
+	{
+		.obj_name = "FILE_TABLE",
+		.obj_type = CKPT_OBJ_FILE_TABLE,
+		.ref_drop = obj_file_table_drop,
+		.ref_grab = obj_file_table_grab,
+		.ref_users = obj_file_table_users,
+		.checkpoint = checkpoint_file_table,
+	},
+	/* file object */
+	{
+		.obj_name = "FILE",
+		.obj_type = CKPT_OBJ_FILE,
+		.ref_drop = obj_file_drop,
+		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
+		.checkpoint = checkpoint_file,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index ef394a5..adc34a2 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int files_objref;
+	int ret;
+
+	files_objref = checkpoint_obj_file_table(ctx, t);
+	ckpt_debug("files: objref %d\n", files_objref);
+	if (files_objref < 0) {
+		ckpt_err(ctx, files_objref, "%(T)files_struct\n");
+		return files_objref;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+	h->files_objref = files_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 /* dump the task_struct of a given task */
 int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -240,6 +263,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs %d\n", ret);
  out:
 	ctx->tsk = NULL;
 	return ret;
@@ -247,7 +274,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
-	return 0;
+	int ret;
+
+	ret = ckpt_collect_file_table(ctx, t);
+
+	return ret;
 }
 
 /***********************************************************************
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 926c937..30b8004 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -206,12 +206,16 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->kflags & CKPT_CTX_RESTART)
 		restore_debug_free(ctx);
 
+	if (ctx->files_deferq)
+		deferqueue_destroy(ctx->files_deferq);
+
 	if (ctx->file)
 		fput(ctx->file);
 	if (ctx->logfile)
 		fput(ctx->logfile);
 
 	ckpt_obj_hash_free(ctx);
+	path_put(&ctx->root_fs_path);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -270,6 +274,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
+	ctx->files_deferq = deferqueue_create();
+	if (!ctx->files_deferq)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/fs/locks.c b/fs/locks.c
index a8794f2..721481a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1994,6 +1994,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner)
 
 EXPORT_SYMBOL(locks_remove_posix);
 
+int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct file_lock **inode_fl;
+	int ret = -EEXIST;
+
+	lock_kernel();
+	for_each_lock(inode, inode_fl) {
+		struct file_lock *fl = *inode_fl;
+		/*
+		 * We could use posix_same_owner() along with a 'fake'
+		 * file_lock.  But, the fake file will never have the
+		 * same fl_lmops as the fl that we are looking for and
+		 * posix_same_owner() would just fall back to this
+		 * check anyway.
+		 */
+		if (IS_POSIX(fl)) {
+			if (fl->fl_owner == owner) {
+				ret = 0;
+				break;
+			}
+		} else if (IS_FLOCK(fl) || IS_LEASE(fl)) {
+			if (fl->fl_file == filp) {
+				ret = 0;
+				break;
+			}
+		} else {
+			WARN(1, "unknown file lock type, fl_flags: %x",
+				fl->fl_flags);
+		}
+	}
+	unlock_kernel();
+	return ret;
+}
+
 /*
  * This function is called on the last close of an open file.
  */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 50ce8f9..d74a890 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -80,6 +80,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
@@ -156,6 +159,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -166,6 +184,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
+#define CKPT_DFILE	0x10		/* files and filesystem */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cdca9e4..3222545 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -71,6 +71,8 @@ enum {
 #define CKPT_HDR_TREE CKPT_HDR_TREE
 	CKPT_HDR_TASK,
 #define CKPT_HDR_TASK CKPT_HDR_TASK
+	CKPT_HDR_TASK_OBJS,
+#define CKPT_HDR_TASK_OBJS CKPT_HDR_TASK_OBJS
 	CKPT_HDR_RESTART_BLOCK,
 #define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK
 	CKPT_HDR_THREAD,
@@ -80,6 +82,15 @@ enum {
 
 	/* 201-299: reserved for arch-dependent */
 
+	CKPT_HDR_FILE_TABLE = 301,
+#define CKPT_HDR_FILE_TABLE CKPT_HDR_FILE_TABLE
+	CKPT_HDR_FILE_DESC,
+#define CKPT_HDR_FILE_DESC CKPT_HDR_FILE_DESC
+	CKPT_HDR_FILE_NAME,
+#define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME
+	CKPT_HDR_FILE,
+#define CKPT_HDR_FILE CKPT_HDR_FILE
+
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
 
@@ -106,6 +117,10 @@ struct ckpt_hdr_objref {
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 #define CKPT_OBJ_IGNORE CKPT_OBJ_IGNORE
+	CKPT_OBJ_FILE_TABLE,
+#define CKPT_OBJ_FILE_TABLE CKPT_OBJ_FILE_TABLE
+	CKPT_OBJ_FILE,
+#define CKPT_OBJ_FILE CKPT_OBJ_FILE
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -188,6 +203,12 @@ struct ckpt_hdr_task {
 	__u64 clear_child_tid;
 } __attribute__((aligned(8)));
 
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+} __attribute__((aligned(8)));
+
 /* restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
@@ -220,4 +241,42 @@ enum restart_block_type {
 #define CKPT_RESTART_BLOCK_FUTEX CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+#define CKPT_FILE_IGNORE CKPT_FILE_IGNORE
+	CKPT_FILE_GENERIC,
+#define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
+	CKPT_FILE_MAX
+#define CKPT_FILE_MAX CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 90bbb16..aae6755 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -14,6 +14,8 @@
 
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/list.h>
+#include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
 #include <linux/wait.h>
@@ -40,6 +42,9 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *files_deferq;	/* deferred file-table work */
+
+	struct path root_fs_path;     /* container root (FIXME) */
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 65ebec5..7902a51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1120,6 +1120,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_flock(struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
+extern int find_locks_with_owner(struct file *filp, fl_owner_t owner);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
 extern int posix_lock_file_wait(struct file *, struct file_lock *);
 extern int posix_unblock_lock(struct file *, struct file_lock *);
@@ -1188,6 +1189,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
 	return;
 }
 
+static inline int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	return -ENOENT;
+}
+
 static inline void locks_remove_flock(struct file *filp)
 {
 	return;
@@ -2318,7 +2324,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
 #define generic_file_checkpoint NULL
+#endif
 
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
-- 
1.6.3.3


* [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-19  0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
  2010-03-19  0:59 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
  2010-03-19  0:59 ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
@ 2010-03-19  0:59 ` Oren Laadan
  2010-03-19 23:19   ` Andreas Dilger
       [not found]   ` <1268960401-16680-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2010-03-19  0:59 ` [C/R v20][PATCH 39/96] c/r: restore " Oren Laadan
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).

Also provide generic_file_checkpoint() and generic_restore_file(),
which are suitable for regular files and directories. They do not yet
support unlinked files or directories.

Changelog[v19]:
  - Fix false negative of test for unlinked files at checkpoint
Changelog[v19-rc3]:
  - [Serge Hallyn] Rename fs_mnt to root_fs_path
  - [Dave Hansen] Error out on file locks and leases
  - [Serge Hallyn] Refuse checkpoint of file with f_owner
Changelog[v19-rc1]:
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
  - Add a few more ckpt_write_err()s
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Introduce ckpt_collect_file() that also uses file->collect method
  - In collect_file_table() use retval from ckpt_obj_collect() to
    test for first-time-object
Changelog[v17]:
  - Only collect sub-objects of files_struct once
  - Better file error debugging
  - Use (new) d_unlinked()
Changelog[v16]:
  - Fix compile warning in checkpoint_bad()
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  ckpt_write_files() => checkpoint_fd_table()
  - Rename:  ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   11 +
 checkpoint/files.c               |  444 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   52 +++++
 checkpoint/process.c             |   33 +++-
 checkpoint/sys.c                 |    8 +
 fs/locks.c                       |   35 +++
 include/linux/checkpoint.h       |   19 ++
 include/linux/checkpoint_hdr.h   |   59 +++++
 include/linux/checkpoint_types.h |    5 +
 include/linux/fs.h               |   10 +
 11 files changed, 677 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/files.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	objhash.o \
 	checkpoint.o \
 	restart.o \
-	process.o
+	process.o \
+	files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c016a2d..2bc2495 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -18,6 +18,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fs_struct.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -490,6 +491,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
+	struct fs_struct *fs;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
+	/* root vfs (FIX: WILL CHANGE with mnt-ns etc) */
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
+	read_lock(&fs->lock);
+	ctx->root_fs_path = fs->root;
+	path_get(&ctx->root_fs_path);
+	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
+
 	return 0;
 }
 
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..7a57b24
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,444 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * ckpt_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *len);
+	spin_unlock(&dcache_lock);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = ckpt_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		ret = ckpt_write_obj_type(ctx, fname, flen,
+					  CKPT_HDR_FILE_NAME);
+	} else {
+		ret = PTR_ERR(fname);
+		ckpt_err(ctx, ret, "%(T)%(S)Obtain filename\n",
+			 path->dentry->d_name.name);
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	ckpt_debug("file %s credref %d", file->f_dentry->d_name.name,
+		h->f_credref);
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	if (d_unlinked(file->f_dentry)) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
+			 file);
+		return -EBADF;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+	int ret;
+
+	if (!file->f_op || !file->f_op->checkpoint) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
+			       file, file->f_op);
+		return -EBADF;
+	}
+
+	ret = file->f_op->checkpoint(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
+	return ret;
+}
+
+/**
+ * checkpoint_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls checkpoint_file to dump the file pointer too.
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+	pid_t pid;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check first: find_locks_with_owner() dereferences file */
+	ret = -EBADF;
+	if (!file) {
+		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
+		goto out;
+	}
+
+	ret = find_locks_with_owner(file, files);
+	/*
+	 * find_locks_with_owner() returns an error when there
+	 * are no locks found, so we *want* it to return an error
+	 * code.  Its success means we have to fail the checkpoint.
+	 */
+	if (!ret) {
+		ret = -EBADF;
+		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
+		goto out;
+	}
+
+	/*
+	 * TODO: Implement c/r of fowner and f_sigio.  Should be
+	 * trivial, but for now we just refuse its checkpoint
+	 */
+	pid = f_getown(file);
+	if (pid) {
+		ret = -EBUSY;
+		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+out:
+	ckpt_hdr_put(ctx, h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+static int do_checkpoint_file_table(struct ckpt_ctx *ctx,
+				    struct files_struct *files)
+{
+	struct ckpt_hdr_file_table *h;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (!h)
+		return -ENOMEM;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	kfree(fdtable);
+	return ret;
+}
+
+/* checkpoint callback for file table */
+int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_file_table(ctx, (struct files_struct *) ptr);
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int objref;
+
+	files = get_files_struct(t);
+	if (!files)
+		return -EBUSY;
+	objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+	put_files_struct(files);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file)
+{
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+	if (ret <= 0)
+		return ret;
+	/* if first time for this file (ret > 0), invoke ->collect() */
+	if (file->f_op->collect)
+		ret = file->f_op->collect(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)File collect\n", file);
+	return ret;
+}
+
+static int collect_file_desc(struct ckpt_ctx *ctx,
+			     struct files_struct *files, int fd)
+{
+	struct fdtable *fdt;
+	struct file *file;
+	int ret;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file)
+		get_file(file);
+	rcu_read_unlock();
+
+	if (!file) {
+		ckpt_err(ctx, -EBUSY, "%(T)%(P)File removed\n", file);
+		return -EBUSY;
+	}
+
+	ret = ckpt_collect_file(ctx, file);
+	fput(file);
+
+	return ret;
+}
+
+static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files)
+{
+	int *fdtable;
+	int nfds, n;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this file table (ret > 0), proceed inside */
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+
+	for (n = 0; n < nfds; n++) {
+		ret = collect_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+	kfree(fdtable);
+	return ret;
+}
+
+int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int ret;
+
+	files = get_files_struct(t);
+	if (!files) {
+		ckpt_err(ctx, -EBUSY, "%(T)files_struct missing\n");
+		return -EBUSY;
+	}
+	ret = collect_file_table(ctx, files);
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 22b1601..f25d130 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,8 @@
 
 #include <linux/kernel.h>
 #include <linux/hash.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_file_table_grab(void *ptr)
+{
+	atomic_inc(&((struct files_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_file_table_drop(void *ptr, int lastref)
+{
+	put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_table_users(void *ptr)
+{
+	return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr, int lastref)
+{
+	fput((struct file *) ptr);
+}
+
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* files_struct object */
+	{
+		.obj_name = "FILE_TABLE",
+		.obj_type = CKPT_OBJ_FILE_TABLE,
+		.ref_drop = obj_file_table_drop,
+		.ref_grab = obj_file_table_grab,
+		.ref_users = obj_file_table_users,
+		.checkpoint = checkpoint_file_table,
+	},
+	/* file object */
+	{
+		.obj_name = "FILE",
+		.obj_type = CKPT_OBJ_FILE,
+		.ref_drop = obj_file_drop,
+		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
+		.checkpoint = checkpoint_file,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index ef394a5..adc34a2 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int files_objref;
+	int ret;
+
+	files_objref = checkpoint_obj_file_table(ctx, t);
+	ckpt_debug("files: objref %d\n", files_objref);
+	if (files_objref < 0) {
+		ckpt_err(ctx, files_objref, "%(T)files_struct\n");
+		return files_objref;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+	h->files_objref = files_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 /* dump the task_struct of a given task */
 int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -240,6 +263,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs %d\n", ret);
  out:
 	ctx->tsk = NULL;
 	return ret;
@@ -247,7 +274,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
-	return 0;
+	int ret;
+
+	ret = ckpt_collect_file_table(ctx, t);
+
+	return ret;
 }
 
 /***********************************************************************
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 926c937..30b8004 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -206,12 +206,16 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->kflags & CKPT_CTX_RESTART)
 		restore_debug_free(ctx);
 
+	if (ctx->files_deferq)
+		deferqueue_destroy(ctx->files_deferq);
+
 	if (ctx->file)
 		fput(ctx->file);
 	if (ctx->logfile)
 		fput(ctx->logfile);
 
 	ckpt_obj_hash_free(ctx);
+	path_put(&ctx->root_fs_path);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -270,6 +274,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
+	ctx->files_deferq = deferqueue_create();
+	if (!ctx->files_deferq)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/fs/locks.c b/fs/locks.c
index a8794f2..721481a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1994,6 +1994,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner)
 
 EXPORT_SYMBOL(locks_remove_posix);
 
+int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct file_lock **inode_fl;
+	int ret = -ENOENT;	/* no matching lock found (same as the !CONFIG stub) */
+
+	lock_kernel();
+	for_each_lock(inode, inode_fl) {
+		struct file_lock *fl = *inode_fl;
+		/*
+		 * We could use posix_same_owner() along with a 'fake'
+		 * file_lock.  But, the fake file will never have the
+		 * same fl_lmops as the fl that we are looking for and
+		 * posix_same_owner() would just fall back to this
+		 * check anyway.
+		 */
+		if (IS_POSIX(fl)) {
+			if (fl->fl_owner == owner) {
+				ret = 0;
+				break;
+			}
+		} else if (IS_FLOCK(fl) || IS_LEASE(fl)) {
+			if (fl->fl_file == filp) {
+				ret = 0;
+				break;
+			}
+		} else {
+			WARN(1, "unknown file lock type, fl_flags: %x",
+				fl->fl_flags);
+		}
+	}
+	unlock_kernel();
+	return ret;
+}
+
 /*
  * This function is called on the last close of an open file.
  */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 50ce8f9..d74a890 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -80,6 +80,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
@@ -156,6 +159,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -166,6 +184,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
+#define CKPT_DFILE	0x10		/* files and filesystem */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cdca9e4..3222545 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -71,6 +71,8 @@ enum {
 #define CKPT_HDR_TREE CKPT_HDR_TREE
 	CKPT_HDR_TASK,
 #define CKPT_HDR_TASK CKPT_HDR_TASK
+	CKPT_HDR_TASK_OBJS,
+#define CKPT_HDR_TASK_OBJS CKPT_HDR_TASK_OBJS
 	CKPT_HDR_RESTART_BLOCK,
 #define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK
 	CKPT_HDR_THREAD,
@@ -80,6 +82,15 @@ enum {
 
 	/* 201-299: reserved for arch-dependent */
 
+	CKPT_HDR_FILE_TABLE = 301,
+#define CKPT_HDR_FILE_TABLE CKPT_HDR_FILE_TABLE
+	CKPT_HDR_FILE_DESC,
+#define CKPT_HDR_FILE_DESC CKPT_HDR_FILE_DESC
+	CKPT_HDR_FILE_NAME,
+#define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME
+	CKPT_HDR_FILE,
+#define CKPT_HDR_FILE CKPT_HDR_FILE
+
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
 
@@ -106,6 +117,10 @@ struct ckpt_hdr_objref {
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 #define CKPT_OBJ_IGNORE CKPT_OBJ_IGNORE
+	CKPT_OBJ_FILE_TABLE,
+#define CKPT_OBJ_FILE_TABLE CKPT_OBJ_FILE_TABLE
+	CKPT_OBJ_FILE,
+#define CKPT_OBJ_FILE CKPT_OBJ_FILE
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -188,6 +203,12 @@ struct ckpt_hdr_task {
 	__u64 clear_child_tid;
 } __attribute__((aligned(8)));
 
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+} __attribute__((aligned(8)));
+
 /* restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
@@ -220,4 +241,42 @@ enum restart_block_type {
 #define CKPT_RESTART_BLOCK_FUTEX CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+#define CKPT_FILE_IGNORE CKPT_FILE_IGNORE
+	CKPT_FILE_GENERIC,
+#define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
+	CKPT_FILE_MAX
+#define CKPT_FILE_MAX CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 90bbb16..aae6755 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -14,6 +14,8 @@
 
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/list.h>
+#include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
 #include <linux/wait.h>
@@ -40,6 +42,9 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *files_deferq;	/* deferred file-table work */
+
+	struct path root_fs_path;     /* container root (FIXME) */
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 65ebec5..7902a51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1120,6 +1120,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_flock(struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
+extern int find_locks_with_owner(struct file *filp, fl_owner_t owner);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
 extern int posix_lock_file_wait(struct file *, struct file_lock *);
 extern int posix_unblock_lock(struct file *, struct file_lock *);
@@ -1188,6 +1189,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
 	return;
 }
 
+static inline int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	return -ENOENT;
+}
+
 static inline void locks_remove_flock(struct file *filp)
 {
 	return;
@@ -2318,7 +2324,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
 #define generic_file_checkpoint NULL
+#endif
 
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 39/96] c/r: restore open file descriptors
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
                     ` (12 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: containers, Andreas Dilger

For each fd, read 'struct ckpt_hdr_file_desc' and look up the objref in
the hash table. If it is not found there (first occurrence), read in
'struct ckpt_hdr_file', create a new file and register it in the hash.
Otherwise attach the file pointer from the hash as an FD.
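
The lookup-or-create step described above can be sketched in user
space as below. This is a minimal illustration, not the kernel code:
ref_table, ref_fetch() and count_restore() are hypothetical names, a
plain array indexed by objref stands in for the hash, and the sketch
follows the description above rather than the exact call chain in
checkpoint/files.c.

```c
#include <assert.h>
#include <stddef.h>

#define REF_MAX 64	/* arbitrary capacity for this sketch */

/* hypothetical stand-in for the restart-side object hash */
struct ref_table {
	void *obj[REF_MAX + 1];	/* indexed by objref; NULL means not restored yet */
};

typedef void *(*restore_fn)(void *arg);

/*
 * Fetch the object for @objref. On the first occurrence, call
 * @restore to build it and register the result; later occurrences
 * reuse the registered pointer. Returns NULL on a bad objref or if
 * @restore fails.
 */
static void *ref_fetch(struct ref_table *t, int objref,
		       restore_fn restore, void *arg)
{
	assert(t && restore);
	if (objref <= 0 || objref > REF_MAX)
		return NULL;
	if (!t->obj[objref])
		t->obj[objref] = restore(arg);
	return t->obj[objref];
}

/* example callback: counts how many times an object is really built */
static void *count_restore(void *arg)
{
	int *calls = arg;

	(*calls)++;
	return calls;	/* any non-NULL pointer will do for the sketch */
}
```

Fetching the same objref twice runs the callback once; the second fd
simply gets the pointer that is already registered.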


Changelog[v19-rc1]:
  - Fix lockdep complaint in restore_obj_files()
Changelog[v19-rc1]:
  - Restore thread/cpu state early
  - Ensure null-termination of file names read from image
  - Fix compile warning in restore_open_fname()
Changelog[v18]:
  - Invoke set_close_on_exec() unconditionally on restart
Changelog[v17]:
  - Validate f_mode after restore against saved f_mode
  - Fail if f_flags have O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - Introduce a per file-type restore() callback
  - Revert change to pr_debug(), back to ckpt_debug()
  - Rename:  restore_files() => restore_fd_table()
  - Rename:  ckpt_read_fd_data() => restore_file()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'hh->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c         |  318 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c       |    2 +
 checkpoint/process.c       |   20 +++
 include/linux/checkpoint.h |    7 +
 4 files changed, 347 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 7a57b24..b404c8f 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -16,6 +16,8 @@
 #include <linux/sched.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -442,3 +444,319 @@ int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	return ret;
 }
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * restore_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ */
+struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
+{
+	struct file *file;
+	char *fname;
+	int len;
+
+	/* prevent bad input from doing bad things */
+	if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
+		return ERR_PTR(-EINVAL);
+
+	len = ckpt_read_payload(ctx, (void **) &fname,
+				PATH_MAX, CKPT_HDR_FILE_NAME);
+	if (len < 0)
+		return ERR_PTR(len);
+	fname[len - 1] = '\0';	/* always play it safe */
+	ckpt_debug("fname '%s' flags %#x\n", fname, flags);
+
+	file = filp_open(fname, flags, 0);
+	kfree(fname);
+
+	return file;
+}
+
+static int close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		get_file(file);
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CKPT_SETFL_MASK  \
+	(O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
+
+int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			struct ckpt_hdr_file *h)
+{
+	fmode_t new_mode = file->f_mode;
+	fmode_t saved_mode = (__force fmode_t) h->f_mode;
+	int ret;
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
+	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Normally f_mode is set by open, and modified only via
+	 * fcntl(), so its value now should match that at checkpoint.
+	 * However, a file may be downgraded from (read-)write to
+	 * read-only, e.g:
+	 *  - mark_files_ro() unsets FMODE_WRITE
+	 *  - nfs4_file_downgrade() too, and also sets FMODE_READ
+	 * Validate the new f_mode against saved f_mode, allowing:
+	 *  - new with FMODE_WRITE, saved without FMODE_WRITE
+	 *  - new without FMODE_READ, saved with FMODE_READ
+	 */
+	if ((new_mode & FMODE_WRITE) && !(saved_mode & FMODE_WRITE)) {
+		new_mode &= ~FMODE_WRITE;
+		if (!(new_mode & FMODE_READ) && (saved_mode & FMODE_READ))
+			new_mode |= FMODE_READ;
+	}
+	/* finally, at this point new mode should match saved mode */
+	if (new_mode ^ saved_mode)
+		return -EINVAL;
+
+	if (file->f_mode & FMODE_LSEEK)
+		ret = vfs_llseek(file, h->f_pos, SEEK_SET);
+
+	return ret;
+}
+
+static struct file *generic_file_restore(struct ckpt_ctx *ctx,
+					 struct ckpt_hdr_file *ptr)
+{
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC)
+		return ERR_PTR(-EINVAL);
+
+	file = restore_open_fname(ctx, ptr->f_flags);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+	return file;
+}
+
+struct restore_file_ops {
+	char *file_name;
+	enum file_type file_type;
+	struct file * (*restore) (struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_file *ptr);
+};
+
+static struct restore_file_ops restore_file_ops[] = {
+	/* ignored file */
+	{
+		.file_name = "IGNORE",
+		.file_type = CKPT_FILE_IGNORE,
+		.restore = NULL,
+	},
+	/* regular file/directory */
+	{
+		.file_name = "GENERIC",
+		.file_type = CKPT_FILE_GENERIC,
+		.restore = generic_file_restore,
+	},
+};
+
+static struct file *do_restore_file(struct ckpt_ctx *ctx)
+{
+	struct restore_file_ops *ops;
+	struct ckpt_hdr_file *h;
+	struct file *file = ERR_PTR(-EINVAL);
+
+	/*
+	 * All 'struct ckpt_hdr_file_...' begin with ckpt_hdr_file,
+	 * but the actual object depends on the file type. The length
+	 * should never be more than a page.
+	 */
+	h = ckpt_read_buf_type(ctx, PAGE_SIZE, CKPT_HDR_FILE);
+	if (IS_ERR(h))
+		return (struct file *) h;
+	ckpt_debug("flags %#x mode %#x type %d\n",
+		 h->f_flags, h->f_mode, h->f_type);
+
+	if (h->f_type >= CKPT_FILE_MAX)
+		goto out;
+
+	ops = &restore_file_ops[h->f_type];
+	BUG_ON(ops->file_type != h->f_type);
+
+	if (ops->restore)
+		file = ops->restore(ctx, h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return file;
+}
+
+/* restore callback for file pointer */
+void *restore_file(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_file(ctx);
+}
+
+/**
+ * restore_file_desc - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls restore_file to restore the file too.
+ */
+static int restore_file_desc(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file;
+	int newfd, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+	ckpt_debug("ref %d fd %d c.o.e %d\n",
+		 h->fd_objref, h->fd_descriptor, h->fd_close_on_exec);
+
+	ret = -EINVAL;
+	if (h->fd_objref <= 0 || h->fd_descriptor < 0)
+		goto out;
+
+	file = ckpt_obj_fetch(ctx, h->fd_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	newfd = attach_file(file);
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	ckpt_debug("newfd got %d wanted %d\n", newfd, h->fd_descriptor);
+
+	/* reposition if newfd isn't desired fd */
+	if (newfd != h->fd_descriptor) {
+		ret = sys_dup2(newfd, h->fd_descriptor);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	set_close_on_exec(h->fd_descriptor, h->fd_close_on_exec);
+	ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* restore callback for file table */
+static struct files_struct *do_restore_file_table(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_table *h;
+	struct files_struct *files;
+	int i, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (IS_ERR(h))
+		return (struct files_struct *) h;
+
+	ckpt_debug("nfds %d\n", h->fdt_nfds);
+
+	ret = -EMFILE;
+	if (h->fdt_nfds < 0 || h->fdt_nfds > sysctl_nr_open)
+		goto out;
+
+	/*
+	 * We assume that restarting tasks, as created in user-space,
+	 * have distinct files_struct objects each. If not, we need to
+	 * call dup_fd() to make sure we don't overwrite an already
+	 * restored one.
+	 */
+
+	/* point of no return -- close all file descriptors */
+	ret = close_all_fds(current->files);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < h->fdt_nfds; i++) {
+		ret = restore_file_desc(ctx);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	if (!ret) {
+		files = current->files;
+		atomic_inc(&files->count);
+	} else {
+		files = ERR_PTR(ret);
+	}
+	return files;
+}
+
+void *restore_file_table(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_file_table(ctx);
+}
+
+int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref)
+{
+	struct files_struct *files;
+
+	files = ckpt_obj_fetch(ctx, files_objref, CKPT_OBJ_FILE_TABLE);
+	if (IS_ERR(files))
+		return PTR_ERR(files);
+
+	if (files != current->files) {
+		struct files_struct *prev;
+
+		task_lock(current);
+		prev = current->files;
+		current->files = files;
+		atomic_inc(&files->count);
+		task_unlock(current);
+
+		put_files_struct(prev);
+	}
+
+	return 0;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index f25d130..cacc4c7 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -112,6 +112,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_grab = obj_file_table_grab,
 		.ref_users = obj_file_table_users,
 		.checkpoint = checkpoint_file_table,
+		.restore = restore_file_table,
 	},
 	/* file object */
 	{
@@ -121,6 +122,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_grab = obj_file_grab,
 		.ref_users = obj_file_users,
 		.checkpoint = checkpoint_file,
+		.restore = restore_file,
 	},
 };
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index adc34a2..23e0296 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -348,6 +348,22 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_objs(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task_objs *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = restore_obj_file_table(ctx, h->files_objref);
+	ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 int restore_restart_block(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_restart_block *h;
@@ -477,6 +493,10 @@ int restore_task(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_cpu(ctx);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = restore_task_objs(ctx);
+	ckpt_debug("objs %d\n", ret);
  out:
 	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index d74a890..749f30c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -163,16 +163,23 @@ extern int restore_restart_block(struct ckpt_ctx *ctx);
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
 				     struct task_struct *t);
+extern int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref);
 extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_file_table(struct ckpt_ctx *ctx);
 
 /* files */
 extern int checkpoint_fname(struct ckpt_ctx *ctx,
 			    struct path *path, struct path *root);
+extern struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags);
+
 extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
 extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_file(struct ckpt_ctx *ctx);
 
 extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 				  struct ckpt_hdr_file *h);
+extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			       struct ckpt_hdr_file *h);
 
 static inline int ckpt_validate_errno(int errno)
 {
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 39/96] c/r: restore open file descriptors
  2010-03-19  0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
                   ` (2 preceding siblings ...)
  2010-03-19  0:59 ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
@ 2010-03-19  0:59 ` Oren Laadan
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

For each fd, read 'struct ckpt_hdr_file_desc' and look up its objref in
the hash table. If it is not found there (first occurrence), read in
'struct ckpt_hdr_file', create a new file and register it in the hash.
Otherwise, attach the file pointer from the hash as an FD.
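
The lookup-or-create step described above can be sketched in plain
userspace C. This is only an illustrative model, not the kernel code:
'struct toy_file', objhash_fetch(), restore_fd() and the fixed-size
table are hypothetical stand-ins for 'struct file', ckpt_obj_fetch()
and the object hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_OBJREF 16	/* hypothetical table size, for illustration only */

/* Hypothetical stand-in for the kernel's struct file. */
struct toy_file {
	int objref;	/* identifier recorded in the checkpoint image */
	int users;	/* number of descriptors sharing this file */
};

/* Toy "hash": a direct-mapped table indexed by objref. */
static struct toy_file *objhash[MAX_OBJREF];

/*
 * Return the file registered under @objref, creating and registering
 * it on first occurrence. Returns NULL on a bad objref or on
 * allocation failure.
 */
static struct toy_file *objhash_fetch(int objref)
{
	struct toy_file *f;

	if (objref <= 0 || objref >= MAX_OBJREF)
		return NULL;
	if (objhash[objref])		/* seen before: share it */
		return objhash[objref];

	f = calloc(1, sizeof(*f));	/* first occurrence: "restore" it */
	if (!f)
		return NULL;
	f->objref = objref;
	objhash[objref] = f;
	return f;
}

/* Attach the file behind @objref to one more descriptor. */
static struct toy_file *restore_fd(int objref)
{
	struct toy_file *f = objhash_fetch(objref);

	if (f) {
		assert(f->objref == objref);	/* table invariant */
		f->users++;
	}
	return f;
}
```

In this model, two descriptors that carry the same objref end up
sharing one object, which mirrors how file descriptors that shared a
'struct file' at checkpoint time share it again after restart.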


Changelog[v19-rc1]:
  - Fix lockdep complaint in restore_obj_files()
Changelog[v19-rc1]:
  - Restore thread/cpu state early
  - Ensure null-termination of file names read from image
  - Fix compile warning in restore_open_fname()
Changelog[v18]:
  - Invoke set_close_on_exec() unconditionally on restart
Changelog[v17]:
  - Validate f_mode after restore against saved f_mode
  - Fail if f_flags have O_CREAT|O_EXCL|O_NOCTTY|O_TRUNC
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - Introduce a per file-type restore() callback
  - Revert change to pr_debug(), back to ckpt_debug()
  - Rename:  restore_files() => restore_fd_table()
  - Rename:  ckpt_read_fd_data() => restore_file()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'hh->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c         |  318 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c       |    2 +
 checkpoint/process.c       |   20 +++
 include/linux/checkpoint.h |    7 +
 4 files changed, 347 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 7a57b24..b404c8f 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -16,6 +16,8 @@
 #include <linux/sched.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -442,3 +444,319 @@ int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 
 	return ret;
 }
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * restore_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ */
+struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
+{
+	struct file *file;
+	char *fname;
+	int len;
+
+	/* prevent bad input from doing bad things */
+	if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
+		return ERR_PTR(-EINVAL);
+
+	len = ckpt_read_payload(ctx, (void **) &fname,
+				PATH_MAX, CKPT_HDR_FILE_NAME);
+	if (len < 0)
+		return ERR_PTR(len);
+	fname[len - 1] = '\0';	/* always play it safe */
+	ckpt_debug("fname '%s' flags %#x\n", fname, flags);
+
+	file = filp_open(fname, flags, 0);
+	kfree(fname);
+
+	return file;
+}
+
+static int close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		get_file(file);
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CKPT_SETFL_MASK  \
+	(O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
+
+int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			struct ckpt_hdr_file *h)
+{
+	fmode_t new_mode = file->f_mode;
+	fmode_t saved_mode = (__force fmode_t) h->f_mode;
+	int ret;
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
+	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Normally f_mode is set by open, and modified only via
+	 * fcntl(), so its value now should match that at checkpoint.
+	 * However, a file may be downgraded from (read-)write to
+	 * read-only, e.g:
+	 *  - mark_files_ro() unsets FMODE_WRITE
+	 *  - nfs4_file_downgrade() too, and also sets FMODE_READ
+	 * Validate the new f_mode against saved f_mode, allowing:
+	 *  - new with FMODE_WRITE, saved without FMODE_WRITE
+	 *  - new without FMODE_READ, saved with FMODE_READ
+	 */
+	if ((new_mode & FMODE_WRITE) && !(saved_mode & FMODE_WRITE)) {
+		new_mode &= ~FMODE_WRITE;
+		if (!(new_mode & FMODE_READ) && (saved_mode & FMODE_READ))
+			new_mode |= FMODE_READ;
+	}
+	/* finally, at this point new mode should match saved mode */
+	if (new_mode ^ saved_mode)
+		return -EINVAL;
+
+	if (file->f_mode & FMODE_LSEEK)
+		ret = vfs_llseek(file, h->f_pos, SEEK_SET);
+
+	return ret;
+}
+
+static struct file *generic_file_restore(struct ckpt_ctx *ctx,
+					 struct ckpt_hdr_file *ptr)
+{
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC)
+		return ERR_PTR(-EINVAL);
+
+	file = restore_open_fname(ctx, ptr->f_flags);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+	return file;
+}
+
+struct restore_file_ops {
+	char *file_name;
+	enum file_type file_type;
+	struct file * (*restore) (struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_file *ptr);
+};
+
+static struct restore_file_ops restore_file_ops[] = {
+	/* ignored file */
+	{
+		.file_name = "IGNORE",
+		.file_type = CKPT_FILE_IGNORE,
+		.restore = NULL,
+	},
+	/* regular file/directory */
+	{
+		.file_name = "GENERIC",
+		.file_type = CKPT_FILE_GENERIC,
+		.restore = generic_file_restore,
+	},
+};
+
+static struct file *do_restore_file(struct ckpt_ctx *ctx)
+{
+	struct restore_file_ops *ops;
+	struct ckpt_hdr_file *h;
+	struct file *file = ERR_PTR(-EINVAL);
+
+	/*
+	 * All 'struct ckpt_hdr_file_...' begin with ckpt_hdr_file,
+	 * but the actual object depends on the file type. The length
+	 * should never be more than a page.
+	 */
+	h = ckpt_read_buf_type(ctx, PAGE_SIZE, CKPT_HDR_FILE);
+	if (IS_ERR(h))
+		return (struct file *) h;
+	ckpt_debug("flags %#x mode %#x type %d\n",
+		 h->f_flags, h->f_mode, h->f_type);
+
+	if (h->f_type >= CKPT_FILE_MAX)
+		goto out;
+
+	ops = &restore_file_ops[h->f_type];
+	BUG_ON(ops->file_type != h->f_type);
+
+	if (ops->restore)
+		file = ops->restore(ctx, h);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return file;
+}
+
+/* restore callback for file pointer */
+void *restore_file(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_file(ctx);
+}
+
+/**
+ * restore_file_desc - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls restore_file to restore the file too.
+ */
+static int restore_file_desc(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file;
+	int newfd, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+	ckpt_debug("ref %d fd %d c.o.e %d\n",
+		 h->fd_objref, h->fd_descriptor, h->fd_close_on_exec);
+
+	ret = -EINVAL;
+	if (h->fd_objref <= 0 || h->fd_descriptor < 0)
+		goto out;
+
+	file = ckpt_obj_fetch(ctx, h->fd_objref, CKPT_OBJ_FILE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	newfd = attach_file(file);
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	ckpt_debug("newfd got %d wanted %d\n", newfd, h->fd_descriptor);
+
+	/* reposition if newfd isn't desired fd */
+	if (newfd != h->fd_descriptor) {
+		ret = sys_dup2(newfd, h->fd_descriptor);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	set_close_on_exec(h->fd_descriptor, h->fd_close_on_exec);
+	ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+/* restore callback for file table */
+static struct files_struct *do_restore_file_table(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_file_table *h;
+	struct files_struct *files;
+	int i, ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (IS_ERR(h))
+		return (struct files_struct *) h;
+
+	ckpt_debug("nfds %d\n", h->fdt_nfds);
+
+	ret = -EMFILE;
+	if (h->fdt_nfds < 0 || h->fdt_nfds > sysctl_nr_open)
+		goto out;
+
+	/*
+	 * We assume that restarting tasks, as created in user-space,
+	 * have distinct files_struct objects each. If not, we need to
+	 * call dup_fd() to make sure we don't overwrite an already
+	 * restored one.
+	 */
+
+	/* point of no return -- close all file descriptors */
+	ret = close_all_fds(current->files);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < h->fdt_nfds; i++) {
+		ret = restore_file_desc(ctx);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	ckpt_hdr_put(ctx, h);
+	if (!ret) {
+		files = current->files;
+		atomic_inc(&files->count);
+	} else {
+		files = ERR_PTR(ret);
+	}
+	return files;
+}
+
+void *restore_file_table(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_file_table(ctx);
+}
+
+int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref)
+{
+	struct files_struct *files;
+
+	files = ckpt_obj_fetch(ctx, files_objref, CKPT_OBJ_FILE_TABLE);
+	if (IS_ERR(files))
+		return PTR_ERR(files);
+
+	if (files != current->files) {
+		struct files_struct *prev;
+
+		task_lock(current);
+		prev = current->files;
+		current->files = files;
+		atomic_inc(&files->count);
+		task_unlock(current);
+
+		put_files_struct(prev);
+	}
+
+	return 0;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index f25d130..cacc4c7 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -112,6 +112,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_grab = obj_file_table_grab,
 		.ref_users = obj_file_table_users,
 		.checkpoint = checkpoint_file_table,
+		.restore = restore_file_table,
 	},
 	/* file object */
 	{
@@ -121,6 +122,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_grab = obj_file_grab,
 		.ref_users = obj_file_users,
 		.checkpoint = checkpoint_file,
+		.restore = restore_file,
 	},
 };
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index adc34a2..23e0296 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -348,6 +348,22 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+static int restore_task_objs(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_task_objs *h;
+	int ret;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ret = restore_obj_file_table(ctx, h->files_objref);
+	ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
+
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 int restore_restart_block(struct ckpt_ctx *ctx)
 {
 	struct ckpt_hdr_restart_block *h;
@@ -477,6 +493,10 @@ int restore_task(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_cpu(ctx);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = restore_task_objs(ctx);
+	ckpt_debug("objs %d\n", ret);
  out:
 	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index d74a890..749f30c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -163,16 +163,23 @@ extern int restore_restart_block(struct ckpt_ctx *ctx);
 extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
 				     struct task_struct *t);
+extern int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref);
 extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_file_table(struct ckpt_ctx *ctx);
 
 /* files */
 extern int checkpoint_fname(struct ckpt_ctx *ctx,
 			    struct path *path, struct path *root);
+extern struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags);
+
 extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
 extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_file(struct ckpt_ctx *ctx);
 
 extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 				  struct ckpt_hdr_file *h);
+extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			       struct ckpt_hdr_file *h);
 
 static inline int ckpt_validate_errno(int errno)
 {
-- 
1.6.3.3
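
The f_mode check in restore_file_common() can be exercised on its own
in userspace. The sketch below copies the comparison logic from the
patch; the TOY_FMODE_* constants and the function name validate_mode()
are made up for illustration and are not part of the series.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative values only; the real FMODE_* are defined by the kernel. */
#define TOY_FMODE_READ   0x1u
#define TOY_FMODE_WRITE  0x2u

/*
 * Compare the mode of the freshly opened file (@new_mode) with the mode
 * saved at checkpoint (@saved_mode). A saved mode that lost write access
 * (and possibly gained read access) relative to the new one is accepted;
 * any other mismatch is rejected with -EINVAL.
 */
static int validate_mode(unsigned int new_mode, unsigned int saved_mode)
{
	if ((new_mode & TOY_FMODE_WRITE) && !(saved_mode & TOY_FMODE_WRITE)) {
		new_mode &= ~TOY_FMODE_WRITE;
		if (!(new_mode & TOY_FMODE_READ) &&
		    (saved_mode & TOY_FMODE_READ))
			new_mode |= TOY_FMODE_READ;
	}
	/* after the adjustment, both modes must be identical */
	return (new_mode ^ saved_mode) ? -EINVAL : 0;
}
```

With these definitions, a file reopened read-write whose saved mode is
read-only passes, while a file reopened read-only whose saved mode is
read-write fails: the check tolerates a downgrade recorded in the
image, but never an upgrade.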


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 39/96] c/r: restore " Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
                     ` (11 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

Changelog[v17]
  - Forward-declare 'ckpt_ctx' et al., don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 include/linux/mm.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60c467b..48d67ee 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,6 +19,7 @@ struct file_ra_state;
 struct user_struct;
 struct writeback_control;
 struct rlimit;
+struct ckpt_ctx;
 
 #ifndef CONFIG_DISCONTIGMEM          /* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -220,6 +221,9 @@ struct vm_operations_struct {
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct ckpt_ctx *ctx, struct vm_area_struct *vma);
+#endif
 };
 
 struct mmu_gather;
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct
  2010-03-19  0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
                   ` (4 preceding siblings ...)
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-19  0:59 ` Oren Laadan
  2010-03-19  0:59 ` [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

Changelog[v17]
  - Forward-declare 'ckpt_ctx' et al., don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 include/linux/mm.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60c467b..48d67ee 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,6 +19,7 @@ struct file_ra_state;
 struct user_struct;
 struct writeback_control;
 struct rlimit;
+struct ckpt_ctx;
 
 #ifndef CONFIG_DISCONTIGMEM          /* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -220,6 +221,9 @@ struct vm_operations_struct {
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
+#ifdef CONFIG_CHECKPOINT
+	int (*checkpoint)(struct ckpt_ctx *ctx, struct vm_area_struct *vma);
+#endif
 };
 
 struct mmu_gather;
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
                     ` (10 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Andreas Dilger, Dave Hansen

From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

This marks ext[234] as being checkpointable.  Many more filesystems
will need the same treatment, but this is a start.

Changelog[ckpt-v19-rc3]:
  - Rebase to kernel 2.6.33 (ext2)
Changelog[v1]:
  - [Serge Hallyn] Use filemap_checkpoint() in ext4_file_vm_ops

Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 fs/ext2/dir.c  |    1 +
 fs/ext2/file.c |    2 ++
 fs/ext3/dir.c  |    1 +
 fs/ext3/file.c |    1 +
 fs/ext4/dir.c  |    1 +
 fs/ext4/file.c |    4 ++++
 6 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7516957..84c17f9 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -722,4 +722,5 @@ const struct file_operations ext2_dir_operations = {
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
 	.fsync		= ext2_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 586e358..b38d7b9 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -75,6 +75,7 @@ const struct file_operations ext2_file_operations = {
 	.fsync		= ext2_fsync,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
@@ -90,6 +91,7 @@ const struct file_operations ext2_xip_file_operations = {
 	.open		= generic_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 #endif
 
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 373fa90..65f98af 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_operations = {
 #endif
 	.fsync		= ext3_sync_file,	/* BKL held */
 	.release	= ext3_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 388bbdf..bcd9b88 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -67,6 +67,7 @@ const struct file_operations ext3_file_operations = {
 	.fsync		= ext3_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 9dc9316..f69404c 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext4_dir_operations = {
 #endif
 	.fsync		= ext4_sync_file,
 	.release	= ext4_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 9630583..93a129b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -84,6 +84,9 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
 	.page_mkwrite   = ext4_page_mkwrite,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= filemap_checkpoint,
+#endif
 };
 
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
@@ -146,6 +149,7 @@ const struct file_operations ext4_file_operations = {
 	.fsync		= ext4_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses
  2010-03-19  0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
                   ` (5 preceding siblings ...)
  2010-03-19  0:59 ` [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
@ 2010-03-19  0:59 ` Oren Laadan
  2010-03-19  0:59 ` [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: containers, Matt Helsley, Andreas Dilger, Dave Hansen, Oren Laadan

From: Dave Hansen <dave@linux.vnet.ibm.com>

This marks ext[234] as being checkpointable.  Many more filesystems
will need the same treatment, but this is a start.
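
As a rough userspace illustration of what "marking" means here: a
filesystem opts in by filling the '->checkpoint' slot of its
file_operations, and a caller can refuse files whose slot is empty.
Everything named toy_* below is hypothetical, and the -EBADF return
for the missing-method case is an assumption of this sketch rather
than something this patch shows.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_file;

/* Hypothetical, cut-down model of struct file_operations. */
struct toy_file_operations {
	int (*checkpoint)(struct toy_file *file); /* NULL: not checkpointable */
};

struct toy_file {
	const struct toy_file_operations *f_op;
	int saved;	/* set once the file state has been "written" */
};

/* Stand-in for generic_file_checkpoint(): pretend to save the state. */
static int toy_generic_file_checkpoint(struct toy_file *file)
{
	file->saved = 1;
	return 0;
}

/* A filesystem that opted in, as ext[234] do in this patch. */
static const struct toy_file_operations toy_ext_fops = {
	.checkpoint = toy_generic_file_checkpoint,
};

/* A filesystem that has not been converted yet. */
static const struct toy_file_operations toy_other_fops = {
	.checkpoint = NULL,
};

/* Dispatch through the method, refusing files that lack one. */
static int toy_checkpoint_file(struct toy_file *file)
{
	if (!file->f_op || !file->f_op->checkpoint)
		return -EBADF;	/* assumed error code, see above */
	return file->f_op->checkpoint(file);
}
```

The point of the opt-in design is that an unconverted filesystem fails
the checkpoint explicitly instead of producing an image that cannot be
restarted correctly.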

Changelog[ckpt-v19-rc3]:
  - Rebase to kernel 2.6.33 (ext2)
Changelog[v1]:
  - [Serge Hallyn] Use filemap_checkpoint() in ext4_file_vm_ops

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 fs/ext2/dir.c  |    1 +
 fs/ext2/file.c |    2 ++
 fs/ext3/dir.c  |    1 +
 fs/ext3/file.c |    1 +
 fs/ext4/dir.c  |    1 +
 fs/ext4/file.c |    4 ++++
 6 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7516957..84c17f9 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -722,4 +722,5 @@ const struct file_operations ext2_dir_operations = {
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
 	.fsync		= ext2_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 586e358..b38d7b9 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -75,6 +75,7 @@ const struct file_operations ext2_file_operations = {
 	.fsync		= ext2_fsync,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
@@ -90,6 +91,7 @@ const struct file_operations ext2_xip_file_operations = {
 	.open		= generic_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 #endif
 
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 373fa90..65f98af 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_operations = {
 #endif
 	.fsync		= ext3_sync_file,	/* BKL held */
 	.release	= ext3_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 388bbdf..bcd9b88 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -67,6 +67,7 @@ const struct file_operations ext3_file_operations = {
 	.fsync		= ext3_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 9dc9316..f69404c 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext4_dir_operations = {
 #endif
 	.fsync		= ext4_sync_file,
 	.release	= ext4_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 9630583..93a129b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -84,6 +84,9 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
 	.page_mkwrite   = ext4_page_mkwrite,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= filemap_checkpoint,
+#endif
 };
 
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
@@ -146,6 +149,7 @@ const struct file_operations ext4_file_operations = {
 	.fsync		= ext4_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-- 
1.6.3.3


* [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems Oren Laadan
                     ` (9 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

* /dev/null
* /dev/zero
* /dev/random
* /dev/urandom

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 drivers/char/mem.c    |    2 ++
 drivers/char/random.c |    2 ++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 48788db..57e3443 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -763,6 +763,7 @@ static const struct file_operations null_fops = {
 	.read		= read_null,
 	.write		= write_null,
 	.splice_write	= splice_write_null,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 #ifdef CONFIG_DEVPORT
@@ -779,6 +780,7 @@ static const struct file_operations zero_fops = {
 	.read		= read_zero,
 	.write		= write_zero,
 	.mmap		= mmap_zero,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 /*
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 2849713..c082789 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1169,6 +1169,7 @@ const struct file_operations random_fops = {
 	.poll  = random_poll,
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct file_operations urandom_fops = {
@@ -1176,6 +1177,7 @@ const struct file_operations urandom_fops = {
 	.write = random_write,
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /***************************************************************
-- 
1.6.3.3

* [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality Oren Laadan
                     ` (8 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

From: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

These patches extend the use of the generic file checkpoint operation to
non-extX filesystems which have lseek operations that ensure we can save
and restore the files for later use. Note that this does not include
things like FUSE, network filesystems, or pseudo-filesystem kernel
interfaces.
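
For readers new to the mechanism, the following is a minimal user-space
sketch of the idea, not the kernel code in this series. Every mock_* name
and each error code is an assumption made for illustration. The only
premise taken from the text above is that a file qualifies when its path,
open flags and offset are enough to reopen it and seek back.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mock_file;

/* Hypothetical image record: enough to reopen a file and reposition it. */
struct mock_file_image {
	const char *path;
	int flags;
	long long pos;
};

/* Hypothetical stand-in for an operations table with an optional method. */
struct mock_file_operations {
	long long (*llseek)(struct mock_file *file, long long offset);
	int (*checkpoint)(const struct mock_file *file,
			  struct mock_file_image *img);
};

struct mock_file {
	const char *path;
	int flags;
	long long pos;
	const struct mock_file_operations *f_op;
};

/* Trivial seek: the offset alone describes the position. */
long long mock_llseek(struct mock_file *file, long long offset)
{
	file->pos = offset;
	return offset;
}

/* Generic handler: record path, flags and offset, nothing else. */
int mock_generic_file_checkpoint(const struct mock_file *file,
				 struct mock_file_image *img)
{
	assert(file != NULL && img != NULL);
	img->path = file->path;
	img->flags = file->flags;
	img->pos = file->pos;
	return 0;
}

/* Dispatcher: refuse files whose table has no checkpoint method. */
int mock_checkpoint_file(const struct mock_file *file,
			 struct mock_file_image *img)
{
	if (file->f_op == NULL || file->f_op->checkpoint == NULL)
		return -EBADF;	/* error code chosen for illustration */
	return file->f_op->checkpoint(file, img);
}

/* Restore: reposition a freshly reopened file through its own llseek. */
int mock_restore_file(struct mock_file *file,
		      const struct mock_file_image *img)
{
	if (file->f_op == NULL || file->f_op->llseek == NULL)
		return -ESPIPE;	/* error code chosen for illustration */
	file->flags = img->flags;
	return file->f_op->llseek(file, img->pos) == img->pos ? 0 : -EINVAL;
}
```

In this sketch a table that leaves .checkpoint unset makes
mock_checkpoint_file() fail, so each operations table has to opt in
explicitly.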

Only compile and boot tested (on x86-32).

[Oren Laadan] Folded patch series into a single patch; original post
included 36 separate patches for individual filesystems:

  [PATCH 01/36] Add the checkpoint operation for affs files and directories.
  [PATCH 02/36] Add the checkpoint operation for befs directories.
  [PATCH 03/36] Add the checkpoint operation for bfs files and directories.
  [PATCH 04/36] Add the checkpoint operation for btrfs files and directories.
  [PATCH 05/36] Add the checkpoint operation for cramfs directories.
  [PATCH 06/36] Add the checkpoint operation for ecryptfs files and directories.
  [PATCH 07/36] Add the checkpoint operation for fat files and directories.
  [PATCH 08/36] Add the checkpoint operation for freevxfs directories.
  [PATCH 09/36] Add the checkpoint operation for hfs files and directories.
  [PATCH 10/36] Add the checkpoint operation for hfsplus files and directories.
  [PATCH 11/36] Add the checkpoint operation for hpfs files and directories.
  [PATCH 12/36] Add the checkpoint operation for hppfs files and directories.
  [PATCH 13/36] Add the checkpoint operation for iso directories.
  [PATCH 14/36] Add the checkpoint operation for jffs2 files and directories.
  [PATCH 15/36] Add the checkpoint operation for jfs files and directories.
  [PATCH 16/36] Add the checkpoint operation for regular nfs files and directories. Skip the various /proc files for now.
  [PATCH 17/36] Add the checkpoint operation for ntfs directories.
  [PATCH 18/36] Add the checkpoint operation for openpromfs directories. Explicitly skip the properties for now.
  [PATCH 19/36] Add the checkpoint operation for qnx4 files and directories.
  [PATCH 20/36] Add the checkpoint operation for reiserfs files and directories.
  [PATCH 21/36] Add the checkpoint operation for romfs directories.
  [PATCH 22/36] Add the checkpoint operation for squashfs directories.
  [PATCH 23/36] Add the checkpoint operation for sysv filesystem files and directories.
  [PATCH 24/36] Add the checkpoint operation for ubifs files and directories.
  [PATCH 25/36] Add the checkpoint operation for udf filesystem files and directories.
  [PATCH 26/36] Add the checkpoint operation for xfs files and directories.
  [PATCH 27/36] Add checkpoint operation for efs directories.
  [PATCH 28/36] Add the checkpoint operation for generic, read-only files. At present, some/all files of the following filesystems use this generic definition:
  [PATCH 29/36] Add checkpoint operation for minix filesystem files and directories.
  [PATCH 30/36] Add checkpoint operations for omfs files and directories.
  [PATCH 31/36] Add checkpoint operations for ufs files and directories.
  [PATCH 32/36] Add checkpoint operations for ramfs files. NOTE: since simple_dir_operations are shared between multiple filesystems including ramfs, it's not currently possible to checkpoint open ramfs directories.
  [PATCH 33/36] Add the checkpoint operation for adfs files and directories.
  [PATCH 34/36] Add the checkpoint operation to exofs files and directories.
  [PATCH 35/36] Add the checkpoint operation to nilfs2 files and directories.
  [PATCH 36/36] Add checkpoint operations for UML host filesystem files and directories.
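
The ramfs note in [PATCH 32/36] can be pictured with a small user-space
sketch. It is an illustration under assumed names, not kernel code:
mock_simple_dir_operations stands in for a directory operations table
shared by several filesystems, and mock_otherfs for any other user of
that table. Because both filesystems point at the same table, a
checkpoint method cannot be set for one of them without also setting it
for the other, so the table is left without one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory operations table. */
struct mock_dir_operations {
	int (*checkpoint)(void);
};

/* One table shared by several filesystems; no checkpoint method set. */
const struct mock_dir_operations mock_simple_dir_operations = {
	.checkpoint = NULL,
};

struct mock_filesystem {
	const char *name;
	const struct mock_dir_operations *dir_ops;
};

/* Two hypothetical filesystems that reuse the same shared table. */
const struct mock_filesystem mock_ramfs = {
	"ramfs", &mock_simple_dir_operations,
};
const struct mock_filesystem mock_otherfs = {
	"otherfs", &mock_simple_dir_operations,
};

/* An open directory is checkpointable only if its table has the method. */
bool mock_dir_can_checkpoint(const struct mock_filesystem *fs)
{
	assert(fs != NULL);
	return fs->dir_ops != NULL && fs->dir_ops->checkpoint != NULL;
}

/* True when a change to one table would affect both filesystems. */
bool mock_shares_dir_ops(const struct mock_filesystem *a,
			 const struct mock_filesystem *b)
{
	assert(a != NULL && b != NULL);
	return a->dir_ops == b->dir_ops;
}
```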

Changelog[v19-rc3]:
  - [Suka] Enable C/R while executing over NFS

Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 fs/adfs/dir.c               |    1 +
 fs/adfs/file.c              |    1 +
 fs/affs/dir.c               |    1 +
 fs/affs/file.c              |    1 +
 fs/befs/linuxvfs.c          |    1 +
 fs/bfs/dir.c                |    1 +
 fs/bfs/file.c               |    1 +
 fs/btrfs/file.c             |    1 +
 fs/btrfs/inode.c            |    1 +
 fs/btrfs/super.c            |    1 +
 fs/cramfs/inode.c           |    1 +
 fs/ecryptfs/file.c          |    2 ++
 fs/ecryptfs/miscdev.c       |    1 +
 fs/efs/dir.c                |    1 +
 fs/exofs/dir.c              |    1 +
 fs/exofs/file.c             |    1 +
 fs/fat/dir.c                |    1 +
 fs/fat/file.c               |    1 +
 fs/freevxfs/vxfs_lookup.c   |    1 +
 fs/hfs/dir.c                |    1 +
 fs/hfs/inode.c              |    1 +
 fs/hfsplus/dir.c            |    1 +
 fs/hfsplus/inode.c          |    1 +
 fs/hostfs/hostfs_kern.c     |    2 ++
 fs/hpfs/dir.c               |    1 +
 fs/hpfs/file.c              |    1 +
 fs/hppfs/hppfs.c            |    2 ++
 fs/isofs/dir.c              |    1 +
 fs/jffs2/dir.c              |    1 +
 fs/jffs2/file.c             |    1 +
 fs/jfs/file.c               |    1 +
 fs/jfs/namei.c              |    1 +
 fs/minix/dir.c              |    1 +
 fs/minix/file.c             |    1 +
 fs/nfs/dir.c                |    1 +
 fs/nfs/file.c               |    4 ++++
 fs/nilfs2/dir.c             |    2 +-
 fs/nilfs2/file.c            |    1 +
 fs/ntfs/dir.c               |    1 +
 fs/ntfs/file.c              |    3 ++-
 fs/omfs/dir.c               |    1 +
 fs/omfs/file.c              |    1 +
 fs/openpromfs/inode.c       |    2 ++
 fs/qnx4/dir.c               |    1 +
 fs/ramfs/file-mmu.c         |    1 +
 fs/ramfs/file-nommu.c       |    1 +
 fs/read_write.c             |    1 +
 fs/reiserfs/dir.c           |    1 +
 fs/reiserfs/file.c          |    1 +
 fs/romfs/mmap-nommu.c       |    1 +
 fs/romfs/super.c            |    1 +
 fs/squashfs/dir.c           |    3 ++-
 fs/sysv/dir.c               |    1 +
 fs/sysv/file.c              |    1 +
 fs/ubifs/debug.c            |    1 +
 fs/ubifs/dir.c              |    1 +
 fs/ubifs/file.c             |    1 +
 fs/udf/dir.c                |    1 +
 fs/udf/file.c               |    1 +
 fs/ufs/dir.c                |    1 +
 fs/ufs/file.c               |    1 +
 fs/xfs/linux-2.6/xfs_file.c |    2 ++
 62 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/fs/adfs/dir.c b/fs/adfs/dir.c
index 23aa52f..7106f32 100644
--- a/fs/adfs/dir.c
+++ b/fs/adfs/dir.c
@@ -198,6 +198,7 @@ const struct file_operations adfs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.readdir	= adfs_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int
diff --git a/fs/adfs/file.c b/fs/adfs/file.c
index 005ea34..97bd298 100644
--- a/fs/adfs/file.c
+++ b/fs/adfs/file.c
@@ -30,6 +30,7 @@ const struct file_operations adfs_file_operations = {
 	.write		= do_sync_write,
 	.aio_write	= generic_file_aio_write,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations adfs_file_inode_operations = {
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index 8ca8f3a..6cc5e43 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -22,6 +22,7 @@ const struct file_operations affs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.readdir	= affs_readdir,
 	.fsync		= affs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 /*
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 184e55c..d580a12 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -36,6 +36,7 @@ const struct file_operations affs_file_operations = {
 	.release	= affs_file_release,
 	.fsync		= affs_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations affs_file_inode_operations = {
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 34ddda8..b97f79b 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -67,6 +67,7 @@ static const struct file_operations befs_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= befs_readdir,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations befs_dir_inode_operations = {
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index 1e41aad..d78015e 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -80,6 +80,7 @@ const struct file_operations bfs_dir_operations = {
 	.readdir	= bfs_readdir,
 	.fsync		= simple_fsync,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 extern void dump_imap(const char *, struct super_block *);
diff --git a/fs/bfs/file.c b/fs/bfs/file.c
index 88b9a3f..7f61ed6 100644
--- a/fs/bfs/file.c
+++ b/fs/bfs/file.c
@@ -29,6 +29,7 @@ const struct file_operations bfs_file_operations = {
 	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int bfs_move_block(unsigned long from, unsigned long to,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 6ed434a..281a2b8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1164,4 +1164,5 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4deb280..606c31d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5971,6 +5971,7 @@ static const struct file_operations btrfs_dir_file_operations = {
 #endif
 	.release        = btrfs_release_file,
 	.fsync		= btrfs_sync_file,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct extent_io_ops btrfs_extent_io_ops = {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8a1ea6e..7a28ac5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -718,6 +718,7 @@ static const struct file_operations btrfs_ctl_fops = {
 	.unlocked_ioctl	 = btrfs_control_ioctl,
 	.compat_ioctl = btrfs_control_ioctl,
 	.owner	 = THIS_MODULE,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static struct miscdevice btrfs_misc = {
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index dd3634e..0927503 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -532,6 +532,7 @@ static const struct file_operations cramfs_directory_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= cramfs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations cramfs_dir_inode_operations = {
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index 678172b..a8973ef 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -305,6 +305,7 @@ const struct file_operations ecryptfs_dir_fops = {
 	.fsync = ecryptfs_fsync,
 	.fasync = ecryptfs_fasync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct file_operations ecryptfs_main_fops = {
@@ -322,6 +323,7 @@ const struct file_operations ecryptfs_main_fops = {
 	.fsync = ecryptfs_fsync,
 	.fasync = ecryptfs_fasync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static int
diff --git a/fs/ecryptfs/miscdev.c b/fs/ecryptfs/miscdev.c
index 4ec8f61..9fd9b39 100644
--- a/fs/ecryptfs/miscdev.c
+++ b/fs/ecryptfs/miscdev.c
@@ -481,6 +481,7 @@ static const struct file_operations ecryptfs_miscdev_fops = {
 	.read    = ecryptfs_miscdev_read,
 	.write   = ecryptfs_miscdev_write,
 	.release = ecryptfs_miscdev_release,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static struct miscdevice ecryptfs_miscdev = {
diff --git a/fs/efs/dir.c b/fs/efs/dir.c
index 7ee6f7e..da344b8 100644
--- a/fs/efs/dir.c
+++ b/fs/efs/dir.c
@@ -13,6 +13,7 @@ const struct file_operations efs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= efs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations efs_dir_inode_operations = {
diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 4cfab1c..f6693d3 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -667,4 +667,5 @@ const struct file_operations exofs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= exofs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
index 839b9dc..257e9da 100644
--- a/fs/exofs/file.c
+++ b/fs/exofs/file.c
@@ -73,6 +73,7 @@ static int exofs_flush(struct file *file, fl_owner_t id)
 
 const struct file_operations exofs_file_operations = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index 530b4ca..e3fa353 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -841,6 +841,7 @@ const struct file_operations fat_dir_operations = {
 	.compat_ioctl	= fat_compat_dir_ioctl,
 #endif
 	.fsync		= fat_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int fat_get_short_entry(struct inode *dir, loff_t *pos,
diff --git a/fs/fat/file.c b/fs/fat/file.c
index e8c159d..e5aecc6 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -162,6 +162,7 @@ const struct file_operations fat_file_operations = {
 	.ioctl		= fat_generic_ioctl,
 	.fsync		= fat_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int fat_cont_expand(struct inode *inode, loff_t size)
diff --git a/fs/freevxfs/vxfs_lookup.c b/fs/freevxfs/vxfs_lookup.c
index aee049c..3a09132 100644
--- a/fs/freevxfs/vxfs_lookup.c
+++ b/fs/freevxfs/vxfs_lookup.c
@@ -58,6 +58,7 @@ const struct inode_operations vxfs_dir_inode_ops = {
 
 const struct file_operations vxfs_dir_operations = {
 	.readdir =		vxfs_readdir,
+	.checkpoint =		generic_file_checkpoint,
 };
 
  
diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c
index 2b3b861..0eef6c2 100644
--- a/fs/hfs/dir.c
+++ b/fs/hfs/dir.c
@@ -329,6 +329,7 @@ const struct file_operations hfs_dir_operations = {
 	.readdir	= hfs_readdir,
 	.llseek		= generic_file_llseek,
 	.release	= hfs_dir_release,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations hfs_dir_inode_operations = {
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index a1cbff2..bf8950f 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -607,6 +607,7 @@ static const struct file_operations hfs_file_operations = {
 	.fsync		= file_fsync,
 	.open		= hfs_file_open,
 	.release	= hfs_file_release,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations hfs_file_inode_operations = {
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 5f40236..41fbf2d 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -497,4 +497,5 @@ const struct file_operations hfsplus_dir_operations = {
 	.ioctl          = hfsplus_ioctl,
 	.llseek		= generic_file_llseek,
 	.release	= hfsplus_dir_release,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 1bcf597..19abd7e 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -286,6 +286,7 @@ static const struct file_operations hfsplus_file_operations = {
 	.open		= hfsplus_file_open,
 	.release	= hfsplus_file_release,
 	.ioctl          = hfsplus_ioctl,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 struct inode *hfsplus_new_inode(struct super_block *sb, int mode)
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 032604e..67e2356 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -417,6 +417,7 @@ int hostfs_fsync(struct file *file, struct dentry *dentry, int datasync)
 
 static const struct file_operations hostfs_file_fops = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.splice_read	= generic_file_splice_read,
 	.aio_read	= generic_file_aio_read,
@@ -430,6 +431,7 @@ static const struct file_operations hostfs_file_fops = {
 
 static const struct file_operations hostfs_dir_fops = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.readdir	= hostfs_readdir,
 	.read		= generic_read_dir,
 };
diff --git a/fs/hpfs/dir.c b/fs/hpfs/dir.c
index 8865c94..dcde10f 100644
--- a/fs/hpfs/dir.c
+++ b/fs/hpfs/dir.c
@@ -322,4 +322,5 @@ const struct file_operations hpfs_dir_ops =
 	.readdir	= hpfs_readdir,
 	.release	= hpfs_dir_release,
 	.fsync		= hpfs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 3efabff..f1211f0 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -139,6 +139,7 @@ const struct file_operations hpfs_file_ops =
 	.release	= hpfs_file_release,
 	.fsync		= hpfs_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations hpfs_file_iops =
diff --git a/fs/hppfs/hppfs.c b/fs/hppfs/hppfs.c
index 7239efc..e3c3bd3 100644
--- a/fs/hppfs/hppfs.c
+++ b/fs/hppfs/hppfs.c
@@ -546,6 +546,7 @@ static const struct file_operations hppfs_file_fops = {
 	.read		= hppfs_read,
 	.write		= hppfs_write,
 	.open		= hppfs_open,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 struct hppfs_dirent {
@@ -597,6 +598,7 @@ static const struct file_operations hppfs_dir_fops = {
 	.readdir	= hppfs_readdir,
 	.open		= hppfs_dir_open,
 	.fsync		= hppfs_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int hppfs_statfs(struct dentry *dentry, struct kstatfs *sf)
diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c
index 8ba5441..848059d 100644
--- a/fs/isofs/dir.c
+++ b/fs/isofs/dir.c
@@ -273,6 +273,7 @@ const struct file_operations isofs_dir_operations =
 {
 	.read = generic_read_dir,
 	.readdir = isofs_readdir,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /*
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 7aa4417..c7c4dcb 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -41,6 +41,7 @@ const struct file_operations jffs2_dir_operations =
 	.unlocked_ioctl=jffs2_ioctl,
 	.fsync =	jffs2_fsync,
 	.llseek =	generic_file_llseek,
+	.checkpoint =	generic_file_checkpoint,
 };
 
 
diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c
index b7b74e2..f01038d 100644
--- a/fs/jffs2/file.c
+++ b/fs/jffs2/file.c
@@ -50,6 +50,7 @@ const struct file_operations jffs2_file_operations =
 	.mmap =		generic_file_readonly_mmap,
 	.fsync =	jffs2_fsync,
 	.splice_read =	generic_file_splice_read,
+	.checkpoint =	generic_file_checkpoint,
 };
 
 /* jffs2_file_inode_operations */
diff --git a/fs/jfs/file.c b/fs/jfs/file.c
index 2b70fa7..3bd7114 100644
--- a/fs/jfs/file.c
+++ b/fs/jfs/file.c
@@ -116,4 +116,5 @@ const struct file_operations jfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= jfs_compat_ioctl,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index c79a427..585a7d2 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -1556,6 +1556,7 @@ const struct file_operations jfs_dir_operations = {
 	.compat_ioctl	= jfs_compat_ioctl,
 #endif
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int jfs_ci_hash(struct dentry *dir, struct qstr *this)
diff --git a/fs/minix/dir.c b/fs/minix/dir.c
index 6198731..74b6fb4 100644
--- a/fs/minix/dir.c
+++ b/fs/minix/dir.c
@@ -23,6 +23,7 @@ const struct file_operations minix_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= minix_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static inline void dir_put_page(struct page *page)
diff --git a/fs/minix/file.c b/fs/minix/file.c
index 3eec3e6..2048d09 100644
--- a/fs/minix/file.c
+++ b/fs/minix/file.c
@@ -21,6 +21,7 @@ const struct file_operations minix_file_operations = {
 	.mmap		= generic_file_mmap,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations minix_file_inode_operations = {
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 3c7f03b..7d9d22a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -63,6 +63,7 @@ const struct file_operations nfs_dir_operations = {
 	.open		= nfs_opendir,
 	.release	= nfs_release,
 	.fsync		= nfs_fsync_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations nfs_dir_inode_operations = {
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 63f2071..4437ef9 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -78,6 +78,7 @@ const struct file_operations nfs_file_operations = {
 	.splice_write	= nfs_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= nfs_setlease,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations nfs_file_inode_operations = {
@@ -577,6 +578,9 @@ out_unlock:
 static const struct vm_operations_struct nfs_file_vm_ops = {
 	.fault = filemap_fault,
 	.page_mkwrite = nfs_vm_page_mkwrite,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint = filemap_checkpoint,
+#endif
 };
 
 static int nfs_need_sync_write(struct file *filp, struct inode *inode)
diff --git a/fs/nilfs2/dir.c b/fs/nilfs2/dir.c
index 76d803e..18b2171 100644
--- a/fs/nilfs2/dir.c
+++ b/fs/nilfs2/dir.c
@@ -702,5 +702,5 @@ const struct file_operations nilfs_dir_operations = {
 	.compat_ioctl	= nilfs_ioctl,
 #endif	/* CONFIG_COMPAT */
 	.fsync		= nilfs_sync_file,
-
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 30292df..4d585b5 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -136,6 +136,7 @@ static int nilfs_file_mmap(struct file *file, struct vm_area_struct *vma)
  */
 const struct file_operations nilfs_file_operations = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
diff --git a/fs/ntfs/dir.c b/fs/ntfs/dir.c
index 5a9e344..4fe3759 100644
--- a/fs/ntfs/dir.c
+++ b/fs/ntfs/dir.c
@@ -1572,4 +1572,5 @@ const struct file_operations ntfs_dir_ops = {
 	/*.ioctl	= ,*/			/* Perform function on the
 						   mounted filesystem. */
 	.open		= ntfs_dir_open,	/* Open directory. */
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 43179dd..32a43f5 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2224,7 +2224,7 @@ const struct file_operations ntfs_file_ops = {
 						    mounted filesystem. */
 	.mmap		= generic_file_mmap,	 /* Mmap file. */
 	.open		= ntfs_file_open,	 /* Open file. */
-	.splice_read	= generic_file_splice_read /* Zero-copy data send with
+	.splice_read	= generic_file_splice_read, /* Zero-copy data send with
 						    the data source being on
 						    the ntfs partition.  We do
 						    not need to care about the
@@ -2234,6 +2234,7 @@ const struct file_operations ntfs_file_ops = {
 						    on the ntfs partition.  We
 						    do not need to care about
 						    the data source. */
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ntfs_file_inode_ops = {
diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c
index b42d624..e924e33 100644
--- a/fs/omfs/dir.c
+++ b/fs/omfs/dir.c
@@ -502,4 +502,5 @@ const struct file_operations omfs_dir_operations = {
 	.read = generic_read_dir,
 	.readdir = omfs_readdir,
 	.llseek = generic_file_llseek,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/omfs/file.c b/fs/omfs/file.c
index 399487c..83e63ef 100644
--- a/fs/omfs/file.c
+++ b/fs/omfs/file.c
@@ -331,6 +331,7 @@ const struct file_operations omfs_file_operations = {
 	.mmap = generic_file_mmap,
 	.fsync = simple_fsync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations omfs_file_inops = {
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index ffcd04f..d1f0677 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -160,6 +160,7 @@ static const struct file_operations openpromfs_prop_ops = {
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= seq_release,
+	.checkpoint	= NULL,
 };
 
 static int openpromfs_readdir(struct file *, void *, filldir_t);
@@ -168,6 +169,7 @@ static const struct file_operations openprom_operations = {
 	.read		= generic_read_dir,
 	.readdir	= openpromfs_readdir,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct dentry *openpromfs_lookup(struct inode *, struct dentry *, struct nameidata *);
diff --git a/fs/qnx4/dir.c b/fs/qnx4/dir.c
index 6f30c3d..fa14c55 100644
--- a/fs/qnx4/dir.c
+++ b/fs/qnx4/dir.c
@@ -80,6 +80,7 @@ const struct file_operations qnx4_dir_operations =
 	.read		= generic_read_dir,
 	.readdir	= qnx4_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations qnx4_dir_inode_operations =
diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 78f613c..4430239 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -47,6 +47,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index 1739a4a..9cd6208 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -45,6 +45,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read		= generic_file_splice_read,
 	.splice_write		= generic_file_splice_write,
 	.llseek			= generic_file_llseek,
+	.checkpoint		= generic_file_checkpoint,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/read_write.c b/fs/read_write.c
index e258301..65371e1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -27,6 +27,7 @@ const struct file_operations generic_ro_fops = {
 	.aio_read	= generic_file_aio_read,
 	.mmap		= generic_file_readonly_mmap,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 EXPORT_SYMBOL(generic_ro_fops);
diff --git a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c
index c094f58..8681419 100644
--- a/fs/reiserfs/dir.c
+++ b/fs/reiserfs/dir.c
@@ -24,6 +24,7 @@ const struct file_operations reiserfs_dir_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = reiserfs_compat_ioctl,
 #endif
+	.checkpoint = generic_file_checkpoint,
 };
 
 static int reiserfs_dir_fsync(struct file *filp, struct dentry *dentry,
diff --git a/fs/reiserfs/file.c b/fs/reiserfs/file.c
index da2dba0..b6008f3 100644
--- a/fs/reiserfs/file.c
+++ b/fs/reiserfs/file.c
@@ -297,6 +297,7 @@ const struct file_operations reiserfs_file_operations = {
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
 	.llseek = generic_file_llseek,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations reiserfs_file_inode_operations = {
diff --git a/fs/romfs/mmap-nommu.c b/fs/romfs/mmap-nommu.c
index f0511e8..03c24d9 100644
--- a/fs/romfs/mmap-nommu.c
+++ b/fs/romfs/mmap-nommu.c
@@ -72,4 +72,5 @@ const struct file_operations romfs_ro_fops = {
 	.splice_read		= generic_file_splice_read,
 	.mmap			= romfs_mmap,
 	.get_unmapped_area	= romfs_get_unmapped_area,
+	.checkpoint		= generic_file_checkpoint,
 };
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 42d2135..476ea8e 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -282,6 +282,7 @@ error:
 static const struct file_operations romfs_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= romfs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations romfs_dir_inode_operations = {
diff --git a/fs/squashfs/dir.c b/fs/squashfs/dir.c
index 566b0ea..b0c5336 100644
--- a/fs/squashfs/dir.c
+++ b/fs/squashfs/dir.c
@@ -231,5 +231,6 @@ failed_read:
 
 const struct file_operations squashfs_dir_ops = {
 	.read = generic_read_dir,
-	.readdir = squashfs_readdir
+	.readdir = squashfs_readdir,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c
index 4e50286..53acd29 100644
--- a/fs/sysv/dir.c
+++ b/fs/sysv/dir.c
@@ -25,6 +25,7 @@ const struct file_operations sysv_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= sysv_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static inline void dir_put_page(struct page *page)
diff --git a/fs/sysv/file.c b/fs/sysv/file.c
index 96340c0..aee556d 100644
--- a/fs/sysv/file.c
+++ b/fs/sysv/file.c
@@ -28,6 +28,7 @@ const struct file_operations sysv_file_operations = {
 	.mmap		= generic_file_mmap,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations sysv_file_inode_operations = {
diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c
index 9049232..e4f23c6 100644
--- a/fs/ubifs/debug.c
+++ b/fs/ubifs/debug.c
@@ -2623,6 +2623,7 @@ static ssize_t write_debugfs_file(struct file *file, const char __user *buf,
 static const struct file_operations dfs_fops = {
 	.open = open_debugfs_file,
 	.write = write_debugfs_file,
+	.checkpoint = generic_file_checkpoint,
 	.owner = THIS_MODULE,
 };
 
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 552fb01..89ab2aa 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -1228,4 +1228,5 @@ const struct file_operations ubifs_dir_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ubifs_compat_ioctl,
 #endif
+	.checkpoint     = generic_file_checkpoint,
 };
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 16a6444..254a4d9 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1582,4 +1582,5 @@ const struct file_operations ubifs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ubifs_compat_ioctl,
 #endif
+	.checkpoint     = generic_file_checkpoint,
 };
diff --git a/fs/udf/dir.c b/fs/udf/dir.c
index 61d9a76..6586dbe 100644
--- a/fs/udf/dir.c
+++ b/fs/udf/dir.c
@@ -211,4 +211,5 @@ const struct file_operations udf_dir_operations = {
 	.readdir		= udf_readdir,
 	.ioctl			= udf_ioctl,
 	.fsync			= simple_fsync,
+	.checkpoint		= generic_file_checkpoint,
 };
diff --git a/fs/udf/file.c b/fs/udf/file.c
index f311d50..e671552 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -215,6 +215,7 @@ const struct file_operations udf_file_operations = {
 	.fsync			= simple_fsync,
 	.splice_read		= generic_file_splice_read,
 	.llseek			= generic_file_llseek,
+	.checkpoint		= generic_file_checkpoint,
 };
 
 const struct inode_operations udf_file_inode_operations = {
diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index 22af68f..29c9396 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -668,4 +668,5 @@ const struct file_operations ufs_dir_operations = {
 	.readdir	= ufs_readdir,
 	.fsync		= simple_fsync,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ufs/file.c b/fs/ufs/file.c
index 73655c6..15c8616 100644
--- a/fs/ufs/file.c
+++ b/fs/ufs/file.c
@@ -43,4 +43,5 @@ const struct file_operations ufs_file_operations = {
 	.open           = generic_file_open,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index e4caeb2..926f377 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -259,6 +259,7 @@ const struct file_operations xfs_file_operations = {
 #ifdef HAVE_FOP_OPEN_EXEC
 	.open_exec	= xfs_file_open_exec,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct file_operations xfs_dir_file_operations = {
@@ -271,6 +272,7 @@ const struct file_operations xfs_dir_file_operations = {
 	.compat_ioctl	= xfs_file_compat_ioctl,
 #endif
 	.fsync		= xfs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct vm_operations_struct xfs_file_vm_ops = {
-- 
1.6.3.3

* [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems
  2010-03-19  0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
                   ` (7 preceding siblings ...)
  2010-03-19  0:59 ` [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
@ 2010-03-19  0:59 ` Oren Laadan
  2010-03-19  0:59 ` [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality Oren Laadan
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger

From: Matt Helsley <matthltc@us.ibm.com>

These patches extend the use of the generic file checkpoint operation to
non-extX filesystems which have lseek operations that ensure we can save
and restore the files for later use. Note that this does not include
things like FUSE, network filesystems, or pseudo-filesystem kernel
interfaces.
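
As an illustration only, the opt-in pattern described above can be sketched
in userspace C. Everything prefixed mock_ is hypothetical and is not the
kernel definition; in particular, the two-argument checkpoint signature and
the -EBADF return for a file without a ->checkpoint method are assumptions
made for this sketch. The point is that a file can be checkpointed only if
its operations table supplies the method, and that a working seek is what
lets the saved offset be reinstated at restart.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace mock-ups for illustration; NOT the kernel structures. */
struct mock_file;

struct mock_file_operations {
	long (*llseek)(struct mock_file *f, long offset);
	int (*checkpoint)(struct mock_file *f, long *saved_pos);
};

struct mock_file {
	long f_pos;
	const struct mock_file_operations *f_op;
};

/* Stand-in for a generic checkpoint method: record only the file offset. */
static int mock_generic_file_checkpoint(struct mock_file *f, long *saved_pos)
{
	assert(f != NULL && saved_pos != NULL);
	*saved_pos = f->f_pos;
	return 0;
}

/* Stand-in for a generic seek: absolute positioning only. */
static long mock_generic_llseek(struct mock_file *f, long offset)
{
	if (offset < 0)
		return -EINVAL;
	f->f_pos = offset;
	return offset;
}

/* A table that opts in, the way each hunk in this patch does. */
static const struct mock_file_operations mock_dir_ops = {
	.llseek		= mock_generic_llseek,
	.checkpoint	= mock_generic_file_checkpoint,
};

/* A table that explicitly opts out by leaving the method NULL. */
static const struct mock_file_operations mock_prop_ops = {
	.llseek		= mock_generic_llseek,
	.checkpoint	= NULL,
};

/* Dispatcher: refuse files whose table lacks the method (assumed errno). */
static int mock_checkpoint_file(struct mock_file *f, long *saved_pos)
{
	if (!f->f_op || !f->f_op->checkpoint)
		return -EBADF;
	return f->f_op->checkpoint(f, saved_pos);
}

/* Restart side: reinstate the saved offset through the seek method. */
static int mock_restore_file(struct mock_file *f, long saved_pos)
{
	long ret = f->f_op->llseek(f, saved_pos);

	return ret < 0 ? (int)ret : 0;
}
```

A caller would pass a file that uses mock_dir_ops to mock_checkpoint_file(),
keep the returned offset, and later hand it to mock_restore_file() on a
freshly opened file; a file that uses mock_prop_ops is rejected up front.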

Only compile and boot tested (on x86-32).

[Oren Laadan] Folded patch series into a single patch; original post
included 36 separate patches for individual filesystems:

  [PATCH 01/36] Add the checkpoint operation for affs files and directories.
  [PATCH 02/36] Add the checkpoint operation for befs directories.
  [PATCH 03/36] Add the checkpoint operation for bfs files and directories.
  [PATCH 04/36] Add the checkpoint operation for btrfs files and directories.
  [PATCH 05/36] Add the checkpoint operation for cramfs directories.
  [PATCH 06/36] Add the checkpoint operation for ecryptfs files and directories.
  [PATCH 07/36] Add the checkpoint operation for fat files and directories.
  [PATCH 08/36] Add the checkpoint operation for freevxfs directories.
  [PATCH 09/36] Add the checkpoint operation for hfs files and directories.
  [PATCH 10/36] Add the checkpoint operation for hfsplus files and directories.
  [PATCH 11/36] Add the checkpoint operation for hpfs files and directories.
  [PATCH 12/36] Add the checkpoint operation for hppfs files and directories.
  [PATCH 13/36] Add the checkpoint operation for iso directories.
  [PATCH 14/36] Add the checkpoint operation for jffs2 files and directories.
  [PATCH 15/36] Add the checkpoint operation for jfs files and directories.
  [PATCH 16/36] Add the checkpoint operation for regular nfs files and directories. Skip the various /proc files for now.
  [PATCH 17/36] Add the checkpoint operation for ntfs directories.
  [PATCH 18/36] Add the checkpoint operation for openpromfs directories. Explicitly skip the properties for now.
  [PATCH 19/36] Add the checkpoint operation for qnx4 files and directories.
  [PATCH 20/36] Add the checkpoint operation for reiserfs files and directories.
  [PATCH 21/36] Add the checkpoint operation for romfs directories.
  [PATCH 22/36] Add the checkpoint operation for squashfs directories.
  [PATCH 23/36] Add the checkpoint operation for sysv filesystem files and directories.
  [PATCH 24/36] Add the checkpoint operation for ubifs files and directories.
  [PATCH 25/36] Add the checkpoint operation for udf filesystem files and directories.
  [PATCH 26/36] Add the checkpoint operation for xfs files and directories.
  [PATCH 27/36] Add checkpoint operation for efs directories.
  [PATCH 28/36] Add the checkpoint operation for generic, read-only files. At present, some/all files of the following filesystems use this generic definition:
  [PATCH 29/36] Add checkpoint operation for minix filesystem files and directories.
  [PATCH 30/36] Add checkpoint operations for omfs files and directories.
  [PATCH 31/36] Add checkpoint operations for ufs files and directories.
  [PATCH 32/36] Add checkpoint operations for ramfs files. NOTE: since simple_dir_operations are shared between multiple filesystems including ramfs, it's not currently possible to checkpoint open ramfs directories.
  [PATCH 33/36] Add the checkpoint operation for adfs files and directories.
  [PATCH 34/36] Add the checkpoint operation to exofs files and directories.
  [PATCH 35/36] Add the checkpoint operation to nilfs2 files and directories.
  [PATCH 36/36] Add checkpoint operations for UML host filesystem files and directories.
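
The NOTE under patch 32 (shared simple_dir_operations) can be pictured with
a small userspace sketch. All demo_ names are hypothetical and the
one-argument checkpoint signature is an assumption; the sketch shows only
one plausible reading of the note: a table shared by several filesystems
cannot be opted in for just one of them, so directories that use it stay
uncheckpointable while a per-filesystem file table can still opt in.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mock-ups for illustration; NOT the kernel structures. */
struct demo_file;

struct demo_file_operations {
	int (*checkpoint)(struct demo_file *f);
};

struct demo_file {
	const struct demo_file_operations *f_op;
};

/* Stand-in for a generic checkpoint method; always succeeds here. */
static int demo_generic_checkpoint(struct demo_file *f)
{
	assert(f != NULL);
	return 0;
}

/*
 * One directory table shared by several filesystems. Setting .checkpoint
 * here would opt in every user of the table at once, so it is left NULL.
 */
static const struct demo_file_operations demo_shared_dir_ops = {
	.checkpoint	= NULL,
};

/* A table owned by a single filesystem can opt in independently. */
static const struct demo_file_operations demo_ramfs_file_ops = {
	.checkpoint	= demo_generic_checkpoint,
};

/* A file is checkpointable only if its table supplies the method. */
static int demo_can_checkpoint(const struct demo_file *f)
{
	return f->f_op != NULL && f->f_op->checkpoint != NULL;
}
```

In this model a ramfs directory and a directory of any other filesystem
using the shared table point at the same object, so demo_can_checkpoint()
gives the same answer for both, whereas a ramfs regular file is accepted.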

Changelog[v19-rc3]:
  - [Suka] Enable C/R while executing over NFS

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: linux-fsdevel@vger.kernel.org
---
 fs/adfs/dir.c               |    1 +
 fs/adfs/file.c              |    1 +
 fs/affs/dir.c               |    1 +
 fs/affs/file.c              |    1 +
 fs/befs/linuxvfs.c          |    1 +
 fs/bfs/dir.c                |    1 +
 fs/bfs/file.c               |    1 +
 fs/btrfs/file.c             |    1 +
 fs/btrfs/inode.c            |    1 +
 fs/btrfs/super.c            |    1 +
 fs/cramfs/inode.c           |    1 +
 fs/ecryptfs/file.c          |    2 ++
 fs/ecryptfs/miscdev.c       |    1 +
 fs/efs/dir.c                |    1 +
 fs/exofs/dir.c              |    1 +
 fs/exofs/file.c             |    1 +
 fs/fat/dir.c                |    1 +
 fs/fat/file.c               |    1 +
 fs/freevxfs/vxfs_lookup.c   |    1 +
 fs/hfs/dir.c                |    1 +
 fs/hfs/inode.c              |    1 +
 fs/hfsplus/dir.c            |    1 +
 fs/hfsplus/inode.c          |    1 +
 fs/hostfs/hostfs_kern.c     |    2 ++
 fs/hpfs/dir.c               |    1 +
 fs/hpfs/file.c              |    1 +
 fs/hppfs/hppfs.c            |    2 ++
 fs/isofs/dir.c              |    1 +
 fs/jffs2/dir.c              |    1 +
 fs/jffs2/file.c             |    1 +
 fs/jfs/file.c               |    1 +
 fs/jfs/namei.c              |    1 +
 fs/minix/dir.c              |    1 +
 fs/minix/file.c             |    1 +
 fs/nfs/dir.c                |    1 +
 fs/nfs/file.c               |    4 ++++
 fs/nilfs2/dir.c             |    2 +-
 fs/nilfs2/file.c            |    1 +
 fs/ntfs/dir.c               |    1 +
 fs/ntfs/file.c              |    3 ++-
 fs/omfs/dir.c               |    1 +
 fs/omfs/file.c              |    1 +
 fs/openpromfs/inode.c       |    2 ++
 fs/qnx4/dir.c               |    1 +
 fs/ramfs/file-mmu.c         |    1 +
 fs/ramfs/file-nommu.c       |    1 +
 fs/read_write.c             |    1 +
 fs/reiserfs/dir.c           |    1 +
 fs/reiserfs/file.c          |    1 +
 fs/romfs/mmap-nommu.c       |    1 +
 fs/romfs/super.c            |    1 +
 fs/squashfs/dir.c           |    3 ++-
 fs/sysv/dir.c               |    1 +
 fs/sysv/file.c              |    1 +
 fs/ubifs/debug.c            |    1 +
 fs/ubifs/dir.c              |    1 +
 fs/ubifs/file.c             |    1 +
 fs/udf/dir.c                |    1 +
 fs/udf/file.c               |    1 +
 fs/ufs/dir.c                |    1 +
 fs/ufs/file.c               |    1 +
 fs/xfs/linux-2.6/xfs_file.c |    2 ++
 62 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/fs/adfs/dir.c b/fs/adfs/dir.c
index 23aa52f..7106f32 100644
--- a/fs/adfs/dir.c
+++ b/fs/adfs/dir.c
@@ -198,6 +198,7 @@ const struct file_operations adfs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.readdir	= adfs_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int
diff --git a/fs/adfs/file.c b/fs/adfs/file.c
index 005ea34..97bd298 100644
--- a/fs/adfs/file.c
+++ b/fs/adfs/file.c
@@ -30,6 +30,7 @@ const struct file_operations adfs_file_operations = {
 	.write		= do_sync_write,
 	.aio_write	= generic_file_aio_write,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations adfs_file_inode_operations = {
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index 8ca8f3a..6cc5e43 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -22,6 +22,7 @@ const struct file_operations affs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.readdir	= affs_readdir,
 	.fsync		= affs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 /*
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 184e55c..d580a12 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -36,6 +36,7 @@ const struct file_operations affs_file_operations = {
 	.release	= affs_file_release,
 	.fsync		= affs_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations affs_file_inode_operations = {
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 34ddda8..b97f79b 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -67,6 +67,7 @@ static const struct file_operations befs_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= befs_readdir,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations befs_dir_inode_operations = {
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index 1e41aad..d78015e 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -80,6 +80,7 @@ const struct file_operations bfs_dir_operations = {
 	.readdir	= bfs_readdir,
 	.fsync		= simple_fsync,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 extern void dump_imap(const char *, struct super_block *);
diff --git a/fs/bfs/file.c b/fs/bfs/file.c
index 88b9a3f..7f61ed6 100644
--- a/fs/bfs/file.c
+++ b/fs/bfs/file.c
@@ -29,6 +29,7 @@ const struct file_operations bfs_file_operations = {
 	.aio_write	= generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int bfs_move_block(unsigned long from, unsigned long to,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 6ed434a..281a2b8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1164,4 +1164,5 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4deb280..606c31d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5971,6 +5971,7 @@ static const struct file_operations btrfs_dir_file_operations = {
 #endif
 	.release        = btrfs_release_file,
 	.fsync		= btrfs_sync_file,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct extent_io_ops btrfs_extent_io_ops = {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8a1ea6e..7a28ac5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -718,6 +718,7 @@ static const struct file_operations btrfs_ctl_fops = {
 	.unlocked_ioctl	 = btrfs_control_ioctl,
 	.compat_ioctl = btrfs_control_ioctl,
 	.owner	 = THIS_MODULE,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static struct miscdevice btrfs_misc = {
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index dd3634e..0927503 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -532,6 +532,7 @@ static const struct file_operations cramfs_directory_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= cramfs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations cramfs_dir_inode_operations = {
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index 678172b..a8973ef 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -305,6 +305,7 @@ const struct file_operations ecryptfs_dir_fops = {
 	.fsync = ecryptfs_fsync,
 	.fasync = ecryptfs_fasync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct file_operations ecryptfs_main_fops = {
@@ -322,6 +323,7 @@ const struct file_operations ecryptfs_main_fops = {
 	.fsync = ecryptfs_fsync,
 	.fasync = ecryptfs_fasync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static int
diff --git a/fs/ecryptfs/miscdev.c b/fs/ecryptfs/miscdev.c
index 4ec8f61..9fd9b39 100644
--- a/fs/ecryptfs/miscdev.c
+++ b/fs/ecryptfs/miscdev.c
@@ -481,6 +481,7 @@ static const struct file_operations ecryptfs_miscdev_fops = {
 	.read    = ecryptfs_miscdev_read,
 	.write   = ecryptfs_miscdev_write,
 	.release = ecryptfs_miscdev_release,
+	.checkpoint = generic_file_checkpoint,
 };
 
 static struct miscdevice ecryptfs_miscdev = {
diff --git a/fs/efs/dir.c b/fs/efs/dir.c
index 7ee6f7e..da344b8 100644
--- a/fs/efs/dir.c
+++ b/fs/efs/dir.c
@@ -13,6 +13,7 @@ const struct file_operations efs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= efs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations efs_dir_inode_operations = {
diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 4cfab1c..f6693d3 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -667,4 +667,5 @@ const struct file_operations exofs_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
 	.readdir	= exofs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
index 839b9dc..257e9da 100644
--- a/fs/exofs/file.c
+++ b/fs/exofs/file.c
@@ -73,6 +73,7 @@ static int exofs_flush(struct file *file, fl_owner_t id)
 
 const struct file_operations exofs_file_operations = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index 530b4ca..e3fa353 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -841,6 +841,7 @@ const struct file_operations fat_dir_operations = {
 	.compat_ioctl	= fat_compat_dir_ioctl,
 #endif
 	.fsync		= fat_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int fat_get_short_entry(struct inode *dir, loff_t *pos,
diff --git a/fs/fat/file.c b/fs/fat/file.c
index e8c159d..e5aecc6 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -162,6 +162,7 @@ const struct file_operations fat_file_operations = {
 	.ioctl		= fat_generic_ioctl,
 	.fsync		= fat_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int fat_cont_expand(struct inode *inode, loff_t size)
diff --git a/fs/freevxfs/vxfs_lookup.c b/fs/freevxfs/vxfs_lookup.c
index aee049c..3a09132 100644
--- a/fs/freevxfs/vxfs_lookup.c
+++ b/fs/freevxfs/vxfs_lookup.c
@@ -58,6 +58,7 @@ const struct inode_operations vxfs_dir_inode_ops = {
 
 const struct file_operations vxfs_dir_operations = {
 	.readdir =		vxfs_readdir,
+	.checkpoint =		generic_file_checkpoint,
 };
 
  
diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c
index 2b3b861..0eef6c2 100644
--- a/fs/hfs/dir.c
+++ b/fs/hfs/dir.c
@@ -329,6 +329,7 @@ const struct file_operations hfs_dir_operations = {
 	.readdir	= hfs_readdir,
 	.llseek		= generic_file_llseek,
 	.release	= hfs_dir_release,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations hfs_dir_inode_operations = {
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index a1cbff2..bf8950f 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -607,6 +607,7 @@ static const struct file_operations hfs_file_operations = {
 	.fsync		= file_fsync,
 	.open		= hfs_file_open,
 	.release	= hfs_file_release,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations hfs_file_inode_operations = {
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 5f40236..41fbf2d 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -497,4 +497,5 @@ const struct file_operations hfsplus_dir_operations = {
 	.ioctl          = hfsplus_ioctl,
 	.llseek		= generic_file_llseek,
 	.release	= hfsplus_dir_release,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 1bcf597..19abd7e 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -286,6 +286,7 @@ static const struct file_operations hfsplus_file_operations = {
 	.open		= hfsplus_file_open,
 	.release	= hfsplus_file_release,
 	.ioctl          = hfsplus_ioctl,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 struct inode *hfsplus_new_inode(struct super_block *sb, int mode)
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 032604e..67e2356 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -417,6 +417,7 @@ int hostfs_fsync(struct file *file, struct dentry *dentry, int datasync)
 
 static const struct file_operations hostfs_file_fops = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.splice_read	= generic_file_splice_read,
 	.aio_read	= generic_file_aio_read,
@@ -430,6 +431,7 @@ static const struct file_operations hostfs_file_fops = {
 
 static const struct file_operations hostfs_dir_fops = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.readdir	= hostfs_readdir,
 	.read		= generic_read_dir,
 };
diff --git a/fs/hpfs/dir.c b/fs/hpfs/dir.c
index 8865c94..dcde10f 100644
--- a/fs/hpfs/dir.c
+++ b/fs/hpfs/dir.c
@@ -322,4 +322,5 @@ const struct file_operations hpfs_dir_ops =
 	.readdir	= hpfs_readdir,
 	.release	= hpfs_dir_release,
 	.fsync		= hpfs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 3efabff..f1211f0 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -139,6 +139,7 @@ const struct file_operations hpfs_file_ops =
 	.release	= hpfs_file_release,
 	.fsync		= hpfs_file_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations hpfs_file_iops =
diff --git a/fs/hppfs/hppfs.c b/fs/hppfs/hppfs.c
index 7239efc..e3c3bd3 100644
--- a/fs/hppfs/hppfs.c
+++ b/fs/hppfs/hppfs.c
@@ -546,6 +546,7 @@ static const struct file_operations hppfs_file_fops = {
 	.read		= hppfs_read,
 	.write		= hppfs_write,
 	.open		= hppfs_open,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 struct hppfs_dirent {
@@ -597,6 +598,7 @@ static const struct file_operations hppfs_dir_fops = {
 	.readdir	= hppfs_readdir,
 	.open		= hppfs_dir_open,
 	.fsync		= hppfs_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int hppfs_statfs(struct dentry *dentry, struct kstatfs *sf)
diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c
index 8ba5441..848059d 100644
--- a/fs/isofs/dir.c
+++ b/fs/isofs/dir.c
@@ -273,6 +273,7 @@ const struct file_operations isofs_dir_operations =
 {
 	.read = generic_read_dir,
 	.readdir = isofs_readdir,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /*
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 7aa4417..c7c4dcb 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -41,6 +41,7 @@ const struct file_operations jffs2_dir_operations =
 	.unlocked_ioctl=jffs2_ioctl,
 	.fsync =	jffs2_fsync,
 	.llseek =	generic_file_llseek,
+	.checkpoint =	generic_file_checkpoint,
 };
 
 
diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c
index b7b74e2..f01038d 100644
--- a/fs/jffs2/file.c
+++ b/fs/jffs2/file.c
@@ -50,6 +50,7 @@ const struct file_operations jffs2_file_operations =
 	.mmap =		generic_file_readonly_mmap,
 	.fsync =	jffs2_fsync,
 	.splice_read =	generic_file_splice_read,
+	.checkpoint =	generic_file_checkpoint,
 };
 
 /* jffs2_file_inode_operations */
diff --git a/fs/jfs/file.c b/fs/jfs/file.c
index 2b70fa7..3bd7114 100644
--- a/fs/jfs/file.c
+++ b/fs/jfs/file.c
@@ -116,4 +116,5 @@ const struct file_operations jfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= jfs_compat_ioctl,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index c79a427..585a7d2 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -1556,6 +1556,7 @@ const struct file_operations jfs_dir_operations = {
 	.compat_ioctl	= jfs_compat_ioctl,
 #endif
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static int jfs_ci_hash(struct dentry *dir, struct qstr *this)
diff --git a/fs/minix/dir.c b/fs/minix/dir.c
index 6198731..74b6fb4 100644
--- a/fs/minix/dir.c
+++ b/fs/minix/dir.c
@@ -23,6 +23,7 @@ const struct file_operations minix_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= minix_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static inline void dir_put_page(struct page *page)
diff --git a/fs/minix/file.c b/fs/minix/file.c
index 3eec3e6..2048d09 100644
--- a/fs/minix/file.c
+++ b/fs/minix/file.c
@@ -21,6 +21,7 @@ const struct file_operations minix_file_operations = {
 	.mmap		= generic_file_mmap,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations minix_file_inode_operations = {
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 3c7f03b..7d9d22a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -63,6 +63,7 @@ const struct file_operations nfs_dir_operations = {
 	.open		= nfs_opendir,
 	.release	= nfs_release,
 	.fsync		= nfs_fsync_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations nfs_dir_inode_operations = {
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 63f2071..4437ef9 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -78,6 +78,7 @@ const struct file_operations nfs_file_operations = {
 	.splice_write	= nfs_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= nfs_setlease,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations nfs_file_inode_operations = {
@@ -577,6 +578,9 @@ out_unlock:
 static const struct vm_operations_struct nfs_file_vm_ops = {
 	.fault = filemap_fault,
 	.page_mkwrite = nfs_vm_page_mkwrite,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint = filemap_checkpoint,
+#endif
 };
 
 static int nfs_need_sync_write(struct file *filp, struct inode *inode)
diff --git a/fs/nilfs2/dir.c b/fs/nilfs2/dir.c
index 76d803e..18b2171 100644
--- a/fs/nilfs2/dir.c
+++ b/fs/nilfs2/dir.c
@@ -702,5 +702,5 @@ const struct file_operations nilfs_dir_operations = {
 	.compat_ioctl	= nilfs_ioctl,
 #endif	/* CONFIG_COMPAT */
 	.fsync		= nilfs_sync_file,
-
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 30292df..4d585b5 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -136,6 +136,7 @@ static int nilfs_file_mmap(struct file *file, struct vm_area_struct *vma)
  */
 const struct file_operations nilfs_file_operations = {
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
diff --git a/fs/ntfs/dir.c b/fs/ntfs/dir.c
index 5a9e344..4fe3759 100644
--- a/fs/ntfs/dir.c
+++ b/fs/ntfs/dir.c
@@ -1572,4 +1572,5 @@ const struct file_operations ntfs_dir_ops = {
 	/*.ioctl	= ,*/			/* Perform function on the
 						   mounted filesystem. */
 	.open		= ntfs_dir_open,	/* Open directory. */
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 43179dd..32a43f5 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2224,7 +2224,7 @@ const struct file_operations ntfs_file_ops = {
 						    mounted filesystem. */
 	.mmap		= generic_file_mmap,	 /* Mmap file. */
 	.open		= ntfs_file_open,	 /* Open file. */
-	.splice_read	= generic_file_splice_read /* Zero-copy data send with
+	.splice_read	= generic_file_splice_read, /* Zero-copy data send with
 						    the data source being on
 						    the ntfs partition.  We do
 						    not need to care about the
@@ -2234,6 +2234,7 @@ const struct file_operations ntfs_file_ops = {
 						    on the ntfs partition.  We
 						    do not need to care about
 						    the data source. */
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ntfs_file_inode_ops = {
diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c
index b42d624..e924e33 100644
--- a/fs/omfs/dir.c
+++ b/fs/omfs/dir.c
@@ -502,4 +502,5 @@ const struct file_operations omfs_dir_operations = {
 	.read = generic_read_dir,
 	.readdir = omfs_readdir,
 	.llseek = generic_file_llseek,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/omfs/file.c b/fs/omfs/file.c
index 399487c..83e63ef 100644
--- a/fs/omfs/file.c
+++ b/fs/omfs/file.c
@@ -331,6 +331,7 @@ const struct file_operations omfs_file_operations = {
 	.mmap = generic_file_mmap,
 	.fsync = simple_fsync,
 	.splice_read = generic_file_splice_read,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations omfs_file_inops = {
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index ffcd04f..d1f0677 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -160,6 +160,7 @@ static const struct file_operations openpromfs_prop_ops = {
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= seq_release,
+	.checkpoint	= NULL,
 };
 
 static int openpromfs_readdir(struct file *, void *, filldir_t);
@@ -168,6 +169,7 @@ static const struct file_operations openprom_operations = {
 	.read		= generic_read_dir,
 	.readdir	= openpromfs_readdir,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static struct dentry *openpromfs_lookup(struct inode *, struct dentry *, struct nameidata *);
diff --git a/fs/qnx4/dir.c b/fs/qnx4/dir.c
index 6f30c3d..fa14c55 100644
--- a/fs/qnx4/dir.c
+++ b/fs/qnx4/dir.c
@@ -80,6 +80,7 @@ const struct file_operations qnx4_dir_operations =
 	.read		= generic_read_dir,
 	.readdir	= qnx4_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations qnx4_dir_inode_operations =
diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 78f613c..4430239 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -47,6 +47,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index 1739a4a..9cd6208 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -45,6 +45,7 @@ const struct file_operations ramfs_file_operations = {
 	.splice_read		= generic_file_splice_read,
 	.splice_write		= generic_file_splice_write,
 	.llseek			= generic_file_llseek,
+	.checkpoint		= generic_file_checkpoint,
 };
 
 const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/read_write.c b/fs/read_write.c
index e258301..65371e1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -27,6 +27,7 @@ const struct file_operations generic_ro_fops = {
 	.aio_read	= generic_file_aio_read,
 	.mmap		= generic_file_readonly_mmap,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 EXPORT_SYMBOL(generic_ro_fops);
diff --git a/fs/reiserfs/dir.c b/fs/reiserfs/dir.c
index c094f58..8681419 100644
--- a/fs/reiserfs/dir.c
+++ b/fs/reiserfs/dir.c
@@ -24,6 +24,7 @@ const struct file_operations reiserfs_dir_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = reiserfs_compat_ioctl,
 #endif
+	.checkpoint = generic_file_checkpoint,
 };
 
 static int reiserfs_dir_fsync(struct file *filp, struct dentry *dentry,
diff --git a/fs/reiserfs/file.c b/fs/reiserfs/file.c
index da2dba0..b6008f3 100644
--- a/fs/reiserfs/file.c
+++ b/fs/reiserfs/file.c
@@ -297,6 +297,7 @@ const struct file_operations reiserfs_file_operations = {
 	.splice_read = generic_file_splice_read,
 	.splice_write = generic_file_splice_write,
 	.llseek = generic_file_llseek,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations reiserfs_file_inode_operations = {
diff --git a/fs/romfs/mmap-nommu.c b/fs/romfs/mmap-nommu.c
index f0511e8..03c24d9 100644
--- a/fs/romfs/mmap-nommu.c
+++ b/fs/romfs/mmap-nommu.c
@@ -72,4 +72,5 @@ const struct file_operations romfs_ro_fops = {
 	.splice_read		= generic_file_splice_read,
 	.mmap			= romfs_mmap,
 	.get_unmapped_area	= romfs_get_unmapped_area,
+	.checkpoint		= generic_file_checkpoint,
 };
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 42d2135..476ea8e 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -282,6 +282,7 @@ error:
 static const struct file_operations romfs_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= romfs_readdir,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct inode_operations romfs_dir_inode_operations = {
diff --git a/fs/squashfs/dir.c b/fs/squashfs/dir.c
index 566b0ea..b0c5336 100644
--- a/fs/squashfs/dir.c
+++ b/fs/squashfs/dir.c
@@ -231,5 +231,6 @@ failed_read:
 
 const struct file_operations squashfs_dir_ops = {
 	.read = generic_read_dir,
-	.readdir = squashfs_readdir
+	.readdir = squashfs_readdir,
+	.checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c
index 4e50286..53acd29 100644
--- a/fs/sysv/dir.c
+++ b/fs/sysv/dir.c
@@ -25,6 +25,7 @@ const struct file_operations sysv_dir_operations = {
 	.read		= generic_read_dir,
 	.readdir	= sysv_readdir,
 	.fsync		= simple_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static inline void dir_put_page(struct page *page)
diff --git a/fs/sysv/file.c b/fs/sysv/file.c
index 96340c0..aee556d 100644
--- a/fs/sysv/file.c
+++ b/fs/sysv/file.c
@@ -28,6 +28,7 @@ const struct file_operations sysv_file_operations = {
 	.mmap		= generic_file_mmap,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct inode_operations sysv_file_inode_operations = {
diff --git a/fs/ubifs/debug.c b/fs/ubifs/debug.c
index 9049232..e4f23c6 100644
--- a/fs/ubifs/debug.c
+++ b/fs/ubifs/debug.c
@@ -2623,6 +2623,7 @@ static ssize_t write_debugfs_file(struct file *file, const char __user *buf,
 static const struct file_operations dfs_fops = {
 	.open = open_debugfs_file,
 	.write = write_debugfs_file,
+	.checkpoint = generic_file_checkpoint,
 	.owner = THIS_MODULE,
 };
 
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 552fb01..89ab2aa 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -1228,4 +1228,5 @@ const struct file_operations ubifs_dir_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ubifs_compat_ioctl,
 #endif
+	.checkpoint     = generic_file_checkpoint,
 };
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 16a6444..254a4d9 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1582,4 +1582,5 @@ const struct file_operations ubifs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ubifs_compat_ioctl,
 #endif
+	.checkpoint     = generic_file_checkpoint,
 };
diff --git a/fs/udf/dir.c b/fs/udf/dir.c
index 61d9a76..6586dbe 100644
--- a/fs/udf/dir.c
+++ b/fs/udf/dir.c
@@ -211,4 +211,5 @@ const struct file_operations udf_dir_operations = {
 	.readdir		= udf_readdir,
 	.ioctl			= udf_ioctl,
 	.fsync			= simple_fsync,
+	.checkpoint		= generic_file_checkpoint,
 };
diff --git a/fs/udf/file.c b/fs/udf/file.c
index f311d50..e671552 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -215,6 +215,7 @@ const struct file_operations udf_file_operations = {
 	.fsync			= simple_fsync,
 	.splice_read		= generic_file_splice_read,
 	.llseek			= generic_file_llseek,
+	.checkpoint		= generic_file_checkpoint,
 };
 
 const struct inode_operations udf_file_inode_operations = {
diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index 22af68f..29c9396 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -668,4 +668,5 @@ const struct file_operations ufs_dir_operations = {
 	.readdir	= ufs_readdir,
 	.fsync		= simple_fsync,
 	.llseek		= generic_file_llseek,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ufs/file.c b/fs/ufs/file.c
index 73655c6..15c8616 100644
--- a/fs/ufs/file.c
+++ b/fs/ufs/file.c
@@ -43,4 +43,5 @@ const struct file_operations ufs_file_operations = {
 	.open           = generic_file_open,
 	.fsync		= simple_fsync,
 	.splice_read	= generic_file_splice_read,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index e4caeb2..926f377 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -259,6 +259,7 @@ const struct file_operations xfs_file_operations = {
 #ifdef HAVE_FOP_OPEN_EXEC
 	.open_exec	= xfs_file_open_exec,
 #endif
+	.checkpoint	= generic_file_checkpoint,
 };
 
 const struct file_operations xfs_dir_file_operations = {
@@ -271,6 +272,7 @@ const struct file_operations xfs_dir_file_operations = {
 	.compat_ioctl	= xfs_file_compat_ioctl,
 #endif
 	.fsync		= xfs_file_fsync,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 static const struct vm_operations_struct xfs_file_vm_ops = {
-- 
1.6.3.3



* [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 51/96] c/r: support for open pipes Oren Laadan
                     ` (7 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

During c/r of pipes we need to save and restore pipe buffers. But
do_splice() requires two file descriptors, therefore we can't use it,
as we always have one file descriptor (checkpoint image) and one
pipe_inode_info.

This patch exports interfaces that work at the pipe_inode_info level,
namely link_pipe(), do_splice_to() and do_splice_from(). They are used
in the following patch to save and restore pipe buffers without an
unnecessary data copy.

It slightly modifies both do_splice_to() and do_splice_from() to
detect the case of pipe-to-pipe transfer, in which case they invoke
splice_pipe_to_pipe() directly.
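
The pipe detection relies on pipe_info(), which keys off the inode mode
rather than the i_pipe pointer. As a rough userspace analogue (only an
illustration; fd_is_fifo() and fd_is_fifo_example() are hypothetical
helpers, not part of this patch), the same S_ISFIFO() test can be applied
to a file descriptor:

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical userspace analogue of the kernel's pipe_info():
 * decide "is this a pipe?" from the file mode, not from a pointer field.
 * Returns false for non-FIFOs and for descriptors that fstat() rejects.
 */
static bool fd_is_fifo(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return false;

	return S_ISFIFO(st.st_mode);
}

/* Example: both ends of a freshly created pipe report as FIFOs. */
static void fd_is_fifo_example(void)
{
	int fds[2];
	int rc = pipe(fds);

	assert(rc == 0);
	assert(fd_is_fifo(fds[0]));
	assert(fd_is_fifo(fds[1]));

	close(fds[0]);
	close(fds[1]);
}
```

In the kernel the mode check matters because i_pipe shares storage with
i_bdev/i_cdev, so a non-NULL pointer alone proves nothing.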

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 fs/splice.c            |   61 ++++++++++++++++++++++++++++++++---------------
 include/linux/splice.h |    9 +++++++
 2 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 3920866..76acb55 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1051,18 +1051,43 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,
 EXPORT_SYMBOL(generic_splice_sendpage);
 
 /*
+ * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
+ * location, so checking ->i_pipe is not enough to verify that this is a
+ * pipe.
+ */
+static inline struct pipe_inode_info *pipe_info(struct inode *inode)
+{
+	if (S_ISFIFO(inode->i_mode))
+		return inode->i_pipe;
+
+	return NULL;
+}
+
+static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
+			       struct pipe_inode_info *opipe,
+			       size_t len, unsigned int flags);
+
+/*
  * Attempt to initiate a splice from pipe to file.
  */
-static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
-			   loff_t *ppos, size_t len, unsigned int flags)
+long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
+		    loff_t *ppos, size_t len, unsigned int flags)
 {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *,
 				loff_t *, size_t, unsigned int);
+	struct pipe_inode_info *opipe;
 	int ret;
 
 	if (unlikely(!(out->f_mode & FMODE_WRITE)))
 		return -EBADF;
 
+	/* When called directly (e.g. from c/r) output may be a pipe */
+	opipe = pipe_info(out->f_path.dentry->d_inode);
+	if (opipe) {
+		BUG_ON(opipe == pipe);
+		return splice_pipe_to_pipe(pipe, opipe, len, flags);
+	}
+
 	if (unlikely(out->f_flags & O_APPEND))
 		return -EINVAL;
 
@@ -1081,17 +1106,25 @@ static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
 /*
  * Attempt to initiate a splice from a file to a pipe.
  */
-static long do_splice_to(struct file *in, loff_t *ppos,
-			 struct pipe_inode_info *pipe, size_t len,
-			 unsigned int flags)
+long do_splice_to(struct file *in, loff_t *ppos,
+		  struct pipe_inode_info *pipe, size_t len,
+		  unsigned int flags)
 {
 	ssize_t (*splice_read)(struct file *, loff_t *,
 			       struct pipe_inode_info *, size_t, unsigned int);
+	struct pipe_inode_info *ipipe;
 	int ret;
 
 	if (unlikely(!(in->f_mode & FMODE_READ)))
 		return -EBADF;
 
+	/* When called directly (e.g. from c/r) input may be a pipe */
+	ipipe = pipe_info(in->f_path.dentry->d_inode);
+	if (ipipe) {
+		BUG_ON(ipipe == pipe);
+		return splice_pipe_to_pipe(ipipe, pipe, len, flags);
+	}
+
 	ret = rw_verify_area(READ, in, ppos, len);
 	if (unlikely(ret < 0))
 		return ret;
@@ -1271,18 +1304,6 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       struct pipe_inode_info *opipe,
 			       size_t len, unsigned int flags);
-/*
- * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
- * location, so checking ->i_pipe is not enough to verify that this is a
- * pipe.
- */
-static inline struct pipe_inode_info *pipe_info(struct inode *inode)
-{
-	if (S_ISFIFO(inode->i_mode))
-		return inode->i_pipe;
-
-	return NULL;
-}
 
 /*
  * Determine where to splice to/from.
@@ -1887,9 +1908,9 @@ retry:
 /*
  * Link contents of ipipe to opipe.
  */
-static int link_pipe(struct pipe_inode_info *ipipe,
-		     struct pipe_inode_info *opipe,
-		     size_t len, unsigned int flags)
+int link_pipe(struct pipe_inode_info *ipipe,
+	      struct pipe_inode_info *opipe,
+	      size_t len, unsigned int flags)
 {
 	struct pipe_buffer *ibuf, *obuf;
 	int ret = 0, i = 0, nbuf;
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 18e7c7c..431662c 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -82,4 +82,13 @@ extern ssize_t splice_to_pipe(struct pipe_inode_info *,
 extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *,
 				      splice_direct_actor *);
 
+extern int link_pipe(struct pipe_inode_info *ipipe,
+		     struct pipe_inode_info *opipe,
+		     size_t len, unsigned int flags);
+extern long do_splice_to(struct file *in, loff_t *ppos,
+			 struct pipe_inode_info *pipe, size_t len,
+			 unsigned int flags);
+extern long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
+			   loff_t *ppos, size_t len, unsigned int flags);
+
 #endif
-- 
1.6.3.3


* [C/R v20][PATCH 51/96] c/r: support for open pipes
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs Oren Laadan
                     ` (6 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

A pipe is a double-headed inode with a buffer attached to it. We
checkpoint the pipe buffer only once, as soon as we hit one side of
the pipe, regardless of whether it is the read or the write end.

To checkpoint a file descriptor that refers to a pipe (either end), we
first look up the inode in the hash table. If it is not found, this is
the first encounter of this pipe: besides the file descriptor, we also
(a) save the pipe data, and (b) register the pipe inode in the hash. If
it is found, this is the second encounter of this pipe, i.e. we have hit
the other end of the same pipe. In both cases we write the pipe_objref
of the inode.
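
The "first encounter" logic above can be sketched in plain C. This is a
simplified stand-in, not the real checkpoint object hash: struct
obj_table, OBJ_TABLE_MAX and obj_lookup_add() are hypothetical names
that only mirror the shape of the ckpt_obj_lookup_add() call used in the
patch.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_TABLE_MAX 64	/* arbitrary size for the sketch */

/* Hypothetical stand-in for the checkpoint object hash: a flat array. */
struct obj_table {
	const void *objs[OBJ_TABLE_MAX];
	int count;
};

/*
 * Return the 1-based objref for 'obj', adding it if it is new.
 * '*first' is set to 1 on the first encounter and to 0 afterwards.
 * Returns -1 if the table is full.
 */
static int obj_lookup_add(struct obj_table *t, const void *obj, int *first)
{
	int i;

	assert(t && obj && first);

	for (i = 0; i < t->count; i++) {
		if (t->objs[i] == obj) {
			*first = 0;	/* seen before: other end of the pipe */
			return i + 1;
		}
	}

	if (t->count == OBJ_TABLE_MAX)
		return -1;

	t->objs[t->count++] = obj;
	*first = 1;		/* caller should also save the pipe data */
	return t->count;
}
```

Both ends of a pipe share one inode, so both get the same objref, and
only the caller that sees first == 1 writes the buffer contents.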

To restore, we create a new pipe and thus have two file pointers (the
read and write ends). We use only one of them at first, depending on
which side was checkpointed first. We register the file pointer of the
other end in the hash table, under the pipe_objref recorded for this
pipe in the checkpoint, to be used later when the other end arrives. At
this point we also restore the contents of the pipe buffers.

To save the pipe buffer, given a source pipe, use link_pipe() (the
helper behind do_tee()) to clone its contents into a temporary
'struct pipe_inode_info', and then use do_splice_from() to transfer it
directly to the checkpoint image file.
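
For readers more familiar with the syscall side, the same
clone-then-splice idea can be approximated from userspace with tee(2)
and splice(2). This is only a sketch under assumptions: snapshot_pipe()
is a hypothetical helper, it works on file descriptors rather than on
pipe_inode_info, and it assumes the buffered data fits in the temporary
pipe.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>

/*
 * Copy whatever is currently buffered in the pipe read end 'src_rd' to
 * 'out_fd' without consuming it from the source pipe.
 * Returns the number of bytes saved, or -1 on error.
 */
static ssize_t snapshot_pipe(int src_rd, int out_fd)
{
	int tmp[2];
	ssize_t len, done = 0;

	assert(src_rd >= 0 && out_fd >= 0);

	if (pipe(tmp) < 0)
		return -1;

	/* tee() duplicates the buffers; the source pipe keeps its data */
	len = tee(src_rd, tmp[1], INT_MAX, SPLICE_F_NONBLOCK);
	if (len < 0 && errno == EAGAIN)
		len = 0;	/* nothing buffered right now */
	if (len < 0)
		done = -1;

	/* drain the temporary pipe into the output file */
	while (done >= 0 && done < len) {
		ssize_t n = splice(tmp[0], NULL, out_fd, NULL,
				   (size_t)(len - done), 0);
		if (n <= 0) {
			done = -1;
			break;
		}
		done += n;
	}

	close(tmp[0]);
	close(tmp[1]);
	return done;
}
```

As in checkpoint_pipe(), an empty source (-EAGAIN) is treated as a
zero-length buffer rather than as a failure.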

To restore the pipe buffer, with a freshly allocated target pipe, use
do_splice_to() to splice the data directly from the checkpoint image
file into the pipe.
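
A userspace approximation of the restore direction, again only a sketch:
refill_pipe() is a hypothetical helper that uses splice(2) on file
descriptors, and it assumes 'len' fits in the free space of the target
pipe so that the call does not block. A short transfer is reported as
EPIPE, similar to the check in restore_pipe().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Move exactly 'len' bytes from the current position of 'img_fd' into
 * the pipe write end 'pipe_wr'.
 * Returns 0 on success, -1 on error (errno is EPIPE if the image ended
 * before 'len' bytes were transferred).
 */
static int refill_pipe(int img_fd, int pipe_wr, size_t len)
{
	size_t done = 0;

	assert(img_fd >= 0 && pipe_wr >= 0);

	while (done < len) {
		ssize_t n = splice(img_fd, NULL, pipe_wr, NULL, len - done, 0);

		if (n < 0)
			return -1;
		if (n == 0) {		/* image ended early */
			errno = EPIPE;
			return -1;
		}
		done += (size_t)n;
	}

	return 0;
}
```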

Changelog[v19-rc1]:
  - Switch to ckpt_obj_try_fetch()
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
  - Adjust format of pipe buffer to include the mandatory pre-header
Changelog[v17]:
  - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/files.c             |    7 ++
 fs/pipe.c                      |  157 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    9 +++
 include/linux/pipe_fs_i.h      |    8 ++
 4 files changed, 181 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index b404c8f..1c294fe 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -17,6 +17,7 @@
 #include <linux/file.h>
 #include <linux/fdtable.h>
 #include <linux/fsnotify.h>
+#include <linux/pipe_fs_i.h>
 #include <linux/syscalls.h>
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
@@ -592,6 +593,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_GENERIC,
 		.restore = generic_file_restore,
 	},
+	/* pipes */
+	{
+		.file_name = "PIPE",
+		.file_type = CKPT_FILE_PIPE,
+		.restore = pipe_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 37ba29f..747b2d7 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -13,11 +13,13 @@
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/splice.h>
 #include <linux/uio.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/audit.h>
 #include <linux/syscalls.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -828,6 +830,158 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
 	return ret;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
+{
+	struct pipe_inode_info *pipe;
+	int len, ret = -ENOMEM;
+
+	pipe = alloc_pipe_info(NULL);
+	if (!pipe)
+		return ret;
+
+	pipe->readers = 1;	/* bluff link_pipe() below */
+	len = link_pipe(inode->i_pipe, pipe, INT_MAX, SPLICE_F_NONBLOCK);
+	if (len == -EAGAIN)
+		len = 0;
+	if (len < 0) {
+		ret = len;
+		goto out;
+	}
+
+	ret = ckpt_write_obj_type(ctx, NULL, len, CKPT_HDR_PIPE_BUF);
+	if (ret < 0)
+		goto out;
+
+	ret = do_splice_from(pipe, ctx->file, &ctx->file->f_pos, len, 0);
+	if (ret < 0)
+		goto out;
+	if (ret != len)
+		ret = -EPIPE;  /* can occur due to an error in target file */
+ out:
+	__free_pipe_info(pipe);
+	return ret;
+}
+
+static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_pipe *h;
+	struct inode *inode = file->f_dentry->d_inode;
+	int objref, first, ret;
+
+	objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+	if (objref < 0)
+		return objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_PIPE;
+	h->pipe_objref = objref;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+
+	if (first)
+		ret = checkpoint_pipe(ctx, inode);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int restore_pipe(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct pipe_inode_info *pipe;
+	int len, ret;
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_PIPE_BUF);
+	if (len < 0)
+		return len;
+
+	pipe = file->f_dentry->d_inode->i_pipe;
+	ret = do_splice_to(ctx->file, &ctx->file->f_pos, pipe, len, 0);
+
+	if (ret >= 0 && ret != len)
+		ret = -EPIPE;  /* can occur due to an error in source file */
+
+	return ret;
+}
+
+struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr;
+	struct file *file;
+	int fds[2], which, ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_PIPE)
+		return ERR_PTR(-EINVAL);
+
+	if (h->pipe_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	file = ckpt_obj_try_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE);
+	/*
+	 * If ckpt_obj_try_fetch() returned ERR_PTR(-EINVAL), then this is
+	 * the first time we see this pipe, so we need to restore the
+	 * contents.  Otherwise, use the file pointer and skip forward.
+	 */
+	if (!IS_ERR(file)) {
+		get_file(file);
+	} else if (PTR_ERR(file) == -EINVAL) {
+		/* first encounter of this pipe: create it */
+		ret = do_pipe_flags(fds, 0);
+		if (ret < 0)
+			return file;
+
+		which = (ptr->f_flags & O_WRONLY ? 1 : 0);
+		/*
+		 * Below we return the file corresponding to one side
+		 * of the pipe for our caller to use. Now insert the
+		 * other side of the pipe to the hash, to be picked up
+		 * when that side is restored.
+		 */
+		file = fget(fds[1-which]);	/* the 'other' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+		ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+
+		ret = restore_pipe(ctx, file);
+		fput(file);
+		if (ret < 0)
+			return ERR_PTR(ret);
+
+		file = fget(fds[which]);	/* 'this' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+
+		/* get rid of the file descriptors (caller sets that) */
+		sys_close(fds[which]);
+		sys_close(fds[1-which]);
+	} else {
+		return file;
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+ out:
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
+#else
+#define pipe_file_checkpoint  NULL
+#endif /* CONFIG_CHECKPOINT */
+
 /*
  * The file_operations structs are not static because they
  * are also used in linux/fs/fifo.c to do operations on FIFOs.
@@ -844,6 +998,7 @@ const struct file_operations read_pipefifo_fops = {
 	.open		= pipe_read_open,
 	.release	= pipe_read_release,
 	.fasync		= pipe_read_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations write_pipefifo_fops = {
@@ -856,6 +1011,7 @@ const struct file_operations write_pipefifo_fops = {
 	.open		= pipe_write_open,
 	.release	= pipe_write_release,
 	.fasync		= pipe_write_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations rdwr_pipefifo_fops = {
@@ -869,6 +1025,7 @@ const struct file_operations rdwr_pipefifo_fops = {
 	.open		= pipe_rdwr_open,
 	.release	= pipe_rdwr_release,
 	.fasync		= pipe_rdwr_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 struct pipe_inode_info * alloc_pipe_info(struct inode *inode)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 6fae6ef..885d06b 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -90,6 +90,8 @@ enum {
 #define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME
 	CKPT_HDR_FILE,
 #define CKPT_HDR_FILE CKPT_HDR_FILE
+	CKPT_HDR_PIPE_BUF,
+#define CKPT_HDR_PIPE_BUF CKPT_HDR_PIPE_BUF
 
 	CKPT_HDR_MM = 401,
 #define CKPT_HDR_MM CKPT_HDR_MM
@@ -277,6 +279,8 @@ enum file_type {
 #define CKPT_FILE_IGNORE CKPT_FILE_IGNORE
 	CKPT_FILE_GENERIC,
 #define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
+	CKPT_FILE_PIPE,
+#define CKPT_FILE_PIPE CKPT_FILE_PIPE
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
@@ -296,6 +300,11 @@ struct ckpt_hdr_file_generic {
 	struct ckpt_hdr_file common;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_file_pipe {
+	struct ckpt_hdr_file common;
+	__s32 pipe_objref;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index b43a9e0..e526a12 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -154,4 +154,12 @@ int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);
 void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
 
+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern struct file *pipe_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *ptr);
+#endif
+
 #endif
-- 
1.6.3.3


* [C/R v20][PATCH 51/96] c/r: support for open pipes
  2010-03-19  0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
                   ` (9 preceding siblings ...)
  2010-03-19  0:59 ` [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality Oren Laadan
@ 2010-03-19  0:59 ` Oren Laadan
  2010-03-19  0:59 ` [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs Oren Laadan
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan

A pipe is a double-headed inode with a buffer attached to it. We
checkpoint the pipe buffer only once, as soon as we hit one side of
the pipe, regardless of whether it is the read or the write end.

To checkpoint a file descriptor that refers to a pipe (either end), we
first look up the inode in the hash table. If it is not found, this is
the first encounter of this pipe: besides the file descriptor, we also
(a) save the pipe data, and (b) register the pipe inode in the hash. If
it is found, this is the second encounter of this pipe, i.e. we have hit
the other end of the same pipe. In both cases we write the pipe_objref
of the inode.

To restore, we create a new pipe and thus have two file pointers (the
read and write ends). We use only one of them at first, depending on
which side was checkpointed first. We register the file pointer of the
other end in the hash table, under the pipe_objref recorded for this
pipe in the checkpoint, to be used later when the other end arrives. At
this point we also restore the contents of the pipe buffers.

To save the pipe buffer, given a source pipe, use link_pipe() (the
helper behind do_tee()) to clone its contents into a temporary
'struct pipe_inode_info', and then use do_splice_from() to transfer it
directly to the checkpoint image file.

To restore the pipe buffer, with a freshly allocated target pipe, use
do_splice_to() to splice the data directly from the checkpoint image
file into the pipe.

Changelog[v19-rc1]:
  - Switch to ckpt_obj_try_fetch()
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
  - Adjust format of pipe buffer to include the mandatory pre-header
Changelog[v17]:
  - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c             |    7 ++
 fs/pipe.c                      |  157 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    9 +++
 include/linux/pipe_fs_i.h      |    8 ++
 4 files changed, 181 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index b404c8f..1c294fe 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -17,6 +17,7 @@
 #include <linux/file.h>
 #include <linux/fdtable.h>
 #include <linux/fsnotify.h>
+#include <linux/pipe_fs_i.h>
 #include <linux/syscalls.h>
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
@@ -592,6 +593,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_GENERIC,
 		.restore = generic_file_restore,
 	},
+	/* pipes */
+	{
+		.file_name = "PIPE",
+		.file_type = CKPT_FILE_PIPE,
+		.restore = pipe_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 37ba29f..747b2d7 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -13,11 +13,13 @@
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/splice.h>
 #include <linux/uio.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/audit.h>
 #include <linux/syscalls.h>
+#include <linux/checkpoint.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -828,6 +830,158 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
 	return ret;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
+{
+	struct pipe_inode_info *pipe;
+	int len, ret = -ENOMEM;
+
+	pipe = alloc_pipe_info(NULL);
+	if (!pipe)
+		return ret;
+
+	pipe->readers = 1;	/* bluff link_pipe() below */
+	len = link_pipe(inode->i_pipe, pipe, INT_MAX, SPLICE_F_NONBLOCK);
+	if (len == -EAGAIN)
+		len = 0;
+	if (len < 0) {
+		ret = len;
+		goto out;
+	}
+
+	ret = ckpt_write_obj_type(ctx, NULL, len, CKPT_HDR_PIPE_BUF);
+	if (ret < 0)
+		goto out;
+
+	ret = do_splice_from(pipe, ctx->file, &ctx->file->f_pos, len, 0);
+	if (ret < 0)
+		goto out;
+	if (ret != len)
+		ret = -EPIPE;  /* can occur due to an error in target file */
+ out:
+	__free_pipe_info(pipe);
+	return ret;
+}
+
+static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_pipe *h;
+	struct inode *inode = file->f_dentry->d_inode;
+	int objref, first, ret;
+
+	objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+	if (objref < 0)
+		return objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_PIPE;
+	h->pipe_objref = objref;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+
+	if (first)
+		ret = checkpoint_pipe(ctx, inode);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int restore_pipe(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct pipe_inode_info *pipe;
+	int len, ret;
+
+	len = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_PIPE_BUF);
+	if (len < 0)
+		return len;
+
+	pipe = file->f_dentry->d_inode->i_pipe;
+	ret = do_splice_to(ctx->file, &ctx->file->f_pos, pipe, len, 0);
+
+	if (ret >= 0 && ret != len)
+		ret = -EPIPE;  /* can occur due to an error in source file */
+
+	return ret;
+}
+
+struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr;
+	struct file *file;
+	int fds[2], which, ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_PIPE)
+		return ERR_PTR(-EINVAL);
+
+	if (h->pipe_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	file = ckpt_obj_try_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE);
+	/*
+	 * If ckpt_obj_try_fetch() returned ERR_PTR(-EINVAL), then this is
+	 * the first time we see this pipe so need to restore the
+	 * contents.  Otherwise, use the file pointer and skip forward.
+	 */
+	if (!IS_ERR(file)) {
+		get_file(file);
+	} else if (PTR_ERR(file) == -EINVAL) {
+		/* first encounter of this pipe: create it */
+		ret = do_pipe_flags(fds, 0);
+		if (ret < 0)
+			return ERR_PTR(ret);
+
+		which = (ptr->f_flags & O_WRONLY ? 1 : 0);
+		/*
+		 * Below we return the file corresponding to one side
+		 * of the pipe for our caller to use. Now insert the
+		 * other side of the pipe to the hash, to be picked up
+		 * when that side is restored.
+		 */
+		file = fget(fds[1-which]);	/* the 'other' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+		ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+
+		ret = restore_pipe(ctx, file);
+		fput(file);
+		if (ret < 0)
+			return ERR_PTR(ret);
+
+		file = fget(fds[which]);	/* 'this' side */
+		if (!file)	/* this should _never_ happen ! */
+			return ERR_PTR(-EBADF);
+
+		/* get rid of the file descriptors (caller sets that) */
+		sys_close(fds[which]);
+		sys_close(fds[1-which]);
+	} else {
+		return file;
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+ out:
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
+#else
+#define pipe_file_checkpoint  NULL
+#endif /* CONFIG_CHECKPOINT */
+
 /*
  * The file_operations structs are not static because they
  * are also used in linux/fs/fifo.c to do operations on FIFOs.
@@ -844,6 +998,7 @@ const struct file_operations read_pipefifo_fops = {
 	.open		= pipe_read_open,
 	.release	= pipe_read_release,
 	.fasync		= pipe_read_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations write_pipefifo_fops = {
@@ -856,6 +1011,7 @@ const struct file_operations write_pipefifo_fops = {
 	.open		= pipe_write_open,
 	.release	= pipe_write_release,
 	.fasync		= pipe_write_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 const struct file_operations rdwr_pipefifo_fops = {
@@ -869,6 +1025,7 @@ const struct file_operations rdwr_pipefifo_fops = {
 	.open		= pipe_rdwr_open,
 	.release	= pipe_rdwr_release,
 	.fasync		= pipe_rdwr_fasync,
+	.checkpoint	= pipe_file_checkpoint,
 };
 
 struct pipe_inode_info * alloc_pipe_info(struct inode *inode)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 6fae6ef..885d06b 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -90,6 +90,8 @@ enum {
 #define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME
 	CKPT_HDR_FILE,
 #define CKPT_HDR_FILE CKPT_HDR_FILE
+	CKPT_HDR_PIPE_BUF,
+#define CKPT_HDR_PIPE_BUF CKPT_HDR_PIPE_BUF
 
 	CKPT_HDR_MM = 401,
 #define CKPT_HDR_MM CKPT_HDR_MM
@@ -277,6 +279,8 @@ enum file_type {
 #define CKPT_FILE_IGNORE CKPT_FILE_IGNORE
 	CKPT_FILE_GENERIC,
 #define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
+	CKPT_FILE_PIPE,
+#define CKPT_FILE_PIPE CKPT_FILE_PIPE
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
@@ -296,6 +300,11 @@ struct ckpt_hdr_file_generic {
 	struct ckpt_hdr_file common;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_file_pipe {
+	struct ckpt_hdr_file common;
+	__s32 pipe_objref;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index b43a9e0..e526a12 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -154,4 +154,12 @@ int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);
 int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);
 void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);
 
+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern struct file *pipe_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *ptr);
+#endif
+
 #endif
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 51/96] c/r: support for open pipes Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify Oren Laadan
                     ` (5 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: containers, Andreas Dilger

FIFOs are almost like pipes.

Checkpoint adds the FIFO pathname. The first time the FIFO is found
it also assigns an @objref and dumps the contents of the buffers.

To restore, use the @objref only to determine whether a particular
FIFO has already been restored earlier. Note that it ignores the file
pointer that matches that @objref (unlike with pipes, where that file
corresponds to the other end of the pipe). Instead, it creates a new
FIFO using the saved pathname.
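
The restore side opens the FIFO with O_RDWR so the open cannot block
waiting for a peer, and only afterwards narrows the access mode to what
was saved. The flag arithmetic can be sketched as below. This is an
illustration only: fifo_open_flags(), fifo_mode_fixup() and the enum are
hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <fcntl.h>

/* What the restore code must do to the file after the O_RDWR open */
enum fifo_fixup {
	FIFO_KEEP_RDWR,		/* saved mode was O_RDWR: nothing to fix */
	FIFO_DROP_WRITE,	/* saved mode was O_RDONLY: clear write access */
	FIFO_DROP_READ,		/* saved mode was O_WRONLY: clear read access */
	FIFO_BAD_MODE,		/* access bits are invalid: reject the image */
};

/* Flags used to open the FIFO: keep everything but force O_RDWR */
static int fifo_open_flags(int saved_flags)
{
	int flags = (saved_flags & ~O_ACCMODE) | O_RDWR;

	assert((flags & O_ACCMODE) == O_RDWR);
	return flags;
}

/* Decide how to narrow the access mode once the open has succeeded */
static enum fifo_fixup fifo_mode_fixup(int saved_flags)
{
	switch (saved_flags & O_ACCMODE) {
	case O_RDONLY:
		return FIFO_DROP_WRITE;
	case O_WRONLY:
		return FIFO_DROP_READ;
	case O_RDWR:
		return FIFO_KEEP_RDWR;
	default:
		return FIFO_BAD_MODE;
	}
}
```

For example, a FIFO saved as O_WRONLY | O_NONBLOCK is opened with
O_RDWR | O_NONBLOCK and then has its read access removed.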

Changelog [v19-rc3]:
  - Rebase to kernel 2.6.33
Changelog [v19-rc1]:
  - Switch to ckpt_obj_try_fetch()
  - [Matt Helsley] Add cpp definitions for enums

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c             |    6 +++
 fs/pipe.c                      |   81 +++++++++++++++++++++++++++++++++++++++-
 include/linux/checkpoint_hdr.h |    2 +
 include/linux/pipe_fs_i.h      |    2 +
 4 files changed, 90 insertions(+), 1 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 1c294fe..c647bfd 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -599,6 +599,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_PIPE,
 		.restore = pipe_file_restore,
 	},
+	/* fifo */
+	{
+		.file_name = "FIFO",
+		.file_type = CKPT_FILE_FIFO,
+		.restore = fifo_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 747b2d7..8c79493 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -830,6 +830,8 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
 	return ret;
 }
 
+static struct vfsmount *pipe_mnt __read_mostly;
+
 #ifdef CONFIG_CHECKPOINT
 static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
 {
@@ -877,7 +879,11 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 	if (!h)
 		return -ENOMEM;
 
-	h->common.f_type = CKPT_FILE_PIPE;
+	/* fifo and pipe are similar at checkpoint, differ on restore */
+	if (inode->i_sb == pipe_mnt->mnt_sb)
+		h->common.f_type = CKPT_FILE_PIPE;
+	else
+		h->common.f_type = CKPT_FILE_FIFO;
 	h->pipe_objref = objref;
 
 	ret = checkpoint_file_common(ctx, file, &h->common);
@@ -887,6 +893,13 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 	if (ret < 0)
 		goto out;
 
+	/* FIFO also needs a file name */
+	if (h->common.f_type == CKPT_FILE_FIFO) {
+		ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
+		if (ret < 0)
+			goto out;
+	}
+
 	if (first)
 		ret = checkpoint_pipe(ctx, inode);
  out:
@@ -978,8 +991,74 @@ struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
 
 	return file;
 }
+
+struct file *fifo_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr;
+	struct file *file;
+	int first, ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_FIFO)
+		return ERR_PTR(-EINVAL);
+
+	if (h->pipe_objref <= 0)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * If ckpt_obj_try_fetch() returned ERR_PTR(-EINVAL), this is the
+	 * first time for this fifo.
+	 */
+	file = ckpt_obj_try_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE);
+	if (!IS_ERR(file))
+		first = 0;
+	else if (PTR_ERR(file) == -EINVAL)
+		first = 1;
+	else
+		return file;
+
+	/*
+	 * To avoid blocking, always open the fifo with O_RDWR;
+	 * then fix flags below.
+	 */
+	file = restore_open_fname(ctx, (ptr->f_flags & ~O_ACCMODE) | O_RDWR);
+	if (IS_ERR(file))
+		return file;
+
+	if ((ptr->f_flags & O_ACCMODE) == O_RDONLY) {
+		file->f_flags = (file->f_flags & ~O_ACCMODE) | O_RDONLY;
+		file->f_mode &= ~FMODE_WRITE;
+	} else if ((ptr->f_flags & O_ACCMODE) == O_WRONLY) {
+		file->f_flags = (file->f_flags & ~O_ACCMODE) | O_WRONLY;
+		file->f_mode &= ~FMODE_READ;
+	} else if ((ptr->f_flags & O_ACCMODE) != O_RDWR) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* first time: add to objhash and restore fifo's contents */
+	if (first) {
+		ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+
+		ret = restore_pipe(ctx, file);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = restore_file_common(ctx, file, ptr);
+ out:
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+
+	return file;
+}
 #else
 #define pipe_file_checkpoint  NULL
+#define fifo_file_checkpoint  NULL
 #endif /* CONFIG_CHECKPOINT */
 
 /*
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 885d06b..fce35f3 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -281,6 +281,8 @@ enum file_type {
 #define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
 	CKPT_FILE_PIPE,
 #define CKPT_FILE_PIPE CKPT_FILE_PIPE
+	CKPT_FILE_FIFO,
+#define CKPT_FILE_FIFO CKPT_FILE_FIFO
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index e526a12..596403e 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -160,6 +160,8 @@ struct ckpt_ctx;
 struct ckpt_hdr_file;
 extern struct file *pipe_file_restore(struct ckpt_ctx *ctx,
 				      struct ckpt_hdr_file *ptr);
+extern struct file *fifo_file_restore(struct ckpt_ctx *ctx,
+				      struct ckpt_hdr_file *ptr);
 #endif
 
 #endif
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (10 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 66/96] c/r: restore file->f_cred Oren Laadan
                     ` (4 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: containers, Andreas Dilger

From: Matt Helsley <matthltc@us.ibm.com>

We do not support restarting fsnotify watches. inotify and fanotify utilize
anon_inodes for pseudofiles which lack the .checkpoint operation. So they
already cleanly prevent checkpoint. dnotify on the other hand registers
its watches using fcntl() which does not require the userspace task to
hold an fd with an empty .checkpoint operation. This means userspace
could use dnotify to set up fsnotify watches which won't be re-created during
restart.

Check for fsnotify watches created with dnotify and reject checkpoint
if there are any.
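
To make the problem concrete, this is roughly how a task registers a
dnotify watch from user space. The sketch relies on the documented
fcntl(2) F_NOTIFY interface; the helper name watch_dir_with_dnotify() is
made up for illustration. Note that the fd involved is an ordinary
directory fd, which is why checking for a missing .checkpoint operation
alone does not catch this case.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: attach a dnotify watch to the directory at
 * 'path'.  On success returns 0 and stores the directory fd in
 * *fd_out; the watch lives as long as that fd stays open.  On failure
 * returns -1 and leaves *fd_out untouched.
 */
static int watch_dir_with_dnotify(const char *path, int *fd_out)
{
	int fd;

	assert(path != NULL && fd_out != NULL);

	/* a plain directory fd: nothing special about its file type */
	fd = open(path, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;

	/* the watch is attached with fcntl(), not via a dedicated fd */
	if (fcntl(fd, F_NOTIFY, DN_CREATE | DN_DELETE | DN_MULTISHOT) < 0) {
		close(fd);
		return -1;
	}

	*fd_out = fd;
	return 0;
}
```

By default the kernel reports dnotify events with SIGIO, so a real
caller would install a handler before registering the watch.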

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c          |    5 +++++
 fs/notify/dnotify/dnotify.c |   18 ++++++++++++++++++
 include/linux/dnotify.h     |    6 ++++++
 3 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index c647bfd..62feadd 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -207,6 +207,11 @@ int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
 		return -EBADF;
 	}
 
+	if (is_dnotify_attached(file)) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)dnotify unsupported\n", file);
+		return -EBADF;
+	}
+
 	ret = file->f_op->checkpoint(ctx, file);
 	if (ret < 0)
 		ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index 7e54e52..0a63bf6 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -289,6 +289,24 @@ static int attach_dn(struct dnotify_struct *dn, struct dnotify_mark_entry *dnent
 	return 0;
 }
 
+int is_dnotify_attached(struct file *filp)
+{
+	struct fsnotify_mark_entry *entry;
+	struct inode *inode;
+
+	inode = filp->f_path.dentry->d_inode;
+	if (!S_ISDIR(inode->i_mode))
+		return 0;
+
+	spin_lock(&inode->i_lock);
+	entry = fsnotify_find_mark_entry(dnotify_group, inode);
+	spin_unlock(&inode->i_lock);
+	if (!entry)
+		return 0;
+	fsnotify_put_mark(entry);
+	return 1;
+}
+
 /*
  * When a process calls fcntl to attach a dnotify watch to a directory it ends
  * up here.  Allocate both a mark for fsnotify to add and a dnotify_struct to be
diff --git a/include/linux/dnotify.h b/include/linux/dnotify.h
index ecc0628..b9ce13c 100644
--- a/include/linux/dnotify.h
+++ b/include/linux/dnotify.h
@@ -29,6 +29,7 @@ struct dnotify_struct {
 			    FS_MOVED_FROM | FS_MOVED_TO)
 
 extern void dnotify_flush(struct file *, fl_owner_t);
+extern int is_dnotify_attached(struct file *);
 extern int fcntl_dirnotify(int, struct file *, unsigned long);
 
 #else
@@ -37,6 +38,11 @@ static inline void dnotify_flush(struct file *filp, fl_owner_t id)
 {
 }
 
+static inline int is_dnotify_attached(struct file *filp)
+{
+	return 0;
+}
+
 static inline int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 {
 	return -EINVAL;
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 66/96] c/r: restore file->f_cred
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (11 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets Oren Laadan
                     ` (3 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: containers, Andreas Dilger

From: Serge E. Hallyn <serue@us.ibm.com>

Restore a file's f_cred.  This is set to the cred of the task doing
the open, so often it will be the same as that of the restarted task.

Changelog[v1]:
  - [Nathan Lynch] discard const from struct cred * where appropriate

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/files.c             |   18 ++++++++++++++++--
 include/linux/checkpoint_hdr.h |    2 +-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 62feadd..63a611f 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -148,15 +148,21 @@ static int scan_fds(struct files_struct *files, int **fdtable)
 int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 			   struct ckpt_hdr_file *h)
 {
+	struct cred *f_cred = (struct cred *) file->f_cred;
+
 	h->f_flags = file->f_flags;
 	h->f_mode = file->f_mode;
 	h->f_pos = file->f_pos;
 	h->f_version = file->f_version;
 
+	h->f_credref = checkpoint_obj(ctx, f_cred, CKPT_OBJ_CRED);
+	if (h->f_credref < 0)
+		return h->f_credref;
+
 	ckpt_debug("file %s credref %d", file->f_dentry->d_name.name,
 		h->f_credref);
 
-	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+	/* FIX: need also file->f_owner, etc */
 
 	return 0;
 }
@@ -522,8 +528,16 @@ int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 	fmode_t new_mode = file->f_mode;
 	fmode_t saved_mode = (__force fmode_t) h->f_mode;
 	int ret;
+	struct cred *cred;
+
+	/* FIX: need to restore owner etc */
 
-	/* FIX: need to restore uid, gid, owner etc */
+	/* restore the cred */
+	cred = ckpt_obj_fetch(ctx, h->f_credref, CKPT_OBJ_CRED);
+	if (IS_ERR(cred))
+		return PTR_ERR(cred);
+	put_cred(file->f_cred);
+	file->f_cred = get_cred(cred);
 
 	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
 	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cbccc81..729be96 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -432,7 +432,7 @@ struct ckpt_hdr_file {
 	__u32 f_type;
 	__u32 f_mode;
 	__u32 f_flags;
-	__u32 _padding;
+	__s32 f_credref;
 	__u64 f_pos;
 	__u64 f_version;
 } __attribute__((aligned(8)));
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (12 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 66/96] c/r: restore file->f_cred Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  0:59   ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan
                     ` (2 subsequent siblings)
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

From: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Save/restore epoll items during checkpoint/restart respectively.

Output the epoll header and items separately. Chunk the output much
like the pid array gets chunked. This ensures that even sub-order 0
allocations will enable checkpoint of large epoll sets. A subsequent
patch will do something similar for the restore path.
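
The chunked output described above can be sketched in userspace as
follows. This is only an illustration under stated assumptions: the
toy_* names, the in-memory sink and the chunk size of 4 items are all
hypothetical, and the source here is a flat array where the patch
walks an rbtree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CHUNK 4	/* items per chunk, arbitrary for this sketch */

/* Hypothetical item record, loosely modelled on an epoll watch. */
struct toy_item {
	unsigned long long data;
	unsigned int fd;
	int file_objref;
	unsigned int events;
};

/* Hypothetical output sink standing in for the checkpoint image. */
struct toy_sink {
	unsigned char buf[4096];
	size_t len;
};

static int toy_write(struct toy_sink *s, const void *p, size_t n)
{
	assert(s->len <= sizeof(s->buf));
	if (n > sizeof(s->buf) - s->len)
		return -1;
	memcpy(s->buf + s->len, p, n);
	s->len += n;
	return 0;
}

/*
 * Copy items into a small fixed buffer and flush it one chunk at a
 * time, so the staging memory stays bounded however many items there
 * are. Returns the number of chunk writes, or -1 on error.
 */
static int toy_write_chunked(struct toy_sink *s,
			     const struct toy_item *src, int num_items)
{
	struct toy_item chunk[TOY_CHUNK];
	int writes = 0;

	while (num_items > 0) {
		int n = num_items < TOY_CHUNK ? num_items : TOY_CHUNK;

		memcpy(chunk, src, n * sizeof(chunk[0]));
		if (toy_write(s, chunk, n * sizeof(chunk[0])) < 0)
			return -1;
		src += n;
		num_items -= n;
		writes++;
	}
	return writes;
}
```

With 10 items and a chunk of 4, the sketch issues three writes
(4 + 4 + 2) while never staging more than one chunk.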

On restart, we grab a piece of memory suitable to store a "chunk" of
items for input. Read the input one chunk at a time and add epoll
items for each item in the chunk.
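
The restart side can be sketched the same way. Again this is a
userspace illustration, not the patch's code: the toy_* names, the
in-memory source and the per-item callback are hypothetical, and the
exact byte-length check is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CHUNK 4	/* items per chunk, arbitrary for this sketch */

/* Hypothetical item record, loosely modelled on an epoll watch. */
struct toy_item {
	unsigned long long data;
	unsigned int fd;
	int file_objref;
	unsigned int events;
};

/* Hypothetical input source standing in for the checkpoint image. */
struct toy_source {
	const unsigned char *buf;
	size_t len;
	size_t pos;
};

static int toy_read(struct toy_source *s, void *dst, size_t n)
{
	assert(s->pos <= s->len);
	if (s->len - s->pos < n)
		return -1;
	memcpy(dst, s->buf + s->pos, n);
	s->pos += n;
	return 0;
}

/* Sample per-item callback: counts items, rejects bad objrefs. */
static int toy_count_item(const struct toy_item *it, void *arg)
{
	if (it->file_objref <= 0)
		return -1;
	(*(int *)arg)++;
	return 0;
}

/*
 * Check that the payload length matches the advertised item count,
 * then read one chunk at a time and hand each item to add().
 * Stops at the first error and returns it; returns 0 on success.
 */
static int toy_restore_chunked(struct toy_source *s, int num_items,
			       size_t payload_len,
			       int (*add)(const struct toy_item *, void *),
			       void *arg)
{
	struct toy_item chunk[TOY_CHUNK];
	int ret = 0;

	if (num_items < 0 ||
	    payload_len != (size_t)num_items * sizeof(struct toy_item))
		return -1;

	while (num_items > 0 && ret == 0) {
		int n = num_items < TOY_CHUNK ? num_items : TOY_CHUNK;
		int j;

		ret = toy_read(s, chunk, n * sizeof(chunk[0]));
		for (j = 0; ret == 0 && j < n; j++)
			ret = add(&chunk[j], arg);
		num_items -= n;
	}
	return ret;
}
```

Validating the count against the payload size before the loop means a
malformed image is rejected before any item has been added.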

Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Changelog [v19]:
  - [Oren Laadan] Fix broken compilation for no-c/r architectures
Changelog [v19-rc1]:
  - [Oren Laadan] Return -EBUSY (not BUG_ON) if fd is gone on restart
  - [Oren Laadan] Fix the chunk size instead of auto-tune

Changelog v5:
	Fix potential recursion during collect.
	Replace call to ckpt_obj_collect() with ckpt_collect_file().
		[Oren]
	Fix checkpoint leak detection when there are more items than
		expected.
	Cleanup/simplify error write paths. (will complicate in a later
		patch) [Oren]
	Remove files_deferq bits. [Oren]
	Remove extra newline. [Oren]
	Remove aggregate check on number of watches added. [Oren]
		This is OK since these will be done individually anyway.
	Remove check for negative objrefs during restart. [Oren]
	Fixup comment regarding race that indicates checkpoint leaks.
		[Oren]
	s/ckpt_read_obj/ckpt_read_buf_type/ [Oren]
		Patch for lots of epoll items follows.
	Moved sys_close(epfd) right under fget(). [Oren]
	Use CKPT_HDR_BUFFER rather than custom ckpt_read/write_*
		This makes it more similar to the pid array code. [Oren]
		It also simplifies the error recovery paths.
	Tested polling a pipe and 50,000 UNIX sockets.

Changelog v4: ckpt-v18
	Use files_deferq as submitted by Dan Smith
		Cleanup to only report >= 1 items when debugging.

Changelog v3: [unposted]
	Removed most of the TODOs -- the remainder will be removed by
		subsequent patches.
	Fixed missing ep_file_collect() [Serge]
	Rather than include checkpoint_hdr.h declare (but do not define)
		the two structs needed in eventpoll.h [Oren]
	Complain with ckpt_write_err() when we detect checkpoint obj
		leaks. [Oren]
	Remove redundant is_epoll_file() check in collect. [Oren]
	Move epfile_objref lookup to simplify error handling. [Oren]
	Simplify error handling with early return in
		ep_eventpoll_checkpoint(). [Oren]
	Cleaned up a comment. [Oren]
	Shorten CKPT_HDR_FILE_EPOLL_ITEMS (-FILE) [Oren]
		Renumbered to indicate that it follows the file table.
	Renamed the epoll struct in checkpoint_hdr.h [Oren]
		Also renamed substruct.
	Fixup return of empty ep_file_restore(). [Oren]
	Changed some error returns. [Oren]
	Changed some tests to BUG_ON(). [Oren]
	Factored out watch insert with epoll_ctl() into do_epoll_ctl().
		[Cedric, Oren]

Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/files.c             |    7 +
 fs/eventpoll.c                 |  334 ++++++++++++++++++++++++++++++++++++----
 include/linux/checkpoint_hdr.h |   18 ++
 include/linux/eventpoll.h      |   17 ++-
 4 files changed, 347 insertions(+), 29 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index bcc1fbf..6aaaf22 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -22,6 +22,7 @@
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <linux/eventpoll.h>
 #include <net/sock.h>
 
 
@@ -637,6 +638,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_TTY,
 		.restore = tty_file_restore,
 	},
+	/* epoll */
+	{
+		.file_name = "EPOLL",
+		.file_type = CKPT_FILE_EPOLL,
+		.restore = ep_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index bd056a5..7f1a091 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -39,6 +39,9 @@
 #include <asm/mman.h>
 #include <asm/atomic.h>
 
+#include <linux/checkpoint.h>
+#include <linux/deferqueue.h>
+
 /*
  * LOCKING:
  * There are three level of locking required by epoll :
@@ -671,10 +674,20 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
 	return pollflags != -1 ? pollflags : 0;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+static int ep_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#else
+#define ep_eventpoll_checkpoint NULL
+#define ep_file_collect NULL
+#endif
+
 /* File callbacks that implement the eventpoll file behaviour */
 static const struct file_operations eventpoll_fops = {
 	.release	= ep_eventpoll_release,
-	.poll		= ep_eventpoll_poll
+	.poll		= ep_eventpoll_poll,
+	.checkpoint 	= ep_eventpoll_checkpoint,
+	.collect 	= ep_file_collect,
 };
 
 /* Fast test to see if the file is an evenpoll file */
@@ -1226,35 +1239,18 @@ SYSCALL_DEFINE1(epoll_create, int, size)
  * the eventpoll file that enables the insertion/removal/change of
  * file descriptors inside the interest set.
  */
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
-		struct epoll_event __user *, event)
+int do_epoll_ctl(int op, int fd,
+		 struct file *file, struct file *tfile,
+		 struct epoll_event *epds)
 {
 	int error;
-	struct file *file, *tfile;
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct epoll_event epds;
-
-	error = -EFAULT;
-	if (ep_op_has_event(op) &&
-	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
-		goto error_return;
-
-	/* Get the "struct file *" for the eventpoll file */
-	error = -EBADF;
-	file = fget(epfd);
-	if (!file)
-		goto error_return;
-
-	/* Get the "struct file *" for the target file */
-	tfile = fget(fd);
-	if (!tfile)
-		goto error_fput;
 
 	/* The target file descriptor must support poll */
 	error = -EPERM;
 	if (!tfile->f_op || !tfile->f_op->poll)
-		goto error_tgt_fput;
+		return error;
 
 	/*
 	 * We have to check that the file structure underneath the file descriptor
@@ -1263,7 +1259,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	 */
 	error = -EINVAL;
 	if (file == tfile || !is_file_epoll(file))
-		goto error_tgt_fput;
+		return error;
 
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
@@ -1284,8 +1280,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	switch (op) {
 	case EPOLL_CTL_ADD:
 		if (!epi) {
-			epds.events |= POLLERR | POLLHUP;
-			error = ep_insert(ep, &epds, tfile, fd);
+			epds->events |= POLLERR | POLLHUP;
+			error = ep_insert(ep, epds, tfile, fd);
 		} else
 			error = -EEXIST;
 		break;
@@ -1297,15 +1293,46 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		break;
 	case EPOLL_CTL_MOD:
 		if (epi) {
-			epds.events |= POLLERR | POLLHUP;
-			error = ep_modify(ep, epi, &epds);
+			epds->events |= POLLERR | POLLHUP;
+			error = ep_modify(ep, epi, epds);
 		} else
 			error = -ENOENT;
 		break;
 	}
 	mutex_unlock(&ep->mtx);
 
-error_tgt_fput:
+	return error;
+}
+
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+		struct epoll_event __user *, event)
+{
+	int error;
+	struct file *file, *tfile;
+	struct epoll_event epds;
+
+	error = -EFAULT;
+	if (ep_op_has_event(op) &&
+	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
+		goto error_return;
+
+	/* Get the "struct file *" for the eventpoll file */
+	error = -EBADF;
+	file = fget(epfd);
+	if (!file)
+		goto error_return;
+
+	/* Get the "struct file *" for the target file */
+	tfile = fget(fd);
+	if (!tfile)
+		goto error_fput;
+
+	error = do_epoll_ctl(op, fd, file, tfile, &epds);
 	fput(tfile);
 error_fput:
 	fput(file);
@@ -1413,6 +1440,257 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 
 #endif /* HAVE_SET_RESTORE_SIGMASK */
 
+#ifdef CONFIG_CHECKPOINT
+static int ep_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct rb_node *rbp;
+	struct eventpoll *ep;
+	int ret = 0;
+
+	ep = file->private_data;
+	mutex_lock(&ep->mtx);
+	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
+		struct epitem *epi;
+
+		epi = rb_entry(rbp, struct epitem, rbn);
+		if (is_file_epoll(epi->ffd.file))
+			continue; /* Don't recurse */
+		ret = ckpt_collect_file(ctx, epi->ffd.file);
+		if (ret < 0)
+			break;
+	}
+	mutex_unlock(&ep->mtx);
+	return ret;
+}
+
+struct epoll_deferq_entry {
+	struct ckpt_ctx *ctx;
+	struct file *epfile;
+};
+
+#define CKPT_EPOLL_CHUNK  (8096 / (int) sizeof(struct ckpt_eventpoll_item))
+
+static int ep_items_checkpoint(void *data)
+{
+	struct epoll_deferq_entry *dq_entry = data;
+	struct ckpt_ctx *ctx;
+	struct ckpt_hdr_eventpoll_items *h;
+	struct ckpt_eventpoll_item *items;
+	struct rb_node *rbp;
+	struct eventpoll *ep;
+	__s32 epfile_objref;
+	int num_items = 0, ret;
+
+	ctx = dq_entry->ctx;
+
+	epfile_objref = ckpt_obj_lookup(ctx, dq_entry->epfile, CKPT_OBJ_FILE);
+	BUG_ON(epfile_objref <= 0);
+
+	ep = dq_entry->epfile->private_data;
+	mutex_lock(&ep->mtx);
+	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp))
+		num_items++;
+	mutex_unlock(&ep->mtx);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
+	if (!h)
+		return -ENOMEM;
+	h->num_items = num_items;
+	h->epfile_objref = epfile_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret || !num_items)
+		return ret;
+
+	ret = ckpt_write_obj_type(ctx, NULL, sizeof(*items)*num_items,
+				  CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+
+	items = kzalloc(sizeof(*items) * CKPT_EPOLL_CHUNK, GFP_KERNEL);
+	if (!items)
+		return -ENOMEM;
+
+	/*
+	 * Walk the rbtree copying items into the chunk of memory and then
+	 * writing them to the checkpoint image
+	 */
+	ret = 0;
+	mutex_lock(&ep->mtx);
+	rbp = rb_first(&ep->rbr);
+	while ((num_items > 0) && rbp) {
+		int n = min(num_items, CKPT_EPOLL_CHUNK);
+		int j;
+
+		for (j = 0; rbp && j < n; j++, rbp = rb_next(rbp)) {
+			struct epitem *epi;
+			int objref;
+
+			epi = rb_entry(rbp, struct epitem, rbn);
+			items[j].fd = epi->ffd.fd;
+			items[j].events = epi->event.events;
+			items[j].data = epi->event.data;
+			objref = ckpt_obj_lookup(ctx, epi->ffd.file,
+						 CKPT_OBJ_FILE);
+			if (objref <= 0)
+				goto unlock;
+			items[j].file_objref = objref;
+		}
+		ret = ckpt_kwrite(ctx, items, n*sizeof(*items));
+		if (ret < 0)
+			break;
+		num_items -= n;
+	}
+unlock:
+	mutex_unlock(&ep->mtx);
+	kfree(items);
+	if (num_items != 0 || (num_items == 0 && rbp))
+		ret = -EBUSY; /* extra item(s) -- checkpoint obj leak */
+	if (ret)
+		ckpt_err(ctx, ret, "Checkpointing epoll items.\n");
+	return ret;
+}
+
+static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file *h;
+	struct epoll_deferq_entry dq_entry;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+	h->f_type = CKPT_FILE_EPOLL;
+	ret = checkpoint_file_common(ctx, file, h);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * Defer saving the epoll items until all of the ffd.file pointers
+	 * have an objref; after the file table has been checkpointed.
+	 */
+	dq_entry.ctx = ctx;
+	dq_entry.epfile = file;
+	ret = deferqueue_add(ctx->files_deferq, &dq_entry,
+			     sizeof(dq_entry), ep_items_checkpoint, NULL);
+out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int ep_items_restore(void *data)
+{
+	struct ckpt_ctx *ctx = deferqueue_data_ptr(data);
+	struct ckpt_hdr_eventpoll_items *h;
+	struct ckpt_eventpoll_item *items = NULL;
+	struct eventpoll *ep;
+	struct file *epfile = NULL;
+	int ret, num_items;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+	num_items = h->num_items;
+	epfile = ckpt_obj_fetch(ctx, h->epfile_objref, CKPT_OBJ_FILE);
+	ckpt_hdr_put(ctx, h);
+
+	/* Make sure userspace didn't give us a ref to a non-epoll file. */
+	if (IS_ERR(epfile))
+		return PTR_ERR(epfile);
+	if (!is_file_epoll(epfile))
+		return -EINVAL;
+	if (!num_items)
+		return 0;
+
+	ret = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+	/* Make sure the items match the size we expect */
+	if (num_items != (ret / sizeof(*items)))
+		return -EINVAL;
+
+	items = kzalloc(sizeof(*items) * CKPT_EPOLL_CHUNK, GFP_KERNEL);
+	if (!items)
+		return -ENOMEM;
+
+	ep = epfile->private_data;
+
+	while (num_items > 0) {
+		int n = min(num_items, CKPT_EPOLL_CHUNK);
+		int j;
+
+		ret = ckpt_kread(ctx, items, n*sizeof(*items));
+		if (ret < 0)
+			break;
+
+		/* Restore the epoll items/watches */
+		for (j = 0; !ret && j < n; j++) {
+			struct epoll_event epev;
+			struct file *tfile;
+
+			tfile = ckpt_obj_fetch(ctx, items[j].file_objref,
+					       CKPT_OBJ_FILE);
+			if (IS_ERR(tfile)) {
+				ret = PTR_ERR(tfile);
+				goto out;
+			}
+			epev.events = items[j].events;
+			epev.data = items[j].data;
+			ret = do_epoll_ctl(EPOLL_CTL_ADD, items[j].fd,
+					   epfile, tfile, &epev);
+		}
+		num_items -= n;
+	}
+out:
+	kfree(items);
+	return ret;
+}
+
+struct file *ep_file_restore(struct ckpt_ctx *ctx,
+			     struct ckpt_hdr_file *h)
+{
+	struct file *epfile;
+	int epfd, ret;
+
+	if (h->h.type != CKPT_HDR_FILE ||
+	    h->h.len  != sizeof(*h) ||
+	    h->f_type != CKPT_FILE_EPOLL)
+		return ERR_PTR(-EINVAL);
+
+	epfd = sys_epoll_create1(h->f_flags & EPOLL_CLOEXEC);
+	if (epfd < 0)
+		return ERR_PTR(epfd);
+	epfile = fget(epfd);
+	sys_close(epfd); /* harmless even if an error occurred */
+	if (!epfile)  /* can happen with a malicious user */
+		return ERR_PTR(-EBUSY);
+
+	/*
+	 * Needed before we can properly restore the watches and enforce the
+	 * limit on watch numbers.
+	 */
+	ret = restore_file_common(ctx, epfile, h);
+	if (ret < 0)
+		goto fput_out;
+
+	/*
+	 * Defer restoring the epoll items until the file table is
+	 * fully restored. Ensures that valid file objrefs will resolve.
+	 */
+	ret = deferqueue_add_ptr(ctx->files_deferq, ctx, ep_items_restore, NULL);
+	if (ret < 0) {
+fput_out:
+		fput(epfile);
+		epfile = ERR_PTR(ret);
+	}
+	return epfile;
+}
+
+#endif /* CONFIG_CHECKPOINT */
+
 static int __init eventpoll_init(void)
 {
 	struct sysinfo si;
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 4fe63b1..b96d2dc 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -119,6 +119,8 @@ enum {
 #define CKPT_HDR_TTY CKPT_HDR_TTY
 	CKPT_HDR_TTY_LDISC,
 #define CKPT_HDR_TTY_LDISC CKPT_HDR_TTY_LDISC
+	CKPT_HDR_EPOLL_ITEMS,  /* must be after file-table */
+#define CKPT_HDR_EPOLL_ITEMS CKPT_HDR_EPOLL_ITEMS
 
 	CKPT_HDR_MM = 401,
 #define CKPT_HDR_MM CKPT_HDR_MM
@@ -477,6 +479,8 @@ enum file_type {
 #define CKPT_FILE_SOCKET CKPT_FILE_SOCKET
 	CKPT_FILE_TTY,
 #define CKPT_FILE_TTY CKPT_FILE_TTY
+	CKPT_FILE_EPOLL,
+#define CKPT_FILE_EPOLL CKPT_FILE_EPOLL
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
@@ -693,6 +697,20 @@ struct ckpt_hdr_file_socket {
 	__s32 sock_objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_eventpoll_items {
+	struct ckpt_hdr h;
+	__s32  epfile_objref;
+	__u32  num_items;
+} __attribute__((aligned(8)));
+
+/* Contained in a CKPT_HDR_BUFFER following the ckpt_hdr_eventpoll_items */
+struct ckpt_eventpoll_item {
+	__u64 data;
+	__u32 fd;
+	__s32 file_objref;
+	__u32 events;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index f6856a5..52282ae 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -56,6 +56,9 @@ struct file;
 
 
 #ifdef CONFIG_EPOLL
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+
 
 /* Used to initialize the epoll bits inside the "struct file" */
 static inline void eventpoll_init_file(struct file *file)
@@ -95,11 +98,23 @@ static inline void eventpoll_release(struct file *file)
 	eventpoll_release_file(file);
 }
 
-#else
 
+#ifdef CONFIG_CHECKPOINT
+extern struct file *ep_file_restore(struct ckpt_ctx *ctx,
+				    struct ckpt_hdr_file *h);
+#endif
+#else
+/* !defined(CONFIG_EPOLL) */
 static inline void eventpoll_init_file(struct file *file) {}
 static inline void eventpoll_release(struct file *file) {}
 
+#ifdef CONFIG_CHECKPOINT
+static inline struct file *ep_file_restore(struct ckpt_ctx *ctx,
+					   struct ckpt_hdr_file *ptr)
+{
+	return ERR_PTR(-ENOSYS);
+}
+#endif
 #endif
 
 #endif /* #ifdef __KERNEL__ */
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets
  2010-03-19  0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
                   ` (13 preceding siblings ...)
  2010-03-19  0:59 ` [C/R v20][PATCH 66/96] c/r: restore file->f_cred Oren Laadan
@ 2010-03-19  0:59 ` Oren Laadan
  2010-03-19  0:59 ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: containers, Matt Helsley, Andreas Dilger

From: Matt Helsley <matthltc@us.ibm.com>

Save/restore epoll items during checkpoint/restart respectively.

Output the epoll header and items separately. Chunk the output much
like the pid array gets chunked. This ensures that even sub-order 0
allocations will enable checkpoint of large epoll sets. A subsequent
patch will do something similar for the restore path.

On restart, we grab a piece of memory suitable to store a "chunk" of
items for input. Read the input one chunk at a time and add epoll
items for each item in the chunk.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>

Changelog [v19]:
  - [Oren Laadan] Fix broken compilation for no-c/r architectures
Changelog [v19-rc1]:
  - [Oren Laadan] Return -EBUSY (not BUG_ON) if fd is gone on restart
  - [Oren Laadan] Fix the chunk size instead of auto-tune

Changelog v5:
	Fix potential recursion during collect.
	Replace call to ckpt_obj_collect() with ckpt_collect_file().
		[Oren]
	Fix checkpoint leak detection when there are more items than
		expected.
	Cleanup/simplify error write paths. (will complicate in a later
		patch) [Oren]
	Remove files_deferq bits. [Oren]
	Remove extra newline. [Oren]
	Remove aggregate check on number of watches added. [Oren]
		This is OK since these will be done individually anyway.
	Remove check for negative objrefs during restart. [Oren]
	Fixup comment regarding race that indicates checkpoint leaks.
		[Oren]
	s/ckpt_read_obj/ckpt_read_buf_type/ [Oren]
		Patch for lots of epoll items follows.
	Moved sys_close(epfd) right under fget(). [Oren]
	Use CKPT_HDR_BUFFER rather than custom ckpt_read/write_*
		This makes it more similar to the pid array code. [Oren]
		It also simplifies the error recovery paths.
	Tested polling a pipe and 50,000 UNIX sockets.

Changelog v4: ckpt-v18
	Use files_deferq as submitted by Dan Smith
		Cleanup to only report >= 1 items when debugging.

Changelog v3: [unposted]
	Removed most of the TODOs -- the remainder will be removed by
		subsequent patches.
	Fixed missing ep_file_collect() [Serge]
	Rather than include checkpoint_hdr.h declare (but do not define)
		the two structs needed in eventpoll.h [Oren]
	Complain with ckpt_write_err() when we detect checkpoint obj
		leaks. [Oren]
	Remove redundant is_epoll_file() check in collect. [Oren]
	Move epfile_objref lookup to simplify error handling. [Oren]
	Simplify error handling with early return in
		ep_eventpoll_checkpoint(). [Oren]
	Cleaned up a comment. [Oren]
	Shorten CKPT_HDR_FILE_EPOLL_ITEMS (-FILE) [Oren]
		Renumbered to indicate that it follows the file table.
	Renamed the epoll struct in checkpoint_hdr.h [Oren]
		Also renamed substruct.
	Fixup return of empty ep_file_restore(). [Oren]
	Changed some error returns. [Oren]
	Changed some tests to BUG_ON(). [Oren]
	Factored out watch insert with epoll_ctl() into do_epoll_ctl().
		[Cedric, Oren]

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c             |    7 +
 fs/eventpoll.c                 |  334 ++++++++++++++++++++++++++++++++++++----
 include/linux/checkpoint_hdr.h |   18 ++
 include/linux/eventpoll.h      |   17 ++-
 4 files changed, 347 insertions(+), 29 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index bcc1fbf..6aaaf22 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -22,6 +22,7 @@
 #include <linux/deferqueue.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <linux/eventpoll.h>
 #include <net/sock.h>
 
 
@@ -637,6 +638,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_TTY,
 		.restore = tty_file_restore,
 	},
+	/* epoll */
+	{
+		.file_name = "EPOLL",
+		.file_type = CKPT_FILE_EPOLL,
+		.restore = ep_file_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index bd056a5..7f1a091 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -39,6 +39,9 @@
 #include <asm/mman.h>
 #include <asm/atomic.h>
 
+#include <linux/checkpoint.h>
+#include <linux/deferqueue.h>
+
 /*
  * LOCKING:
  * There are three level of locking required by epoll :
@@ -671,10 +674,20 @@ static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
 	return pollflags != -1 ? pollflags : 0;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+static int ep_file_collect(struct ckpt_ctx *ctx, struct file *file);
+#else
+#define ep_eventpoll_checkpoint NULL
+#define ep_file_collect NULL
+#endif
+
 /* File callbacks that implement the eventpoll file behaviour */
 static const struct file_operations eventpoll_fops = {
 	.release	= ep_eventpoll_release,
-	.poll		= ep_eventpoll_poll
+	.poll		= ep_eventpoll_poll,
+	.checkpoint 	= ep_eventpoll_checkpoint,
+	.collect 	= ep_file_collect,
 };
 
 /* Fast test to see if the file is an evenpoll file */
@@ -1226,35 +1239,18 @@ SYSCALL_DEFINE1(epoll_create, int, size)
  * the eventpoll file that enables the insertion/removal/change of
  * file descriptors inside the interest set.
  */
-SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
-		struct epoll_event __user *, event)
+int do_epoll_ctl(int op, int fd,
+		 struct file *file, struct file *tfile,
+		 struct epoll_event *epds)
 {
 	int error;
-	struct file *file, *tfile;
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct epoll_event epds;
-
-	error = -EFAULT;
-	if (ep_op_has_event(op) &&
-	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
-		goto error_return;
-
-	/* Get the "struct file *" for the eventpoll file */
-	error = -EBADF;
-	file = fget(epfd);
-	if (!file)
-		goto error_return;
-
-	/* Get the "struct file *" for the target file */
-	tfile = fget(fd);
-	if (!tfile)
-		goto error_fput;
 
 	/* The target file descriptor must support poll */
 	error = -EPERM;
 	if (!tfile->f_op || !tfile->f_op->poll)
-		goto error_tgt_fput;
+		return error;
 
 	/*
 	 * We have to check that the file structure underneath the file descriptor
@@ -1263,7 +1259,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	 */
 	error = -EINVAL;
 	if (file == tfile || !is_file_epoll(file))
-		goto error_tgt_fput;
+		return error;
 
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
@@ -1284,8 +1280,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	switch (op) {
 	case EPOLL_CTL_ADD:
 		if (!epi) {
-			epds.events |= POLLERR | POLLHUP;
-			error = ep_insert(ep, &epds, tfile, fd);
+			epds->events |= POLLERR | POLLHUP;
+			error = ep_insert(ep, epds, tfile, fd);
 		} else
 			error = -EEXIST;
 		break;
@@ -1297,15 +1293,46 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		break;
 	case EPOLL_CTL_MOD:
 		if (epi) {
-			epds.events |= POLLERR | POLLHUP;
-			error = ep_modify(ep, epi, &epds);
+			epds->events |= POLLERR | POLLHUP;
+			error = ep_modify(ep, epi, epds);
 		} else
 			error = -ENOENT;
 		break;
 	}
 	mutex_unlock(&ep->mtx);
 
-error_tgt_fput:
+	return error;
+}
+
+/*
+ * The following function implements the controller interface for
+ * the eventpoll file that enables the insertion/removal/change of
+ * file descriptors inside the interest set.
+ */
+SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+		struct epoll_event __user *, event)
+{
+	int error;
+	struct file *file, *tfile;
+	struct epoll_event epds;
+
+	error = -EFAULT;
+	if (ep_op_has_event(op) &&
+	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
+		goto error_return;
+
+	/* Get the "struct file *" for the eventpoll file */
+	error = -EBADF;
+	file = fget(epfd);
+	if (!file)
+		goto error_return;
+
+	/* Get the "struct file *" for the target file */
+	tfile = fget(fd);
+	if (!tfile)
+		goto error_fput;
+
+	error = do_epoll_ctl(op, fd, file, tfile, &epds);
 	fput(tfile);
 error_fput:
 	fput(file);
@@ -1413,6 +1440,257 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
 
 #endif /* HAVE_SET_RESTORE_SIGMASK */
 
+#ifdef CONFIG_CHECKPOINT
+static int ep_file_collect(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct rb_node *rbp;
+	struct eventpoll *ep;
+	int ret = 0;
+
+	ep = file->private_data;
+	mutex_lock(&ep->mtx);
+	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
+		struct epitem *epi;
+
+		epi = rb_entry(rbp, struct epitem, rbn);
+		if (is_file_epoll(epi->ffd.file))
+			continue; /* Don't recurse */
+		ret = ckpt_collect_file(ctx, epi->ffd.file);
+		if (ret < 0)
+			break;
+	}
+	mutex_unlock(&ep->mtx);
+	return ret;
+}
+
+struct epoll_deferq_entry {
+	struct ckpt_ctx *ctx;
+	struct file *epfile;
+};
+
+#define CKPT_EPOLL_CHUNK  (8096 / (int) sizeof(struct ckpt_eventpoll_item))
+
+static int ep_items_checkpoint(void *data)
+{
+	struct epoll_deferq_entry *dq_entry = data;
+	struct ckpt_ctx *ctx;
+	struct ckpt_hdr_eventpoll_items *h;
+	struct ckpt_eventpoll_item *items;
+	struct rb_node *rbp;
+	struct eventpoll *ep;
+	__s32 epfile_objref;
+	int num_items = 0, ret;
+
+	ctx = dq_entry->ctx;
+
+	epfile_objref = ckpt_obj_lookup(ctx, dq_entry->epfile, CKPT_OBJ_FILE);
+	BUG_ON(epfile_objref <= 0);
+
+	ep = dq_entry->epfile->private_data;
+	mutex_lock(&ep->mtx);
+	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp))
+		num_items++;
+	mutex_unlock(&ep->mtx);
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
+	if (!h)
+		return -ENOMEM;
+	h->num_items = num_items;
+	h->epfile_objref = epfile_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret || !num_items)
+		return ret;
+
+	ret = ckpt_write_obj_type(ctx, NULL, sizeof(*items)*num_items,
+				  CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+
+	items = kzalloc(sizeof(*items) * CKPT_EPOLL_CHUNK, GFP_KERNEL);
+	if (!items)
+		return -ENOMEM;
+
+	/*
+	 * Walk the rbtree copying items into the chunk of memory and then
+	 * writing them to the checkpoint image
+	 */
+	ret = 0;
+	mutex_lock(&ep->mtx);
+	rbp = rb_first(&ep->rbr);
+	while ((num_items > 0) && rbp) {
+		int n = min(num_items, CKPT_EPOLL_CHUNK);
+		int j;
+
+		for (j = 0; rbp && j < n; j++, rbp = rb_next(rbp)) {
+			struct epitem *epi;
+			int objref;
+
+			epi = rb_entry(rbp, struct epitem, rbn);
+			items[j].fd = epi->ffd.fd;
+			items[j].events = epi->event.events;
+			items[j].data = epi->event.data;
+			objref = ckpt_obj_lookup(ctx, epi->ffd.file,
+						 CKPT_OBJ_FILE);
+			if (objref <= 0)
+				goto unlock;
+			items[j].file_objref = objref;
+		}
+		ret = ckpt_kwrite(ctx, items, n*sizeof(*items));
+		if (ret < 0)
+			break;
+		num_items -= n;
+	}
+unlock:
+	mutex_unlock(&ep->mtx);
+	kfree(items);
+	if (num_items != 0 || (num_items == 0 && rbp))
+		ret = -EBUSY; /* extra item(s) -- checkpoint obj leak */
+	if (ret)
+		ckpt_err(ctx, ret, "Checkpointing epoll items.\n");
+	return ret;
+}
+
+static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file *h;
+	struct epoll_deferq_entry dq_entry;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+	h->f_type = CKPT_FILE_EPOLL;
+	ret = checkpoint_file_common(ctx, file, h);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->h);
+	if (ret < 0)
+		goto out;
+
+	/*
+	 * Defer saving the epoll items until all of the ffd.file pointers
+	 * have an objref; after the file table has been checkpointed.
+	 */
+	dq_entry.ctx = ctx;
+	dq_entry.epfile = file;
+	ret = deferqueue_add(ctx->files_deferq, &dq_entry,
+			     sizeof(dq_entry), ep_items_checkpoint, NULL);
+out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
+static int ep_items_restore(void *data)
+{
+	struct ckpt_ctx *ctx = deferqueue_data_ptr(data);
+	struct ckpt_hdr_eventpoll_items *h;
+	struct ckpt_eventpoll_item *items = NULL;
+	struct eventpoll *ep;
+	struct file *epfile = NULL;
+	int ret, num_items;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_EPOLL_ITEMS);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+	num_items = h->num_items;
+	epfile = ckpt_obj_fetch(ctx, h->epfile_objref, CKPT_OBJ_FILE);
+	ckpt_hdr_put(ctx, h);
+
+	/* Make sure userspace didn't give us a ref to a non-epoll file. */
+	if (IS_ERR(epfile))
+		return PTR_ERR(epfile);
+	if (!is_file_epoll(epfile))
+		return -EINVAL;
+	if (!num_items)
+		return 0;
+
+	ret = _ckpt_read_obj_type(ctx, NULL, 0, CKPT_HDR_BUFFER);
+	if (ret < 0)
+		return ret;
+	/* Make sure the items match the size we expect */
+	if (num_items != (ret / sizeof(*items)))
+		return -EINVAL;
+
+	items = kzalloc(sizeof(*items) * CKPT_EPOLL_CHUNK, GFP_KERNEL);
+	if (!items)
+		return -ENOMEM;
+
+	ep = epfile->private_data;
+
+	while (num_items > 0) {
+		int n = min(num_items, CKPT_EPOLL_CHUNK);
+		int j;
+
+		ret = ckpt_kread(ctx, items, n*sizeof(*items));
+		if (ret < 0)
+			break;
+
+		/* Restore the epoll items/watches */
+		for (j = 0; !ret && j < n; j++) {
+			struct epoll_event epev;
+			struct file *tfile;
+
+			tfile = ckpt_obj_fetch(ctx, items[j].file_objref,
+					       CKPT_OBJ_FILE);
+			if (IS_ERR(tfile)) {
+				ret = PTR_ERR(tfile);
+				goto out;
+			}
+			epev.events = items[j].events;
+			epev.data = items[j].data;
+			ret = do_epoll_ctl(EPOLL_CTL_ADD, items[j].fd,
+					   epfile, tfile, &epev);
+		}
+		num_items -= n;
+	}
+out:
+	kfree(items);
+	return ret;
+}
+
+struct file *ep_file_restore(struct ckpt_ctx *ctx,
+			     struct ckpt_hdr_file *h)
+{
+	struct file *epfile;
+	int epfd, ret;
+
+	if (h->h.type != CKPT_HDR_FILE ||
+	    h->h.len  != sizeof(*h) ||
+	    h->f_type != CKPT_FILE_EPOLL)
+		return ERR_PTR(-EINVAL);
+
+	epfd = sys_epoll_create1(h->f_flags & EPOLL_CLOEXEC);
+	if (epfd < 0)
+		return ERR_PTR(epfd);
+	epfile = fget(epfd);
+	sys_close(epfd); /* harmless even if an error occurred */
+	if (!epfile)  /* can happen with a malicious user */
+		return ERR_PTR(-EBUSY);
+
+	/*
+	 * Needed before we can properly restore the watches and enforce the
+	 * limit on watch numbers.
+	 */
+	ret = restore_file_common(ctx, epfile, h);
+	if (ret < 0)
+		goto fput_out;
+
+	/*
+	 * Defer restoring the epoll items until the file table is
+	 * fully restored. Ensures that valid file objrefs will resolve.
+	 */
+	ret = deferqueue_add_ptr(ctx->files_deferq, ctx, ep_items_restore, NULL);
+	if (ret < 0) {
+fput_out:
+		fput(epfile);
+		epfile = ERR_PTR(ret);
+	}
+	return epfile;
+}
+
+#endif /* CONFIG_CHECKPOINT */
+
 static int __init eventpoll_init(void)
 {
 	struct sysinfo si;
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 4fe63b1..b96d2dc 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -119,6 +119,8 @@ enum {
 #define CKPT_HDR_TTY CKPT_HDR_TTY
 	CKPT_HDR_TTY_LDISC,
 #define CKPT_HDR_TTY_LDISC CKPT_HDR_TTY_LDISC
+	CKPT_HDR_EPOLL_ITEMS,  /* must be after file-table */
+#define CKPT_HDR_EPOLL_ITEMS CKPT_HDR_EPOLL_ITEMS
 
 	CKPT_HDR_MM = 401,
 #define CKPT_HDR_MM CKPT_HDR_MM
@@ -477,6 +479,8 @@ enum file_type {
 #define CKPT_FILE_SOCKET CKPT_FILE_SOCKET
 	CKPT_FILE_TTY,
 #define CKPT_FILE_TTY CKPT_FILE_TTY
+	CKPT_FILE_EPOLL,
+#define CKPT_FILE_EPOLL CKPT_FILE_EPOLL
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
@@ -693,6 +697,20 @@ struct ckpt_hdr_file_socket {
 	__s32 sock_objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_eventpoll_items {
+	struct ckpt_hdr h;
+	__s32  epfile_objref;
+	__u32  num_items;
+} __attribute__((aligned(8)));
+
+/* Contained in a CKPT_HDR_BUFFER following the ckpt_hdr_eventpoll_items */
+struct ckpt_eventpoll_item {
+	__u64 data;
+	__u32 fd;
+	__s32 file_objref;
+	__u32 events;
+} __attribute__((aligned(8)));
+
 /* memory layout */
 struct ckpt_hdr_mm {
 	struct ckpt_hdr h;
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index f6856a5..52282ae 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -56,6 +56,9 @@ struct file;
 
 
 #ifdef CONFIG_EPOLL
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+
 
 /* Used to initialize the epoll bits inside the "struct file" */
 static inline void eventpoll_init_file(struct file *file)
@@ -95,11 +98,23 @@ static inline void eventpoll_release(struct file *file)
 	eventpoll_release_file(file);
 }
 
-#else
 
+#ifdef CONFIG_CHECKPOINT
+extern struct file *ep_file_restore(struct ckpt_ctx *ctx,
+				    struct ckpt_hdr_file *h);
+#endif
+#else
+/* !defined(CONFIG_EPOLL) */
 static inline void eventpoll_init_file(struct file *file) {}
 static inline void eventpoll_release(struct file *file) {}
 
+#ifdef CONFIG_CHECKPOINT
+static inline struct file *ep_file_restore(struct ckpt_ctx *ctx,
+					   struct ckpt_hdr_file *ptr)
+{
+	return ERR_PTR(-ENOSYS);
+}
+#endif
 #endif
 
 #endif /* #ifdef __KERNEL__ */
-- 
1.6.3.3
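
As a reading aid for the patch above: the epoll watches are queued on a deferqueue and only processed once the file table pass is complete, so that every file_objref can be resolved. Below is a minimal userspace sketch of that pattern. It is not the kernel's deferqueue API; struct deferq, deferq_add(), deferq_run(), struct restore_state and restore_watch() are hypothetical names made up for the illustration.

```c
#include <stddef.h>

#define DEFERQ_MAX 16	/* arbitrary capacity for this sketch */

typedef int (*deferq_fn)(void *data);

/* Hypothetical stand-in for a deferqueue: callbacks run later, in order. */
struct deferq {
	deferq_fn fn[DEFERQ_MAX];
	void *data[DEFERQ_MAX];
	size_t n;
};

/* Queue work to run once the main pass has finished. */
static int deferq_add(struct deferq *q, deferq_fn fn, void *data)
{
	if (q->n == DEFERQ_MAX)
		return -1;
	q->fn[q->n] = fn;
	q->data[q->n] = data;
	q->n++;
	return 0;
}

/* Run queued work in FIFO order; stop calling after the first failure. */
static int deferq_run(struct deferq *q)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < q->n && ret == 0; i++)
		ret = q->fn[i](q->data[i]);
	q->n = 0;
	return ret;
}

/* Toy file table: slot i counts as restored when files[i] is non-zero. */
struct restore_state {
	int files[4];
	int watch_target;	/* slot that an epoll item refers to */
};

/* Deferred step: fails unless its target file has been restored by now. */
static int restore_watch(void *data)
{
	struct restore_state *st = data;

	return st->files[st->watch_target] ? 0 : -1;
}
```

Calling restore_watch() directly while the table is still being filled fails, whereas queueing it with deferq_add() and calling deferq_run() after the last file is in place succeeds.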


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (13 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets Oren Laadan
@ 2010-03-19  0:59   ` Oren Laadan
  2010-03-19  1:00   ` [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3) Oren Laadan
  2010-03-19  1:00   ` [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace Oren Laadan
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  0:59 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: containers, Andreas Dilger

From: Matt Helsley <matthltc@us.ibm.com>

Save/restore eventfd files. These are anon_inodes just like epoll
but instead of a set of files to poll they are a 64-bit counter
and a flag value. Used for AIO.
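
To make the "64-bit counter and a flag value" concrete, here is a rough userspace analogue of what gets saved and restored. It only uses eventfd(2), read(2) and write(2); struct evfd_image, evfd_save() and evfd_restore() are hypothetical names for this sketch and are not part of the patch. It assumes a non-semaphore eventfd opened with EFD_NONBLOCK, and the caller has to pass in the flags because the sketch does not query them.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Hypothetical userspace record mirroring the two saved fields. */
struct evfd_image {
	uint64_t count;
	uint32_t flags;
};

/*
 * Sample the counter of a non-blocking, non-semaphore eventfd.
 * read() returns the counter and resets it to zero, so the value
 * is written back to leave the descriptor as it was.
 */
static int evfd_save(int fd, uint32_t flags, struct evfd_image *img)
{
	uint64_t val = 0;
	ssize_t n = read(fd, &val, sizeof(val));

	if (n == (ssize_t)sizeof(val)) {
		if (write(fd, &val, sizeof(val)) != (ssize_t)sizeof(val))
			return -1;
	} else if (n < 0 && errno == EAGAIN) {
		val = 0;	/* counter was already zero */
	} else {
		return -1;
	}
	img->count = val;
	img->flags = flags;
	return 0;
}

/* Create a fresh eventfd and load the saved counter into it. */
static int evfd_restore(const struct evfd_image *img)
{
	int fd = eventfd(0, (int)img->flags);

	if (fd < 0)
		return -1;
	if (img->count &&
	    write(fd, &img->count, sizeof(img->count)) !=
	    (ssize_t)sizeof(img->count)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The sketch loads the counter with a 64-bit write() rather than through the initial value of eventfd(), whose argument is only an unsigned int.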

[Oren Laadan] Added #ifdef's around checkpoint/restart to compile even
without CONFIG_CHECKPOINT

Changelog[v19]:
  - Fix broken compilation for architectures that don't support c/r

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c             |    7 +++++
 fs/eventfd.c                   |   55 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    8 ++++++
 include/linux/eventfd.h        |   12 ++++++++
 4 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 6aaaf22..4b551fe 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -23,6 +23,7 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 #include <linux/eventpoll.h>
+#include <linux/eventfd.h>
 #include <net/sock.h>
 
 
@@ -644,6 +645,12 @@ static struct restore_file_ops restore_file_ops[] = {
 		.file_type = CKPT_FILE_EPOLL,
 		.restore = ep_file_restore,
 	},
+	/* eventfd */
+	{
+		.file_name = "EVENTFD",
+		.file_type = CKPT_FILE_EVENTFD,
+		.restore = eventfd_restore,
+	},
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 7758cc3..f2785c0 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -18,6 +18,7 @@
 #include <linux/module.h>
 #include <linux/kref.h>
 #include <linux/eventfd.h>
+#include <linux/checkpoint.h>
 
 struct eventfd_ctx {
 	struct kref kref;
@@ -287,11 +288,65 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 	return res;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int eventfd_checkpoint(struct ckpt_ctx *ckpt_ctx, struct file *file)
+{
+	struct eventfd_ctx *ctx;
+	struct ckpt_hdr_file_eventfd *h;
+	int ret = -ENOMEM;
+
+	h = ckpt_hdr_get_type(ckpt_ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+	h->common.f_type = CKPT_FILE_EVENTFD;
+	ret = checkpoint_file_common(ckpt_ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ctx = file->private_data;
+	h->count = ctx->count;
+	h->flags = ctx->flags;
+	ret = ckpt_write_obj(ckpt_ctx, &h->common.h);
+out:
+	ckpt_hdr_put(ckpt_ctx, h);
+	return ret;
+}
+
+struct file *eventfd_restore(struct ckpt_ctx *ckpt_ctx,
+			     struct ckpt_hdr_file *ptr)
+{
+	struct ckpt_hdr_file_eventfd *h = (struct ckpt_hdr_file_eventfd *) ptr;
+	struct file *evfile;
+	int evfd, ret;
+
+	/* Already know type == CKPT_HDR_FILE and f_type == CKPT_FILE_EVENTFD */
+	if (h->common.h.len != sizeof(*h))
+		return ERR_PTR(-EINVAL);
+
+	evfd = sys_eventfd2(h->count, h->flags);
+	if (evfd < 0)
+		return ERR_PTR(evfd);
+	evfile = fget(evfd);
+	sys_close(evfd);
+	if (!evfile)
+		return ERR_PTR(-EBUSY);
+
+	ret = restore_file_common(ckpt_ctx, evfile, &h->common);
+	if (ret < 0) {
+		fput(evfile);
+		return ERR_PTR(ret);
+	}
+	return evfile;
+}
+#else
+#define eventfd_checkpoint NULL
+#endif
+
 static const struct file_operations eventfd_fops = {
 	.release	= eventfd_release,
 	.poll		= eventfd_poll,
 	.read		= eventfd_read,
 	.write		= eventfd_write,
+	.checkpoint     = eventfd_checkpoint,
 };
 
 /**
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b96d2dc..0b36430 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -481,6 +481,8 @@ enum file_type {
 #define CKPT_FILE_TTY CKPT_FILE_TTY
 	CKPT_FILE_EPOLL,
 #define CKPT_FILE_EPOLL CKPT_FILE_EPOLL
+	CKPT_FILE_EVENTFD,
+#define CKPT_FILE_EVENTFD CKPT_FILE_EVENTFD
 	CKPT_FILE_MAX
 #define CKPT_FILE_MAX CKPT_FILE_MAX
 };
@@ -505,6 +507,12 @@ struct ckpt_hdr_file_pipe {
 	__s32 pipe_objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_file_eventfd {
+	struct ckpt_hdr_file common;
+	__u64 count;
+	__u32 flags;
+} __attribute__((aligned(8)));
+
 /* socket */
 struct ckpt_hdr_socket {
 	struct ckpt_hdr h;
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 91bb4f2..2ce8525 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -39,6 +39,16 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
 				  __u64 *cnt);
 
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+
+struct file *eventfd_restore(struct ckpt_ctx *ckpt_ctx,
+			     struct ckpt_hdr_file *ptr);
+#else
+#define eventfd_restore NULL
+#endif
+
 #else /* CONFIG_EVENTFD */
 
 /*
@@ -77,6 +87,8 @@ static inline int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx,
 	return -ENOSYS;
 }
 
+#define eventfd_restore NULL
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3)
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (14 preceding siblings ...)
  2010-03-19  0:59   ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan
@ 2010-03-19  1:00   ` Oren Laadan
  2010-03-19  1:00   ` [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace Oren Laadan
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  1:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: containers, Andreas Dilger

Checkpoint and restore task->fs.  Tasks sharing task->fs will
share them again after restart.
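
The sharing falls out of the object hash: a shared fs_struct is recorded once, and every task that uses it stores the same objref. A minimal sketch of that idea follows, assuming a hypothetical fixed-size table (struct objtab, objtab_ref() and objtab_fetch() are illustration names, not the objhash API).

```c
#include <stddef.h>

#define OBJ_MAX 8	/* arbitrary capacity for this sketch */

/* Hypothetical object table: slot index + 1 is the objref. */
struct objtab {
	const void *ptr[OBJ_MAX];
	int n;
};

/* Checkpoint side: reuse the objref of a known pointer, else assign one. */
static int objtab_ref(struct objtab *t, const void *ptr)
{
	int i;

	for (i = 0; i < t->n; i++)
		if (t->ptr[i] == ptr)
			return i + 1;
	if (t->n == OBJ_MAX)
		return -1;
	t->ptr[t->n++] = ptr;
	return t->n;
}

/* Restart side: resolve an objref back to the one restored object. */
static const void *objtab_fetch(const struct objtab *t, int objref)
{
	if (objref < 1 || objref > t->n)
		return NULL;
	return t->ptr[objref - 1];
}
```

Two tasks that recorded the same objref get the same pointer back from objtab_fetch(), which is the property the restart path relies on to re-share task->fs.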

Original patch by Serge Hallyn <serue@us.ibm.com>

Changelog:
  Jan 25: [orenl] Addressed comments by .. myself:
    - add leak detection
    - change order of save/restore of chroot and cwd
    - save/restore fs only after file-table and mm
    - rename functions to adopt existing conventions
  Dec 28: [serge] Addressed comments by Oren (and Dave)
    - define and use {get,put}_fs_struct helpers
    - fix locking comment
    - define ckpt_read_fname() and use in checkpoint/files.c
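
On the "leak detection" item: the idea, as I understand it, is to compare an object's live user count with the number of references found while collecting the tasks. The sketch below illustrates the comparison only. struct tracked_obj and obj_leak_check() are hypothetical, and the "+ 1" for a reference held by the table itself is an assumption of this sketch rather than something taken from the patch.

```c
#include <errno.h>

/* Hypothetical per-object bookkeeping for the sketch. */
struct tracked_obj {
	int users;	/* live reference count on the object */
	int collected;	/* references found while walking the tasks */
};

/*
 * Assumption: the table itself holds one reference, so a fully contained
 * object has users == collected + 1. Anything above that means someone
 * outside the checkpointed set still holds the object.
 */
static int obj_leak_check(const struct tracked_obj *objs, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (objs[i].users > objs[i].collected + 1)
			return -EBUSY;
	return 0;
}
```

A count that is too large is reported as -EBUSY, the same error the epoll code uses for extra items.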

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c             |  203 +++++++++++++++++++++++++++++++++++++++-
 checkpoint/objhash.c           |   34 +++++++
 checkpoint/process.c           |   17 ++++
 fs/fs_struct.c                 |   21 ++++
 fs/open.c                      |   58 +++++++-----
 include/linux/checkpoint.h     |    8 ++-
 include/linux/checkpoint_hdr.h |   12 +++
 include/linux/fs.h             |    4 +
 include/linux/fs_struct.h      |    2 +
 9 files changed, 331 insertions(+), 28 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 4b551fe..7855bae 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -15,6 +15,9 @@
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/fs_struct.h>
+#include <linux/fs.h>
 #include <linux/fdtable.h>
 #include <linux/fsnotify.h>
 #include <linux/pipe_fs_i.h>
@@ -374,6 +377,62 @@ int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 	return objref;
 }
 
+int checkpoint_obj_fs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct fs_struct *fs;
+	int fs_objref;
+
+	task_lock(t);
+	fs = t->fs;
+	get_fs_struct(fs);
+	task_unlock(t);
+
+	fs_objref = checkpoint_obj(ctx, fs, CKPT_OBJ_FS);
+	put_fs_struct(fs);
+
+	return fs_objref;
+}
+
+/* called with fs refcount bumped so it won't disappear */
+static int do_checkpoint_fs(struct ckpt_ctx *ctx, struct fs_struct *fs)
+{
+	struct ckpt_hdr_fs *h;
+	struct fs_struct *fscopy;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FS);
+	if (!h)
+		return -ENOMEM;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret)
+		return ret;
+
+	fscopy = copy_fs_struct(fs);
+	if (!fscopy)
+		return -ENOMEM;
+
+	ret = checkpoint_fname(ctx, &fscopy->pwd, &ctx->root_fs_path);
+	if (ret < 0) {
+		ckpt_err(ctx, ret, "%(T)writing path of cwd");
+		goto out;
+	}
+	ret = checkpoint_fname(ctx, &fscopy->root, &ctx->root_fs_path);
+	if (ret < 0) {
+		ckpt_err(ctx, ret, "%(T)writing path of fs root");
+		goto out;
+	}
+	ret = 0;
+ out:
+	free_fs_struct(fscopy);
+	return ret;
+}
+
+int checkpoint_fs(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_fs(ctx, (struct fs_struct *) ptr);
+}
+
 /***********************************************************************
  * Collect
  */
@@ -460,10 +519,41 @@ int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+int ckpt_collect_fs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct fs_struct *fs;
+	int ret;
+
+	task_lock(t);
+	fs = t->fs;
+	get_fs_struct(fs);
+	task_unlock(t);
+
+	ret = ckpt_obj_collect(ctx, fs, CKPT_OBJ_FS);
+
+	put_fs_struct(fs);
+	return ret;
+}
+
 /**************************************************************************
  * Restart
  */
 
+static int ckpt_read_fname(struct ckpt_ctx *ctx, char **fname)
+{
+	int len;
+
+	len = ckpt_read_payload(ctx, (void **) fname,
+				PATH_MAX, CKPT_HDR_FILE_NAME);
+	if (len < 0)
+		return len;
+
+	(*fname)[len - 1] = '\0';	/* always play it safe */
+	ckpt_debug("read filename '%s'\n", *fname);
+
+	return len;
+}
+
 /**
  * restore_open_fname - read a file name and open a file
  * @ctx: checkpoint context
@@ -479,11 +569,9 @@ struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
 	if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
 		return ERR_PTR(-EINVAL);
 
-	len = ckpt_read_payload(ctx, (void **) &fname,
-				PATH_MAX, CKPT_HDR_FILE_NAME);
+	len = ckpt_read_fname(ctx, &fname);
 	if (len < 0)
 		return ERR_PTR(len);
-	fname[len - 1] = '\0';	/* always play if safe */
 	ckpt_debug("fname '%s' flags %#x\n", fname, flags);
 
 	file = filp_open(fname, flags, 0);
@@ -819,3 +907,112 @@ int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref)
 
 	return 0;
 }
+
+/*
+ * Called by task restore code to set the restarted task's
+ * current->fs to an entry on the hash
+ */
+int restore_obj_fs(struct ckpt_ctx *ctx, int fs_objref)
+{
+	struct fs_struct *newfs, *oldfs;
+
+	newfs = ckpt_obj_fetch(ctx, fs_objref, CKPT_OBJ_FS);
+	if (IS_ERR(newfs))
+		return PTR_ERR(newfs);
+
+	task_lock(current);
+	get_fs_struct(newfs);
+	oldfs = current->fs;
+	current->fs = newfs;
+	task_unlock(current);
+	put_fs_struct(oldfs);
+
+	return 0;
+}
+
+static int restore_chroot(struct ckpt_ctx *ctx, struct fs_struct *fs, char *name)
+{
+	struct nameidata nd;
+	int ret;
+
+	ckpt_debug("attempting chroot to %s\n", name);
+	ret = path_lookup(name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &nd);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Opening chroot dir %s", name);
+		return ret;
+	}
+	ret = do_chroot(fs, &nd.path);
+	path_put(&nd.path);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Setting chroot %s", name);
+		return ret;
+	}
+	return 0;
+}
+
+static int restore_cwd(struct ckpt_ctx *ctx, struct fs_struct *fs, char *name)
+{
+	struct nameidata nd;
+	int ret;
+
+	ckpt_debug("attempting chdir to %s\n", name);
+	ret = path_lookup(name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &nd);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Opening cwd %s", name);
+		return ret;
+	}
+	ret = do_chdir(fs, &nd.path);
+	path_put(&nd.path);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Setting cwd %s", name);
+		return ret;
+	}
+	return 0;
+}
+
+/*
+ * Called by objhash when it runs into a CKPT_OBJ_FS entry. Creates
+ * an fs_struct with desired chroot/cwd and places it in the hash.
+ */
+static struct fs_struct *do_restore_fs(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_fs *h;
+	struct fs_struct *fs;
+	char *path;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FS);
+	if (IS_ERR(h))
+		return ERR_CAST(h);
+	ckpt_hdr_put(ctx, h);
+
+	fs = copy_fs_struct(current->fs);
+	if (!fs)
+		return ERR_PTR(-ENOMEM);
+
+	ret = ckpt_read_fname(ctx, &path);
+	if (ret < 0)
+		goto out;
+	ret = restore_cwd(ctx, fs, path);
+	kfree(path);
+	if (ret)
+		goto out;
+
+	ret = ckpt_read_fname(ctx, &path);
+	if (ret < 0)
+		goto out;
+	ret = restore_chroot(ctx, fs, path);
+	kfree(path);
+
+out:
+	if (ret) {
+		free_fs_struct(fs);
+		return ERR_PTR(ret);
+	}
+	return fs;
+}
+
+void *restore_fs(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_fs(ctx);
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 84bceec..5c4749d 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -15,6 +15,7 @@
 #include <linux/hash.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/fs_struct.h>
 #include <linux/sched.h>
 #include <linux/ipc_namespace.h>
 #include <linux/user_namespace.h>
@@ -126,6 +127,29 @@ static int obj_mm_users(void *ptr)
 	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
 }
 
+static int obj_fs_grab(void *ptr)
+{
+	get_fs_struct((struct fs_struct *) ptr);
+	return 0;
+}
+
+static void obj_fs_drop(void *ptr, int lastref)
+{
+	put_fs_struct((struct fs_struct *) ptr);
+}
+
+static int obj_fs_users(void *ptr)
+{
+	/*
+	 * It's safe to not use fs->lock because the fs is referenced.
+	 * It's also sufficient for leak detection: with no leak the
+	 * count can't change; with a leak it will be too big already
+	 * (even if it's about to grow), and if it's about to shrink
+	 * then it's as if we sampled the count a bit earlier.
+	 */
+	return ((struct fs_struct *) ptr)->users;
+}
+
 static int obj_sighand_grab(void *ptr)
 {
 	atomic_inc(&((struct sighand_struct *) ptr)->count);
@@ -330,6 +354,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
+	/* fs object */
+	{
+		.obj_name = "FS",
+		.obj_type = CKPT_OBJ_FS,
+		.ref_drop = obj_fs_drop,
+		.ref_grab = obj_fs_grab,
+		.ref_users = obj_fs_users,
+		.checkpoint = checkpoint_fs,
+		.restore = restore_fs,
+	},
 	/* sighand object */
 	{
 		.obj_name = "SIGHAND",
diff --git a/checkpoint/process.c b/checkpoint/process.c
index e0ef795..f917112 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -232,6 +232,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	struct ckpt_hdr_task_objs *h;
 	int files_objref;
 	int mm_objref;
+	int fs_objref;
 	int sighand_objref;
 	int signal_objref;
 	int first, ret;
@@ -272,6 +273,13 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return mm_objref;
 	}
 
+	/* note: this must come *after* file-table and mm */
+	fs_objref = checkpoint_obj_fs(ctx, t);
+	if (fs_objref < 0) {
+		ckpt_err(ctx, fs_objref, "%(T)process fs\n");
+		return fs_objref;
+	}
+
 	sighand_objref = checkpoint_obj_sighand(ctx, t);
 	ckpt_debug("sighand: objref %d\n", sighand_objref);
 	if (sighand_objref < 0) {
@@ -299,6 +307,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -ENOMEM;
 	h->files_objref = files_objref;
 	h->mm_objref = mm_objref;
+	h->fs_objref = fs_objref;
 	h->sighand_objref = sighand_objref;
 	h->signal_objref = signal_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
@@ -477,6 +486,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_collect_mm(ctx, t);
 	if (ret < 0)
 		return ret;
+	ret = ckpt_collect_fs(ctx, t);
+	if (ret < 0)
+		return ret;
 	ret = ckpt_collect_sighand(ctx, t);
 
 	return ret;
@@ -645,6 +657,11 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 
+	ret = restore_obj_fs(ctx, h->fs_objref);
+	ckpt_debug("fs: ret %d (%p)\n", ret, current->fs);
+	if (ret < 0)
+		goto out;
+
 	ret = restore_obj_sighand(ctx, h->sighand_objref);
 	ckpt_debug("sighand: ret %d (%p)\n", ret, current->sighand);
 	if (ret < 0)
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index eee0590..2a4c6f5 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -6,6 +6,27 @@
 #include <linux/fs_struct.h>
 
 /*
+ * call with owning task locked
+ */
+void get_fs_struct(struct fs_struct *fs)
+{
+	write_lock(&fs->lock);
+	fs->users++;
+	write_unlock(&fs->lock);
+}
+
+void put_fs_struct(struct fs_struct *fs)
+{
+	int kill;
+
+	write_lock(&fs->lock);
+	kill = !--fs->users;
+	write_unlock(&fs->lock);
+	if (kill)
+		free_fs_struct(fs);
+}
+
+/*
  * Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
  * It can block.
  */
diff --git a/fs/open.c b/fs/open.c
index 040cef7..62fc70c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -527,6 +527,18 @@ SYSCALL_DEFINE2(access, const char __user *, filename, int, mode)
 	return sys_faccessat(AT_FDCWD, filename, mode);
 }
 
+int do_chdir(struct fs_struct *fs, struct path *path)
+{
+	int error;
+
+	error = inode_permission(path->dentry->d_inode, MAY_EXEC | MAY_ACCESS);
+	if (error)
+		return error;
+
+	set_fs_pwd(fs, path);
+	return 0;
+}
+
 SYSCALL_DEFINE1(chdir, const char __user *, filename)
 {
 	struct path path;
@@ -534,17 +546,10 @@ SYSCALL_DEFINE1(chdir, const char __user *, filename)
 
 	error = user_path_dir(filename, &path);
 	if (error)
-		goto out;
-
-	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_ACCESS);
-	if (error)
-		goto dput_and_out;
-
-	set_fs_pwd(current->fs, &path);
+		return error;
 
-dput_and_out:
+	error = do_chdir(current->fs, &path);
 	path_put(&path);
-out:
 	return error;
 }
 
@@ -574,31 +579,36 @@ out:
 	return error;
 }
 
-SYSCALL_DEFINE1(chroot, const char __user *, filename)
+int do_chroot(struct fs_struct *fs, struct path *path)
 {
-	struct path path;
 	int error;
 
-	error = user_path_dir(filename, &path);
+	error = inode_permission(path->dentry->d_inode, MAY_EXEC | MAY_ACCESS);
 	if (error)
-		goto out;
+		return error;
+
+	if (!capable(CAP_SYS_CHROOT))
+		return -EPERM;
 
-	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_ACCESS);
+	error = security_path_chroot(path);
 	if (error)
-		goto dput_and_out;
+		return error;
 
-	error = -EPERM;
-	if (!capable(CAP_SYS_CHROOT))
-		goto dput_and_out;
-	error = security_path_chroot(&path);
+	set_fs_root(fs, path);
+	return 0;
+}
+
+SYSCALL_DEFINE1(chroot, const char __user *, filename)
+{
+	struct path path;
+	int error;
+
+	error = user_path_dir(filename, &path);
 	if (error)
-		goto dput_and_out;
+		return error;
 
-	set_fs_root(current->fs, &path);
-	error = 0;
-dput_and_out:
+	error = do_chroot(current->fs, &path);
 	path_put(&path);
-out:
 	return error;
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ca91405..3e0937a 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,7 +10,7 @@
  *  distribution for more details.
  */
 
-#define CHECKPOINT_VERSION  3
+#define CHECKPOINT_VERSION  4
 
 /* checkpoint user flags */
 #define CHECKPOINT_SUBTREE	0x1
@@ -236,6 +236,12 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 			       struct ckpt_hdr_file *h);
 
+extern int ckpt_collect_fs(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_fs(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_fs(struct ckpt_ctx *ctx, int fs_objref);
+extern int checkpoint_fs(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_fs(struct ckpt_ctx *ctx);
+
 /* credentials */
 extern int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr);
 extern int checkpoint_user(struct ckpt_ctx *ctx, void *ptr);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0b36430..4dc852d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -131,6 +131,9 @@ enum {
 	CKPT_HDR_MM_CONTEXT,
 #define CKPT_HDR_MM_CONTEXT CKPT_HDR_MM_CONTEXT
 
+	CKPT_HDR_FS = 451,  /* must be after file-table, mm */
+#define CKPT_HDR_FS CKPT_HDR_FS
+
 	CKPT_HDR_IPC = 501,
 #define CKPT_HDR_IPC CKPT_HDR_IPC
 	CKPT_HDR_IPC_SHM,
@@ -201,6 +204,8 @@ enum obj_type {
 #define CKPT_OBJ_FILE CKPT_OBJ_FILE
 	CKPT_OBJ_MM,
 #define CKPT_OBJ_MM CKPT_OBJ_MM
+	CKPT_OBJ_FS,
+#define CKPT_OBJ_FS CKPT_OBJ_FS
 	CKPT_OBJ_SIGHAND,
 #define CKPT_OBJ_SIGHAND CKPT_OBJ_SIGHAND
 	CKPT_OBJ_SIGNAL,
@@ -416,6 +421,7 @@ struct ckpt_hdr_task_objs {
 
 	__s32 files_objref;
 	__s32 mm_objref;
+	__s32 fs_objref;
 	__s32 sighand_objref;
 	__s32 signal_objref;
 } __attribute__((aligned(8)));
@@ -453,6 +459,12 @@ enum restart_block_type {
 };
 
 /* file system */
+struct ckpt_hdr_fs {
+	struct ckpt_hdr h;
+	/* char *fs_pwd */
+	/* char *fs_root */
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_file_table {
 	struct ckpt_hdr h;
 	__s32 fdt_nfds;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7902a51..a1525aa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1818,6 +1818,10 @@ extern void drop_collected_mounts(struct vfsmount *);
 
 extern int vfs_statfs(struct dentry *, struct kstatfs *);
 
+struct fs_struct;
+extern int do_chdir(struct fs_struct *fs, struct path *path);
+extern int do_chroot(struct fs_struct *fs, struct path *path);
+
 extern int current_umask(void);
 
 /* /sys/fs */
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 78a05bf..a73cbcb 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -20,5 +20,7 @@ extern struct fs_struct *copy_fs_struct(struct fs_struct *);
 extern void free_fs_struct(struct fs_struct *);
 extern void daemonize_fs_struct(void);
 extern int unshare_fs_struct(void);
+extern void get_fs_struct(struct fs_struct *);
+extern void put_fs_struct(struct fs_struct *);
 
 #endif /* _LINUX_FS_STRUCT_H */
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3)
  2010-03-19  0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
                   ` (15 preceding siblings ...)
  2010-03-19  0:59 ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan
@ 2010-03-19  1:00 ` Oren Laadan
  2010-03-19  1:00 ` [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace Oren Laadan
  17 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  1:00 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: containers, Matt Helsley, Andreas Dilger, Oren Laadan, Serge Hallyn

Checkpoint and restore task->fs.  Tasks sharing task->fs will
share them again after restart.
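
For readers following along, the sharing behaviour described above can be
modelled in user space roughly as follows. This is only a sketch, not the
kernel code: `fs_model`, `obj_table`, `obj_lookup_or_add()` and
`task_attach_fs()` are hypothetical names, and the table stands in for the
objhash. At checkpoint, tasks that share one fs record the same objref; at
restart, equal objrefs resolve to the same object with an extra reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct fs_struct: only a user count. */
struct fs_model {
	int users;
};

#define MAX_OBJS 16

/* Hypothetical object table: the slot index + 1 serves as the objref. */
struct obj_table {
	struct fs_model *objs[MAX_OBJS];
	int count;
};

/*
 * Checkpoint side: return the objref for @fs, adding it on first sight.
 * Tasks that share one fs therefore record the same objref.
 */
static int obj_lookup_or_add(struct obj_table *tbl, struct fs_model *fs)
{
	int i;

	for (i = 0; i < tbl->count; i++)
		if (tbl->objs[i] == fs)
			return i + 1;
	if (tbl->count == MAX_OBJS)
		return -1;
	tbl->objs[tbl->count++] = fs;
	return tbl->count;
}

/*
 * Restart side: look up @objref and take a reference, so that tasks
 * carrying equal objrefs end up pointing at the same object.
 */
static struct fs_model *task_attach_fs(struct obj_table *tbl, int objref)
{
	struct fs_model *fs;

	if (objref < 1 || objref > tbl->count)
		return NULL;
	fs = tbl->objs[objref - 1];
	fs->users++;
	return fs;
}
```

Two tasks restored with the same objref get the same pointer back, which is
the property the patch relies on; locking is deliberately left out of the
model.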

Original patch by Serge Hallyn <serue@us.ibm.com>

Changelog:
  Jan 25: [orenl] Addressed comments by .. myself:
    - add leak detection
    - change order of save/restore of chroot and cwd
    - save/restore fs only after file-table and mm
    - rename functions to adopt existing conventions
  Dec 28: [serge] Addressed comments by Oren (and Dave)
    - define and use {get,put}_fs_struct helpers
    - fix locking comment
    - define ckpt_read_fname() and use in checkpoint/files.c
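
The filename helper mentioned in the changelog reads a bounded payload and
forces NUL termination. A user-space model of that idea might look like the
sketch below. `read_fname_model()` is a hypothetical name, the source buffer
stands in for the checkpoint image, and the zero-length check is an addition
of this sketch rather than something taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#ifndef PATH_MAX
#define PATH_MAX 4096
#endif

/*
 * Hypothetical user-space model: copy a @len-byte name payload from
 * @src into a fresh buffer, rejecting empty or oversized payloads and
 * forcing the last byte to NUL so the result is always a C string.
 * Returns @len on success or a negative errno value.
 */
static int read_fname_model(const char *src, size_t len, char **fname)
{
	char *buf;

	if (len == 0 || len > PATH_MAX)
		return -EINVAL;
	buf = malloc(len);
	if (!buf)
		return -ENOMEM;
	memcpy(buf, src, len);
	buf[len - 1] = '\0';	/* do not trust the image to terminate */
	*fname = buf;
	return (int)len;
}
```

The caller owns the returned buffer and frees it, mirroring the kfree() of
the path in the restore code.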

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/files.c             |  203 +++++++++++++++++++++++++++++++++++++++-
 checkpoint/objhash.c           |   34 +++++++
 checkpoint/process.c           |   17 ++++
 fs/fs_struct.c                 |   21 ++++
 fs/open.c                      |   58 +++++++-----
 include/linux/checkpoint.h     |    8 ++-
 include/linux/checkpoint_hdr.h |   12 +++
 include/linux/fs.h             |    4 +
 include/linux/fs_struct.h      |    2 +
 9 files changed, 331 insertions(+), 28 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 4b551fe..7855bae 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -15,6 +15,9 @@
 #include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/fs_struct.h>
+#include <linux/fs.h>
 #include <linux/fdtable.h>
 #include <linux/fsnotify.h>
 #include <linux/pipe_fs_i.h>
@@ -374,6 +377,62 @@ int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 	return objref;
 }
 
+int checkpoint_obj_fs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct fs_struct *fs;
+	int fs_objref;
+
+	task_lock(t);
+	fs = t->fs;
+	get_fs_struct(fs);
+	task_unlock(t);
+
+	fs_objref = checkpoint_obj(ctx, fs, CKPT_OBJ_FS);
+	put_fs_struct(fs);
+
+	return fs_objref;
+}
+
+/* called with fs refcount bumped so it won't disappear */
+static int do_checkpoint_fs(struct ckpt_ctx *ctx, struct fs_struct *fs)
+{
+	struct ckpt_hdr_fs *h;
+	struct fs_struct *fscopy;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FS);
+	if (!h)
+		return -ENOMEM;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret)
+		return ret;
+
+	fscopy = copy_fs_struct(fs);
+	if (!fscopy)
+		return -ENOMEM;
+
+	ret = checkpoint_fname(ctx, &fscopy->pwd, &ctx->root_fs_path);
+	if (ret < 0) {
+		ckpt_err(ctx, ret, "%(T)writing path of cwd");
+		goto out;
+	}
+	ret = checkpoint_fname(ctx, &fscopy->root, &ctx->root_fs_path);
+	if (ret < 0) {
+		ckpt_err(ctx, ret, "%(T)writing path of fs root");
+		goto out;
+	}
+	ret = 0;
+ out:
+	free_fs_struct(fscopy);
+	return ret;
+}
+
+int checkpoint_fs(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_fs(ctx, (struct fs_struct *) ptr);
+}
+
 /***********************************************************************
  * Collect
  */
@@ -460,10 +519,41 @@ int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+int ckpt_collect_fs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct fs_struct *fs;
+	int ret;
+
+	task_lock(t);
+	fs = t->fs;
+	get_fs_struct(fs);
+	task_unlock(t);
+
+	ret = ckpt_obj_collect(ctx, fs, CKPT_OBJ_FS);
+
+	put_fs_struct(fs);
+	return ret;
+}
+
 /**************************************************************************
  * Restart
  */
 
+static int ckpt_read_fname(struct ckpt_ctx *ctx, char **fname)
+{
+	int len;
+
+	len = ckpt_read_payload(ctx, (void **) fname,
+				PATH_MAX, CKPT_HDR_FILE_NAME);
+	if (len < 0)
+		return len;
+
+	(*fname)[len - 1] = '\0';	/* always play it safe */
+	ckpt_debug("read filename '%s'\n", *fname);
+
+	return len;
+}
+
 /**
  * restore_open_fname - read a file name and open a file
  * @ctx: checkpoint context
@@ -479,11 +569,9 @@ struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
 	if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
 		return ERR_PTR(-EINVAL);
 
-	len = ckpt_read_payload(ctx, (void **) &fname,
-				PATH_MAX, CKPT_HDR_FILE_NAME);
+	len = ckpt_read_fname(ctx, &fname);
 	if (len < 0)
 		return ERR_PTR(len);
-	fname[len - 1] = '\0';	/* always play if safe */
 	ckpt_debug("fname '%s' flags %#x\n", fname, flags);
 
 	file = filp_open(fname, flags, 0);
@@ -819,3 +907,112 @@ int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref)
 
 	return 0;
 }
+
+/*
+ * Called by task restore code to set the restarted task's
+ * current->fs to an entry on the hash
+ */
+int restore_obj_fs(struct ckpt_ctx *ctx, int fs_objref)
+{
+	struct fs_struct *newfs, *oldfs;
+
+	newfs = ckpt_obj_fetch(ctx, fs_objref, CKPT_OBJ_FS);
+	if (IS_ERR(newfs))
+		return PTR_ERR(newfs);
+
+	task_lock(current);
+	get_fs_struct(newfs);
+	oldfs = current->fs;
+	current->fs = newfs;
+	task_unlock(current);
+	put_fs_struct(oldfs);
+
+	return 0;
+}
+
+static int restore_chroot(struct ckpt_ctx *ctx, struct fs_struct *fs, char *name)
+{
+	struct nameidata nd;
+	int ret;
+
+	ckpt_debug("attempting chroot to %s\n", name);
+	ret = path_lookup(name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &nd);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Opening chroot dir %s", name);
+		return ret;
+	}
+	ret = do_chroot(fs, &nd.path);
+	path_put(&nd.path);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Setting chroot %s", name);
+		return ret;
+	}
+	return 0;
+}
+
+static int restore_cwd(struct ckpt_ctx *ctx, struct fs_struct *fs, char *name)
+{
+	struct nameidata nd;
+	int ret;
+
+	ckpt_debug("attempting chdir to %s\n", name);
+	ret = path_lookup(name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &nd);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Opening cwd %s", name);
+		return ret;
+	}
+	ret = do_chdir(fs, &nd.path);
+	path_put(&nd.path);
+	if (ret) {
+		ckpt_err(ctx, ret, "%(T)Setting cwd %s", name);
+		return ret;
+	}
+	return 0;
+}
+
+/*
+ * Called by objhash when it runs into a CKPT_OBJ_FS entry. Creates
+ * an fs_struct with desired chroot/cwd and places it in the hash.
+ */
+static struct fs_struct *do_restore_fs(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_fs *h;
+	struct fs_struct *fs;
+	char *path;
+	int ret = 0;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FS);
+	if (IS_ERR(h))
+		return ERR_CAST(h);
+	ckpt_hdr_put(ctx, h);
+
+	fs = copy_fs_struct(current->fs);
+	if (!fs)
+		return ERR_PTR(-ENOMEM);
+
+	ret = ckpt_read_fname(ctx, &path);
+	if (ret < 0)
+		goto out;
+	ret = restore_cwd(ctx, fs, path);
+	kfree(path);
+	if (ret)
+		goto out;
+
+	ret = ckpt_read_fname(ctx, &path);
+	if (ret < 0)
+		goto out;
+	ret = restore_chroot(ctx, fs, path);
+	kfree(path);
+
+out:
+	if (ret) {
+		free_fs_struct(fs);
+		return ERR_PTR(ret);
+	}
+	return fs;
+}
+
+void *restore_fs(struct ckpt_ctx *ctx)
+{
+	return (void *) do_restore_fs(ctx);
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 84bceec..5c4749d 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -15,6 +15,7 @@
 #include <linux/hash.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/fs_struct.h>
 #include <linux/sched.h>
 #include <linux/ipc_namespace.h>
 #include <linux/user_namespace.h>
@@ -126,6 +127,29 @@ static int obj_mm_users(void *ptr)
 	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
 }
 
+static int obj_fs_grab(void *ptr)
+{
+	get_fs_struct((struct fs_struct *) ptr);
+	return 0;
+}
+
+static void obj_fs_drop(void *ptr, int lastref)
+{
+	put_fs_struct((struct fs_struct *) ptr);
+}
+
+static int obj_fs_users(void *ptr)
+{
+	/*
+	 * It's safe to not use fs->lock because the fs is referenced.
+	 * It's also sufficient for leak detection: with no leak the
+	 * count can't change; with a leak it will be too big already
+	 * (even if it's about to grow), and if it's about to shrink
+	 * then it's as if we sampled the count a bit earlier.
+	 */
+	return ((struct fs_struct *) ptr)->users;
+}
+
 static int obj_sighand_grab(void *ptr)
 {
 	atomic_inc(&((struct sighand_struct *) ptr)->count);
@@ -330,6 +354,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
+	/* fs object */
+	{
+		.obj_name = "FS",
+		.obj_type = CKPT_OBJ_FS,
+		.ref_drop = obj_fs_drop,
+		.ref_grab = obj_fs_grab,
+		.ref_users = obj_fs_users,
+		.checkpoint = checkpoint_fs,
+		.restore = restore_fs,
+	},
 	/* sighand object */
 	{
 		.obj_name = "SIGHAND",
diff --git a/checkpoint/process.c b/checkpoint/process.c
index e0ef795..f917112 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -232,6 +232,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 	struct ckpt_hdr_task_objs *h;
 	int files_objref;
 	int mm_objref;
+	int fs_objref;
 	int sighand_objref;
 	int signal_objref;
 	int first, ret;
@@ -272,6 +273,13 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return mm_objref;
 	}
 
+	/* note: this must come *after* file-table and mm */
+	fs_objref = checkpoint_obj_fs(ctx, t);
+	if (fs_objref < 0) {
+		ckpt_err(ctx, fs_objref, "%(T)process fs\n");
+		return fs_objref;
+	}
+
 	sighand_objref = checkpoint_obj_sighand(ctx, t);
 	ckpt_debug("sighand: objref %d\n", sighand_objref);
 	if (sighand_objref < 0) {
@@ -299,6 +307,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -ENOMEM;
 	h->files_objref = files_objref;
 	h->mm_objref = mm_objref;
+	h->fs_objref = fs_objref;
 	h->sighand_objref = sighand_objref;
 	h->signal_objref = signal_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
@@ -477,6 +486,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	ret = ckpt_collect_mm(ctx, t);
 	if (ret < 0)
 		return ret;
+	ret = ckpt_collect_fs(ctx, t);
+	if (ret < 0)
+		return ret;
 	ret = ckpt_collect_sighand(ctx, t);
 
 	return ret;
@@ -645,6 +657,11 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
 	if (ret < 0)
 		goto out;
 
+	ret = restore_obj_fs(ctx, h->fs_objref);
+	ckpt_debug("fs: ret %d (%p)\n", ret, current->fs);
+	if (ret < 0)
+		goto out;
+
 	ret = restore_obj_sighand(ctx, h->sighand_objref);
 	ckpt_debug("sighand: ret %d (%p)\n", ret, current->sighand);
 	if (ret < 0)
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index eee0590..2a4c6f5 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -6,6 +6,27 @@
 #include <linux/fs_struct.h>
 
 /*
+ * call with owning task locked
+ */
+void get_fs_struct(struct fs_struct *fs)
+{
+	write_lock(&fs->lock);
+	fs->users++;
+	write_unlock(&fs->lock);
+}
+
+void put_fs_struct(struct fs_struct *fs)
+{
+	int kill;
+
+	write_lock(&fs->lock);
+	kill = !--fs->users;
+	write_unlock(&fs->lock);
+	if (kill)
+		free_fs_struct(fs);
+}
+
+/*
  * Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
  * It can block.
  */
diff --git a/fs/open.c b/fs/open.c
index 040cef7..62fc70c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -527,6 +527,18 @@ SYSCALL_DEFINE2(access, const char __user *, filename, int, mode)
 	return sys_faccessat(AT_FDCWD, filename, mode);
 }
 
+int do_chdir(struct fs_struct *fs, struct path *path)
+{
+	int error;
+
+	error = inode_permission(path->dentry->d_inode, MAY_EXEC | MAY_ACCESS);
+	if (error)
+		return error;
+
+	set_fs_pwd(fs, path);
+	return 0;
+}
+
 SYSCALL_DEFINE1(chdir, const char __user *, filename)
 {
 	struct path path;
@@ -534,17 +546,10 @@ SYSCALL_DEFINE1(chdir, const char __user *, filename)
 
 	error = user_path_dir(filename, &path);
 	if (error)
-		goto out;
-
-	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_ACCESS);
-	if (error)
-		goto dput_and_out;
-
-	set_fs_pwd(current->fs, &path);
+		return error;
 
-dput_and_out:
+	error = do_chdir(current->fs, &path);
 	path_put(&path);
-out:
 	return error;
 }
 
@@ -574,31 +579,36 @@ out:
 	return error;
 }
 
-SYSCALL_DEFINE1(chroot, const char __user *, filename)
+int do_chroot(struct fs_struct *fs, struct path *path)
 {
-	struct path path;
 	int error;
 
-	error = user_path_dir(filename, &path);
+	error = inode_permission(path->dentry->d_inode, MAY_EXEC | MAY_ACCESS);
 	if (error)
-		goto out;
+		return error;
+
+	if (!capable(CAP_SYS_CHROOT))
+		return -EPERM;
 
-	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_ACCESS);
+	error = security_path_chroot(path);
 	if (error)
-		goto dput_and_out;
+		return error;
 
-	error = -EPERM;
-	if (!capable(CAP_SYS_CHROOT))
-		goto dput_and_out;
-	error = security_path_chroot(&path);
+	set_fs_root(fs, path);
+	return 0;
+}
+
+SYSCALL_DEFINE1(chroot, const char __user *, filename)
+{
+	struct path path;
+	int error;
+
+	error = user_path_dir(filename, &path);
 	if (error)
-		goto dput_and_out;
+		return error;
 
-	set_fs_root(current->fs, &path);
-	error = 0;
-dput_and_out:
+	error = do_chroot(current->fs, &path);
 	path_put(&path);
-out:
 	return error;
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ca91405..3e0937a 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,7 +10,7 @@
  *  distribution for more details.
  */
 
-#define CHECKPOINT_VERSION  3
+#define CHECKPOINT_VERSION  4
 
 /* checkpoint user flags */
 #define CHECKPOINT_SUBTREE	0x1
@@ -236,6 +236,12 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
 extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
 			       struct ckpt_hdr_file *h);
 
+extern int ckpt_collect_fs(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_fs(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_fs(struct ckpt_ctx *ctx, int fs_objref);
+extern int checkpoint_fs(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_fs(struct ckpt_ctx *ctx);
+
 /* credentials */
 extern int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr);
 extern int checkpoint_user(struct ckpt_ctx *ctx, void *ptr);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0b36430..4dc852d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -131,6 +131,9 @@ enum {
 	CKPT_HDR_MM_CONTEXT,
 #define CKPT_HDR_MM_CONTEXT CKPT_HDR_MM_CONTEXT
 
+	CKPT_HDR_FS = 451,  /* must be after file-table, mm */
+#define CKPT_HDR_FS CKPT_HDR_FS
+
 	CKPT_HDR_IPC = 501,
 #define CKPT_HDR_IPC CKPT_HDR_IPC
 	CKPT_HDR_IPC_SHM,
@@ -201,6 +204,8 @@ enum obj_type {
 #define CKPT_OBJ_FILE CKPT_OBJ_FILE
 	CKPT_OBJ_MM,
 #define CKPT_OBJ_MM CKPT_OBJ_MM
+	CKPT_OBJ_FS,
+#define CKPT_OBJ_FS CKPT_OBJ_FS
 	CKPT_OBJ_SIGHAND,
 #define CKPT_OBJ_SIGHAND CKPT_OBJ_SIGHAND
 	CKPT_OBJ_SIGNAL,
@@ -416,6 +421,7 @@ struct ckpt_hdr_task_objs {
 
 	__s32 files_objref;
 	__s32 mm_objref;
+	__s32 fs_objref;
 	__s32 sighand_objref;
 	__s32 signal_objref;
 } __attribute__((aligned(8)));
@@ -453,6 +459,12 @@ enum restart_block_type {
 };
 
 /* file system */
+struct ckpt_hdr_fs {
+	struct ckpt_hdr h;
+	/* char *fs_pwd */
+	/* char *fs_root */
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_file_table {
 	struct ckpt_hdr h;
 	__s32 fdt_nfds;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7902a51..a1525aa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1818,6 +1818,10 @@ extern void drop_collected_mounts(struct vfsmount *);
 
 extern int vfs_statfs(struct dentry *, struct kstatfs *);
 
+struct fs_struct;
+extern int do_chdir(struct fs_struct *fs, struct path *path);
+extern int do_chroot(struct fs_struct *fs, struct path *path);
+
 extern int current_umask(void);
 
 /* /sys/fs */
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 78a05bf..a73cbcb 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -20,5 +20,7 @@ extern struct fs_struct *copy_fs_struct(struct fs_struct *);
 extern void free_fs_struct(struct fs_struct *);
 extern void daemonize_fs_struct(void);
 extern int unshare_fs_struct(void);
+extern void get_fs_struct(struct fs_struct *);
+extern void put_fs_struct(struct fs_struct *);
 
 #endif /* _LINUX_FS_STRUCT_H */
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace
       [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
                     ` (15 preceding siblings ...)
  2010-03-19  1:00   ` [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3) Oren Laadan
@ 2010-03-19  1:00   ` Oren Laadan
  16 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-19  1:00 UTC (permalink / raw)
  To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andreas Dilger

We only allow c/r when all processes share a single mounts namespace.

We do intend to implement c/r of mounts and mounts namespaces in the
kernel.  It shouldn't be ugly or complicate locking to do so; we just
haven't gotten around to it.  A more complete solution is more than we
want to take on now for v19.

But as much as possible, we'd like everything that we don't support
to be non-checkpointable, since silently checkpointing unsupported
state has in the past invited slanderous accusations of being a toy
implementation :)

Meanwhile, we get the following:
1) Checkpoint bails if not all tasks share the same mnt-ns
2) Leak detection works for full container checkpoint

On restart, all tasks inherit the mnt-ns of the coordinator by
default.  A follow-up patch to user-cr will add a new switch to
'restart' to request the CLONE_NEWNS flag when creating the root task
of the restart.
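
The two guarantees listed above can be modelled in user space as follows.
This is a simplified sketch under stated assumptions: `ns_model`,
`task_model`, `tasks_share_mnt_ns()` and `ns_leaks()` are hypothetical names,
and each task is assumed to hold one direct reference on its namespace,
whereas in the kernel the reference goes through the nsproxy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted namespace object. */
struct ns_model {
	int users;	/* references held anywhere in the system */
};

/* Hypothetical stand-in for a task: only its namespace pointer. */
struct task_model {
	struct ns_model *mnt_ns;
};

/* True if every task in @tasks uses the same mounts namespace. */
static bool tasks_share_mnt_ns(const struct task_model *tasks, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++)
		if (tasks[i].mnt_ns != tasks[0].mnt_ns)
			return false;
	return true;
}

/*
 * True if @ns is referenced from outside the checkpointed set, that is,
 * if its user count exceeds the references reachable from @tasks.
 */
static bool ns_leaks(const struct ns_model *ns,
		     const struct task_model *tasks, size_t n)
{
	int collected = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (tasks[i].mnt_ns == ns)
			collected++;
	return ns->users > collected;
}
```

A checkpoint in this model would refuse to proceed when
tasks_share_mnt_ns() is false, and a full-container checkpoint would also
refuse when ns_leaks() is true.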

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/objhash.c           |   25 +++++++++++++++++++++++++
 include/linux/checkpoint.h     |    2 +-
 include/linux/checkpoint_hdr.h |    4 ++++
 kernel/nsproxy.c               |   16 +++++++++++++---
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 5c4749d..42998b2 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/ipc_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/mnt_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 #include <net/sock.h>
@@ -214,6 +215,22 @@ static int obj_ipc_ns_users(void *ptr)
 	return atomic_read(&((struct ipc_namespace *) ptr)->count);
 }
 
+static int obj_mnt_ns_grab(void *ptr)
+{
+	get_mnt_ns((struct mnt_namespace *) ptr);
+	return 0;
+}
+
+static void obj_mnt_ns_drop(void *ptr, int lastref)
+{
+	put_mnt_ns((struct mnt_namespace *) ptr);
+}
+
+static int obj_mnt_ns_users(void *ptr)
+{
+	return atomic_read(&((struct mnt_namespace *) ptr)->count);
+}
+
 static int obj_cred_grab(void *ptr)
 {
 	get_cred((struct cred *) ptr);
@@ -411,6 +428,14 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_ipc_ns,
 		.restore = restore_ipc_ns,
 	},
+	/* mnt_ns object */
+	{
+		.obj_name = "MOUNTS NS",
+		.obj_type = CKPT_OBJ_MNT_NS,
+		.ref_grab = obj_mnt_ns_grab,
+		.ref_drop = obj_mnt_ns_drop,
+		.ref_users = obj_mnt_ns_users,
+	},
 	/* user_ns object */
 	{
 		.obj_name = "USER_NS",
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3e0937a..64b4b8a 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,7 +10,7 @@
  *  distribution for more details.
  */
 
-#define CHECKPOINT_VERSION  4
+#define CHECKPOINT_VERSION  5
 
 /* checkpoint user flags */
 #define CHECKPOINT_SUBTREE	0x1
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 4dc852d..28dfc36 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -90,6 +90,8 @@ enum {
 #define CKPT_HDR_UTS_NS CKPT_HDR_UTS_NS
 	CKPT_HDR_IPC_NS,
 #define CKPT_HDR_IPC_NS CKPT_HDR_IPC_NS
+	CKPT_HDR_MNT_NS,
+#define CKPT_HDR_MNT_NS CKPT_HDR_MNT_NS
 	CKPT_HDR_CAPABILITIES,
 #define CKPT_HDR_CAPABILITIES CKPT_HDR_CAPABILITIES
 	CKPT_HDR_USER_NS,
@@ -216,6 +218,8 @@ enum obj_type {
 #define CKPT_OBJ_UTS_NS CKPT_OBJ_UTS_NS
 	CKPT_OBJ_IPC_NS,
 #define CKPT_OBJ_IPC_NS CKPT_OBJ_IPC_NS
+	CKPT_OBJ_MNT_NS,
+#define CKPT_OBJ_MNT_NS CKPT_OBJ_MNT_NS
 	CKPT_OBJ_USER_NS,
 #define CKPT_OBJ_USER_NS CKPT_OBJ_USER_NS
 	CKPT_OBJ_CRED,
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 17b048e..0da0d83 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -255,10 +255,17 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
 	 * ipc_ns (shm) may keep references to files: if this is the
 	 * first time we see this ipc_ns (ret > 0), proceed inside.
 	 */
-	if (ret)
+	if (ret) {
 		ret = ckpt_collect_ipc_ns(ctx, nsproxy->ipc_ns);
+		if (ret < 0)
+			goto out;
+	}
 
-	/* TODO: collect other namespaces here */
+	ret = ckpt_obj_collect(ctx, nsproxy->mnt_ns, CKPT_OBJ_MNT_NS);
+	if (ret < 0)
+		goto out;
+
+	ret = 0;
  out:
 	put_nsproxy(nsproxy);
 	return ret;
@@ -282,7 +289,10 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 		goto out;
 	h->ipc_objref = ret;
 
-	/* TODO: Write other namespaces here */
+	/* FIXME: for now, only mark as visited to pacify leak detection */
+	ret = ckpt_obj_visit(ctx, nsproxy->mnt_ns, CKPT_OBJ_MNT_NS);
+	if (ret < 0)
+		goto out;
 
 	ret = ckpt_write_obj(ctx, &h->h);
  out:
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]   ` <1268960401-16680-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-19 23:19     ` Andreas Dilger
  2010-03-22 10:30     ` Nick Piggin
  1 sibling, 0 replies; 88+ messages in thread
From: Andreas Dilger @ 2010-03-19 23:19 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On 2010-03-18, at 18:59, Oren Laadan wrote:
> +int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path,  
> struct path *root)
> +{
> +	fname = ckpt_fill_fname(path, root, buf, &flen);
> +	if (!IS_ERR(fname)) {
> +		ret = ckpt_write_obj_type(ctx, fname, flen,
> +					  CKPT_HDR_FILE_NAME);

What is the intended use case for the checkpoint/restore being  
developed here?  It seems like a major risk to do the checkpoint using  
the filename, since this is not guaranteed to stay constant and the  
restore may give you a different state than what was running when the  
checkpoint was done.  Storing a file handle in the checkpoint, instead  
of (or in addition to) the filename would allow restoring the state  
correctly.

Note that you would also need to store some kind of FSID as part of  
the file handle, which is a functionality that would be desirable for  
Aneesh's recent open_by_handle() patches as well, so getting this  
right once would be of use to both projects.

That said, if the intent is to allow the restore to be done on another  
node with a "similar" filesystem (e.g. created by rsync/node image),  
instead of having a coherent distributed filesystem on all of the  
nodes then the filename makes sense.

I would recommend storing both the file handle+FSID and the filename,
preferring the former for "100% correct" restores on the same node,  
and the latter for being able to restore on a similar node (e.g.  
system files and such that are expected to be the same on all nodes,  
but do not necessarily have the same inode number).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
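
The suggestion in the message above, to record an identity alongside the
filename, can be illustrated with a small user-space sketch. It is not a file
handle implementation: the (st_dev, st_ino) pair from stat() is used here
purely as a stand-in for "file handle + FSID", and `file_ident`,
`ident_save()` and `ident_reopen()` are hypothetical names. A strict reopen
models a same-node restore; a non-strict reopen models a restore on a similar
node where only the path is expected to match.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record: a path plus the identity observed at save time. */
struct file_ident {
	char path[256];
	dev_t dev;
	ino_t ino;
};

/* Save the path and the (device, inode) pair it currently resolves to. */
static int ident_save(struct file_ident *id, const char *path)
{
	struct stat st;

	if (strlen(path) >= sizeof(id->path) || stat(path, &st) < 0)
		return -1;
	strcpy(id->path, path);
	id->dev = st.st_dev;
	id->ino = st.st_ino;
	return 0;
}

/*
 * Reopen by path; if @strict, refuse a file whose identity differs from
 * the saved one (same-node restore). Returns an fd or -1.
 */
static int ident_reopen(const struct file_ident *id, bool strict)
{
	struct stat st;
	int fd = open(id->path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (strict && (fstat(fd, &st) < 0 ||
		       st.st_dev != id->dev || st.st_ino != id->ino)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Unlike a real file handle, an inode number can be reused after the original
file is deleted, so this check is weaker than what the message proposes.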

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]     ` <F18D161D-850B-4C82-83D5-1F19D573E84F-xsfywfwIY+M@public.gmane.org>
@ 2010-03-20  4:43       ` Matt Helsley
  0 siblings, 0 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-20  4:43 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Fri, Mar 19, 2010 at 05:19:22PM -0600, Andreas Dilger wrote:
> On 2010-03-18, at 18:59, Oren Laadan wrote:
> >+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path,
> >struct path *root)
> >+{
> >+	fname = ckpt_fill_fname(path, root, buf, &flen);
> >+	if (!IS_ERR(fname)) {
> >+		ret = ckpt_write_obj_type(ctx, fname, flen,
> >+					  CKPT_HDR_FILE_NAME);
> 
> What is the intended use case for the checkpoint/restore being
> developed here?  It seems like a major risk to do the checkpoint

Yes, as you anticipated below, we want to be able to migrate the
image to a similar node.

> using the filename, since this is not guaranteed to stay constant
> and the restore may give you a different state than what was running
> when the checkpoint was done.  Storing a file handle in the

We're aware of this.

Our assumption is userspace will freeze the filesystem and/or take
suitable snapshots (e.g. with btrfs) while the tasks being checkpointed
are also frozen. If userspace wants to freeze everything but the task
performing the checkpoint then that's fine too.

We decided to have userspace checkpoint the filesystem contents because
it will likely take an extraordinarily long time. We anticipate that
userspace will want to take advantage of many time-saving strategies
which would be impossible to anticipate perfectly for our kernel
syscall ABI.

Even though a wide set of time-saving strategies is available,
the goal is to keep the checkpoint image format and content
independent of the tools that perform migration.

> checkpoint, instead of (or in addition to) the filename would allow
> restoring the state correctly.
>
> Note that you would also need to store some kind of FSID as part of
> the file handle, which is a functionality that would be desirable
> for Aneesh's recent open_by_handle() patches as well, so getting
> this right once would be of use to both projects.

I haven't looked at those, sorry. It may be useful but I think
there's room for adding that in the future as you hinted above.
My guess is, depending on the environment of the restarting machine,
an FSID might not even be enough. Again -- I need to find some time
to review those patches before I can be sure :).

Userspace coordinates the management of the nodes and thus knows
best how to map things like major:minor, /dev/foo, and/or
uuids to the appropriate "things" when it comes time to restart.
The best the kernel can do is provide all of those so that userspace
can make the choices it needs to. However, most of that information is
already available via /proc in mountinfo or via other userspace tools.
So we don't save it in the image nor do we provide new interfaces to
get it.

> That said, if the intent is to allow the restore to be done on
> another node with a "similar" filesystem (e.g. created by rsync/node
> image), instead of having a coherent distributed filesystem on all
> of the nodes then the filename makes sense.

Yes, this is the intent.

> I would recommend to store both the file handle+FSID and the
> filename, preferring the former for "100% correct" restores on the
> same node, and the latter for being able to restore on a similar
> node (e.g. system files and such that are expected to be the same on
> all nodes, but do not necessarily have the same inode number).

This sounds like a good idea for the future. However I do not think
inclusion of our patches should be predicated on this since the patches
are still useful for local restart (thanks to things like mount namespaces)
and migration without file handles.

Thanks for having a look at these!

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-19 23:19   ` Andreas Dilger
       [not found]     ` <F18D161D-850B-4C82-83D5-1F19D573E84F-xsfywfwIY+M@public.gmane.org>
@ 2010-03-20  4:43     ` Matt Helsley
  2010-03-21 17:27       ` Jamie Lokier
       [not found]       ` <20100320044310.GC2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 2 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-20  4:43 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Oren Laadan, linux-fsdevel, containers, Matt Helsley

On Fri, Mar 19, 2010 at 05:19:22PM -0600, Andreas Dilger wrote:
> On 2010-03-18, at 18:59, Oren Laadan wrote:
> >+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path,
> >struct path *root)
> >+{
> >+	fname = ckpt_fill_fname(path, root, buf, &flen);
> >+	if (!IS_ERR(fname)) {
> >+		ret = ckpt_write_obj_type(ctx, fname, flen,
> >+					  CKPT_HDR_FILE_NAME);
> 
> What is the intended use case for the checkpoint/restore being
> developed here?  It seems like a major risk to do the checkpoint

Yes, as you anticipated below, we want to be able to migrate the
image to a similar node.

> using the filename, since this is not guaranteed to stay constant
> and the restore may give you a different state than what was running
> when the checkpoint was done.  Storing a file handle in the

We're aware of this.

Our assumption is userspace will freeze the filesystem and/or take
suitable snapshots (e.g. with btrfs) while the tasks being checkpointed
are also frozen. If userspace wants to freeze everything but the task
performing the checkpoint then that's fine too.

We decided to have userspace checkpoint the filesystem contents because
it will likely take an extraordinarily long time. We anticipate that
userspace will want to take advantage of many time-saving strategies
which would be impossible to anticipate perfectly for our kernel
syscall ABI.

Even though a wide set of time-saving strategies is available,
the goal is to keep the checkpoint image format and content
independent of the tools that perform migration.

> checkpoint, instead of (or in addition to) the filename would allow
> restoring the state correctly.
>
> Note that you would also need to store some kind of FSID as part of
> the file handle, which is a functionality that would be desirable
> for Aneesh's recent open_by_handle() patches as well, so getting
> this right once would be of use to both projects.

I haven't looked at those, sorry. It may be useful but I think
there's room for adding that in the future as you hinted above.
My guess is, depending on the environment of the restarting machine,
an FSID might not even be enough. Again -- I need to find some time
to review those patches before I can be sure :).

Userspace coordinates the management of the nodes and thus knows
best how to map things like major:minor, /dev/foo, and/or
uuids to the appropriate "things" when it comes time to restart.
The best the kernel can do is provide all of those so that userspace
can make the choices it needs to. However, most of that information is
already available via /proc in mountinfo or via other userspace tools.
So we don't save it in the image nor do we provide new interfaces to
get it.
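
As a rough illustration of the point above (userspace can already read the device and mount mapping without help from the checkpoint image), here is a minimal sketch that parses /proc/self/mountinfo. The struct and function names (mount_entry, parse_mountinfo_line, lookup_mount) are made up for this example and are not part of the patch set.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record: just the fields a restart tool might want to map. */
struct mount_entry {
	unsigned int dev_major, dev_minor;	/* matches st_dev of files on the mount */
	char mount_point[256];
	char fstype[64];
};

/*
 * Parse one line of /proc/<pid>/mountinfo, e.g.
 *   36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw
 * Returns 0 on success, -1 if the line is malformed.
 */
int parse_mountinfo_line(const char *line, struct mount_entry *m)
{
	const char *sep;
	int id, parent;

	assert(line != NULL && m != NULL);
	/* mount ID, parent ID, major:minor, root (skipped), mount point */
	if (sscanf(line, "%d %d %u:%u %*s %255s", &id, &parent,
		   &m->dev_major, &m->dev_minor, m->mount_point) != 5)
		return -1;
	/* optional fields end at a lone "-"; the fs type follows it */
	sep = strstr(line, " - ");
	if (!sep || sscanf(sep + 3, "%63s", m->fstype) != 1)
		return -1;
	return 0;
}

/* Find the entry for `mount_point` in our own mount namespace. */
int lookup_mount(const char *mount_point, struct mount_entry *out)
{
	char line[4096];
	FILE *f = fopen("/proc/self/mountinfo", "r");
	int ret = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		struct mount_entry m;

		if (parse_mountinfo_line(line, &m) == 0 &&
		    strcmp(m.mount_point, mount_point) == 0) {
			*out = m;
			ret = 0;	/* keep going: later mounts shadow earlier ones */
		}
	}
	fclose(f);
	return ret;
}
```

A restart tool could call lookup_mount() on both nodes and build its own major:minor translation table from the results; how it does that is policy, which is why it stays in userspace.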

> That said, if the intent is to allow the restore to be done on
> another node with a "similar" filesystem (e.g. created by rsync/node
> image), instead of having a coherent distributed filesystem on all
> of the nodes then the filename makes sense.

Yes, this is the intent.

> I would recommend to store both the file handle+FSID and the
> filename, preferring the former for "100% correct" restores on the
> same node, and the latter for being able to restore on a similar
> node (e.g. system files and such that are expected to be the same on
> all nodes, but do not necessarily have the same inode number).

This sounds like a good idea for the future. However I do not think
inclusion of our patches should be predicated on this since the patches
are still useful for local restart (thanks to things like mount namespaces)
and migration without file handles.

Thanks for having a look at these!

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]       ` <20100320044310.GC2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-21 17:27         ` Jamie Lokier
  0 siblings, 0 replies; 88+ messages in thread
From: Jamie Lokier @ 2010-03-21 17:27 UTC (permalink / raw)
  To: Matt Helsley
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andreas Dilger,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Matt Helsley wrote:
> > That said, if the intent is to allow the restore to be done on
> > another node with a "similar" filesystem (e.g. created by rsync/node
> > image), instead of having a coherent distributed filesystem on all
> > of the nodes then the filename makes sense.
> 
> Yes, this is the intent.

I would worry about programs which are using files which have been
deleted, renamed, or (very common) renamed-over by another process
after being opened, as there's a good chance they will successfully
open the wrong file after c/r, and corrupt state from then on.

This can be avoided by ensuring every checkpointed application is
specially "c/r aware", but that makes the feature a lot less
attractive, as well as uncomfortably unsafe to use on arbitrary
processes.  Ideally, c/r would fail on some types of process
(e.g. using sockets), but at least fail in a safe way that does not
lead to quiet data corruption.

-- Jamie

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-20  4:43     ` Matt Helsley
@ 2010-03-21 17:27       ` Jamie Lokier
  2010-03-21 19:40         ` Serge E. Hallyn
                           ` (2 more replies)
       [not found]       ` <20100320044310.GC2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 3 replies; 88+ messages in thread
From: Jamie Lokier @ 2010-03-21 17:27 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Andreas Dilger, Oren Laadan, linux-fsdevel, containers

Matt Helsley wrote:
> > That said, if the intent is to allow the restore to be done on
> > another node with a "similar" filesystem (e.g. created by rsync/node
> > image), instead of having a coherent distributed filesystem on all
> > of the nodes then the filename makes sense.
> 
> Yes, this is the intent.

I would worry about programs which are using files which have been
deleted, renamed, or (very common) renamed-over by another process
after being opened, as there's a good chance they will successfully
open the wrong file after c/r, and corrupt state from then on.

This can be avoided by ensuring every checkpointed application is
specially "c/r aware", but that makes the feature a lot less
attractive, as well as uncomfortably unsafe to use on arbitrary
processes.  Ideally, c/r would fail on some types of process
(e.g. using sockets), but at least fail in a safe way that does not
lead to quiet data corruption.
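
To make the rename-over hazard concrete, here is a small userspace sketch (not code from the patches). It compares the device and inode of an open descriptor with whatever the saved name resolves to now; the function names are hypothetical. A restart that only has the filename cannot make this comparison unless some identity was also recorded at checkpoint time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if `path` still names the file open on `fd`, 0 if it now
 * names a different file (or nothing at all), -1 if fstat() fails.
 */
int path_still_names_fd(int fd, const char *path)
{
	struct stat by_fd, by_path;

	assert(path != NULL);
	if (fstat(fd, &by_fd) < 0)
		return -1;
	if (stat(path, &by_path) < 0)
		return 0;	/* deleted, or no longer reachable */
	return by_fd.st_dev == by_path.st_dev &&
	       by_fd.st_ino == by_path.st_ino;
}

/*
 * Hypothetical demo: open a file, then rename a second file over its
 * name, as an editor or package manager might. Returns what
 * path_still_names_fd() reports afterwards, or -1 on setup failure.
 */
int demo_rename_over(void)
{
	char old_name[] = "/tmp/cr_old_XXXXXX";
	char new_name[] = "/tmp/cr_new_XXXXXX";
	int old_fd, new_fd, ret = -1;

	old_fd = mkstemp(old_name);
	if (old_fd < 0)
		return -1;
	new_fd = mkstemp(new_name);
	if (new_fd < 0)
		goto out_old;

	if (rename(new_name, old_name) == 0)
		ret = path_still_names_fd(old_fd, old_name);

	close(new_fd);
	unlink(new_name);	/* only still exists if rename() failed */
out_old:
	close(old_fd);
	unlink(old_name);
	return ret;
}
```

After the rename, the old descriptor still refers to the original inode while the name refers to the replacement, so reopening by name would silently pick up the wrong file.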

-- Jamie

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]         ` <20100321172703.GC4174-yetKDKU6eevNLxjTenLetw@public.gmane.org>
@ 2010-03-21 19:40           ` Serge E. Hallyn
  2010-03-22  1:06           ` Matt Helsley
  1 sibling, 0 replies; 88+ messages in thread
From: Serge E. Hallyn @ 2010-03-21 19:40 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Andreas Dilger

Quoting Jamie Lokier (jamie-yetKDKU6eevNLxjTenLetw@public.gmane.org):
> Matt Helsley wrote:
> > > That said, if the intent is to allow the restore to be done on
> > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > image), instead of having a coherent distributed filesystem on all
> > > of the nodes then the filename makes sense.
> > 
> > Yes, this is the intent.
> 
> I would worry about programs which are using files which have been
> deleted, renamed, or (very common) renamed-over by another process
> after being opened, as there's a good chance they will successfully
> open the wrong file after c/r, and corrupt state from then on.

Userspace is expected to back up and restore the filesystem, for
instance using a btrfs snapshot or a simple rsync or tar.

If we detect anything which really is not supported (for instance
inotify for now) then we fail and leave a log message explaining the
failure.

thanks,
-serge

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-21 17:27       ` Jamie Lokier
@ 2010-03-21 19:40         ` Serge E. Hallyn
  2010-03-21 20:58           ` Daniel Lezcano
       [not found]           ` <20100321194019.GA11714-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
       [not found]         ` <20100321172703.GC4174-yetKDKU6eevNLxjTenLetw@public.gmane.org>
  2010-03-22  1:06         ` Matt Helsley
  2 siblings, 2 replies; 88+ messages in thread
From: Serge E. Hallyn @ 2010-03-21 19:40 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Matt Helsley, linux-fsdevel, Andreas Dilger, containers

Quoting Jamie Lokier (jamie@shareable.org):
> Matt Helsley wrote:
> > > That said, if the intent is to allow the restore to be done on
> > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > image), instead of having a coherent distributed filesystem on all
> > > of the nodes then the filename makes sense.
> > 
> > Yes, this is the intent.
> 
> I would worry about programs which are using files which have been
> deleted, renamed, or (very common) renamed-over by another process
> after being opened, as there's a good chance they will successfully
> open the wrong file after c/r, and corrupt state from then on.

Userspace is expected to back up and restore the filesystem, for
instance using a btrfs snapshot or a simple rsync or tar.

If we detect anything which really is not supported (for instance
inotify for now) then we fail and leave a log message explaining the
failure.

thanks,
-serge

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]           ` <20100321194019.GA11714-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
@ 2010-03-21 20:58             ` Daniel Lezcano
  0 siblings, 0 replies; 88+ messages in thread
From: Daniel Lezcano @ 2010-03-21 20:58 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Jamie Lokier, Andreas Dilger

Serge E. Hallyn wrote:
> Quoting Jamie Lokier (jamie-yetKDKU6eevNLxjTenLetw@public.gmane.org):
>   
>> Matt Helsley wrote:
>>     
>>>> That said, if the intent is to allow the restore to be done on
>>>> another node with a "similar" filesystem (e.g. created by rsync/node
>>>> image), instead of having a coherent distributed filesystem on all
>>>> of the nodes then the filename makes sense.
>>>>         
>>> Yes, this is the intent.
>>>       
>> I would worry about programs which are using files which have been
>> deleted, renamed, or (very common) renamed-over by another process
>> after being opened, as there's a good chance they will successfully
>> open the wrong file after c/r, and corrupt state from then on.
>>     
>
> Userspace is expected to back up and restore the filesystem, for
> instance using a btrfs snapshot or a simple rsync or tar.
>
>   
That does not solve the problem Jamie is talking about.
An rsync or a tar will not see a deleted file, and using btrfs just to
make C/R work with deleted files is a bit overkill, no?
I have another question about deleted files. How is the case handled
when a process has a deleted file mapped but no associated file
descriptor?

> If we detect anything which really is not supported (for instance
> inotify for now) then we fail and leave a log message explaining the
> failure.
>   

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-21 19:40         ` Serge E. Hallyn
@ 2010-03-21 20:58           ` Daniel Lezcano
  2010-03-21 21:36             ` Oren Laadan
                               ` (2 more replies)
       [not found]           ` <20100321194019.GA11714-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
  1 sibling, 3 replies; 88+ messages in thread
From: Daniel Lezcano @ 2010-03-21 20:58 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Jamie Lokier, linux-fsdevel, containers, Andreas Dilger

Serge E. Hallyn wrote:
> Quoting Jamie Lokier (jamie@shareable.org):
>   
>> Matt Helsley wrote:
>>     
>>>> That said, if the intent is to allow the restore to be done on
>>>> another node with a "similar" filesystem (e.g. created by rsync/node
>>>> image), instead of having a coherent distributed filesystem on all
>>>> of the nodes then the filename makes sense.
>>>>         
>>> Yes, this is the intent.
>>>       
>> I would worry about programs which are using files which have been
>> deleted, renamed, or (very common) renamed-over by another process
>> after being opened, as there's a good chance they will successfully
>> open the wrong file after c/r, and corrupt state from then on.
>>     
>
> Userspace is expected to back up and restore the filesystem, for
> instance using a btrfs snapshot or a simple rsync or tar.
>
>   
That does not solve the problem Jamie is talking about.
An rsync or a tar will not see a deleted file, and using btrfs just to
make C/R work with deleted files is a bit overkill, no?
I have another question about deleted files. How is the case handled
when a process has a deleted file mapped but no associated file
descriptor?

> If we detect anything which really is not supported (for instance
> inotify for now) then we fail and leave a log message explaining the
> failure.
>   


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]             ` <4BA68884.3080003-GANU6spQydw@public.gmane.org>
@ 2010-03-21 21:36               ` Oren Laadan
  2010-03-22  2:12               ` Matt Helsley
  1 sibling, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-21 21:36 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Jamie Lokier, Andreas Dilger



Daniel Lezcano wrote:
> Serge E. Hallyn wrote:
>> Quoting Jamie Lokier (jamie-yetKDKU6eevNLxjTenLetw@public.gmane.org):
>>   
>>> Matt Helsley wrote:
>>>     
>>>>> That said, if the intent is to allow the restore to be done on
>>>>> another node with a "similar" filesystem (e.g. created by rsync/node
>>>>> image), instead of having a coherent distributed filesystem on all
>>>>> of the nodes then the filename makes sense.
>>>>>         
>>>> Yes, this is the intent.
>>>>       
>>> I would worry about programs which are using files which have been
>>> deleted, renamed, or (very common) renamed-over by another process
>>> after being opened, as there's a good chance they will successfully
>>> open the wrong file after c/r, and corrupt state from then on.
>>>     
>> Userspace is expected to back up and restore the filesystem, for
>> instance using a btrfs snapshot or a simple rsync or tar.
>>
>>   
> That does not solve the problem Jamie is talking about.
> An rsync or a tar will not see a deleted file, and using btrfs just to
> make C/R work with deleted files is a bit overkill, no?

Let's separate the issues of file system snapshot and deleted files.

1) File system snapshot:
------------------------
The requirement is to preserve the file system state between the time
of the checkpoint and the time of the restart, because userspace will
expect it to remain the same.

The alternatives are:

a) Use capable file system, like brfs, or (modified) nilfs.

b) Userspace saves the state e.g. w/ tar or rsync (maybe incremental)

c) Assume/expect that the file system isn't modified between checkpoint
and restart (e.g. if we use c/r to suspend a user's session)

d) Expect userspace to adapt to changes if they occur, e.g. by having
the application be aware of the possibility, or by providing a wrapper
that will do some magic prior to restart (by looking at the checkpoint
image).

Options a, b and c are all transparent to the application, while option
d requires that applications become aware of c/r. That's ok, but our
primary goal is to be generic enough for unmodified applications.

2) Deleted files:
-----------------
The requirement is that at restart we'll be able to restore the file
pointer in the kernel to a deleted file with the same properties and
contents as it had at the time of the checkpoint.

The alternatives we considered are:

e) For each deleted file, save the contents of that file as part of
the checkpoint image;
At restart - create a new file, populate with the contents, open it
(to get an active file pointer), and finally unlink it, so it is -
again - deleted.

f) At checkpoint time, create a file (from scratch) in a dedicated
area of the file system (userspace configurable?), and copy the
contents of the deleted file to this file. Only save the file system
state after this is done.
At restart, open the alternative file instead, and then immediately
delete it.

g) At checkpoint time, re-link the file to a dedicated area of the
file system. This requires support from the underlying file system,
of course. For instance, it's trivial for ext2,3 but IIRC will need
help for ext4. Re-linking is essentially attaching a new filename
to an existing inode that is still referenced but is otherwise not
reachable - and make it reachable again.
At restart, open the re-linked file and then immediately delete it.

> I have another question about deleted files. How is the case handled
> when a process has a deleted file mapped but no associated file
> descriptor?
> 

It works the same as with non-deleted files (assuming that we know
how to handle deleted files in general, e.g. options e, f, g above):

To checkpoint a task's mm we loop through the vma's and checkpoint
them. For a vma that corresponds to a mapped file, we first save
the vma->vm_file. In turn, for a file pointer we save the filename,
properties, credentials. A file pointer is saved as an independent
object - and is assigned a unique id - objref. The state of the vma
will indicate this objref.

At restart, we will first see the file pointer object, and will
open the file to create a corresponding file pointer. Later when
we restore the vma, we'll locate the (new) file pointer using the
objref and use it in mmap.

Oren.


>> If we detect anything which really is not supported (for instance
>> inotify for now) then we fail and leave a log message explaining the
>> failure.
>>   
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-21 20:58           ` Daniel Lezcano
@ 2010-03-21 21:36             ` Oren Laadan
       [not found]               ` <4BA6914D.8040007-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2010-03-22  8:40               ` Daniel Lezcano
       [not found]             ` <4BA68884.3080003-GANU6spQydw@public.gmane.org>
  2010-03-22  2:12             ` Matt Helsley
  2 siblings, 2 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-21 21:36 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Serge E. Hallyn, linux-fsdevel, containers, Jamie Lokier, Andreas Dilger



Daniel Lezcano wrote:
> Serge E. Hallyn wrote:
>> Quoting Jamie Lokier (jamie@shareable.org):
>>   
>>> Matt Helsley wrote:
>>>     
>>>>> That said, if the intent is to allow the restore to be done on
>>>>> another node with a "similar" filesystem (e.g. created by rsync/node
>>>>> image), instead of having a coherent distributed filesystem on all
>>>>> of the nodes then the filename makes sense.
>>>>>         
>>>> Yes, this is the intent.
>>>>       
>>> I would worry about programs which are using files which have been
>>> deleted, renamed, or (very common) renamed-over by another process
>>> after being opened, as there's a good chance they will successfully
>>> open the wrong file after c/r, and corrupt state from then on.
>>>     
>> Userspace is expected to back up and restore the filesystem, for
>> instance using a btrfs snapshot or a simple rsync or tar.
>>
>>   
> That does not solve the problem Jamie is talking about.
> An rsync or a tar will not see a deleted file, and using btrfs just to
> make C/R work with deleted files is a bit overkill, no?

Let's separate the issues of file system snapshot and deleted files.

1) File system snapshot:
------------------------
The requirement is to preserve the file system state between the time
of the checkpoint and the time of the restart, because userspace will
expect it to remain the same.

The alternatives are:

a) Use capable file system, like brfs, or (modified) nilfs.

b) Userspace saves the state e.g. w/ tar or rsync (maybe incremental)

c) Assume/expect that the file system isn't modified between checkpoint
and restart (e.g. if we use c/r to suspend a user's session)

d) Expect userspace to adapt to changes if they occur, e.g. by having
the application be aware of the possibility, or by providing a wrapper
that will do some magic prior to restart (by looking at the checkpoint
image).

Options a, b and c are all transparent to the application, while option
d requires that applications become aware of c/r. That's ok, but our
primary goal is to be generic enough for unmodified applications.

2) Deleted files:
-----------------
The requirement is that at restart we'll be able to restore the file
pointer in the kernel to a deleted file with the same properties and
contents as it had at the time of the checkpoint.

The alternatives we considered are:

e) For each deleted file, save the contents of that file as part of
the checkpoint image;
At restart - create a new file, populate with the contents, open it
(to get an active file pointer), and finally unlink it, so it is -
again - deleted.

f) At checkpoint time, create a file (from scratch) in a dedicated
area of the file system (userspace configurable?), and copy the
contents of the deleted file to this file. Only save the file system
state after this is done.
At restart, open the alternative file instead, and then immediately
delete it.

g) At checkpoint time, re-link the file to a dedicated area of the
file system. This requires support from the underlying file system,
of course. For instance, it's trivial for ext2,3 but IIRC will need
help for ext4. Re-linking is essentially attaching a new filename
to an existing inode that is still referenced but is otherwise not
reachable - and make it reachable again.
At restart, open the re-linked file and then immediately delete it.
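
For illustration only, the restart half of option (e) can be sketched in userspace as below: create a file, fill it with the saved contents, then unlink it so the descriptor again refers to a deleted file. The function name restore_deleted_file and the idea of passing a scratch directory and a saved offset are assumptions of this sketch, not the interface in the patches; restoring flags, ownership and timestamps is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Recreate a deleted file from contents saved in the checkpoint image.
 * `dir` is a scratch directory on the right filesystem, `data`/`len` are
 * the saved contents and `pos` the saved file offset.
 * Returns an open fd to an unlinked file, or -1 on failure.
 */
int restore_deleted_file(const char *dir, const void *data, size_t len,
			 off_t pos)
{
	char tmpl[4096];
	const char *p = data;
	size_t left = len;
	int fd, n;

	assert(dir != NULL && (data != NULL || len == 0));
	n = snprintf(tmpl, sizeof(tmpl), "%s/cr_deleted_XXXXXX", dir);
	if (n < 0 || (size_t)n >= sizeof(tmpl))
		return -1;

	fd = mkstemp(tmpl);		/* create and open */
	if (fd < 0)
		return -1;

	while (left > 0) {		/* populate */
		ssize_t w = write(fd, p, left);

		if (w < 0)
			goto fail;
		p += w;
		left -= (size_t)w;
	}
	if (lseek(fd, pos, SEEK_SET) < 0)
		goto fail;
	if (unlink(tmpl) < 0)		/* deleted again; fd keeps the inode */
		goto fail;
	return fd;
fail:
	close(fd);
	unlink(tmpl);
	return -1;
}
```

The cost of this approach is that the checkpoint image grows by the size of every deleted file, which is the trade-off against options (f) and (g).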

> I have another question about deleted files. How is the case handled
> when a process has a deleted file mapped but no associated file
> descriptor?
> 

It works the same as with non-deleted files (assuming that we know
how to handle deleted files in general, e.g. options e, f, g above):

To checkpoint a task's mm we loop through the vma's and checkpoint
them. For a vma that corresponds to a mapped file, we first save
the vma->vm_file. In turn, for a file pointer we save the filename,
properties, credentials. A file pointer is saved as an independent
object - and is assigned a unique id - objref. The state of the vma
will indicate this objref.

At restart, we will first see the file pointer object, and will
open the file to create a corresponding file pointer. Later when
we restore the vma, we'll locate the (new) file pointer using the
objref and use it in mmap.
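
The objref bookkeeping described above can be sketched, very roughly, as a small table. This is a simplified userspace model with made-up names (objtab, objref_lookup_add, objref_insert, objref_resolve) and a fixed-size array; the real code keeps its own object hash in the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 64

/* Slot i holds the object whose objref is i + 1; objref 0 means "none". */
struct objtab {
	const void *ptr[MAX_OBJS];
	int count;
};

/*
 * Checkpoint side: return the objref already assigned to `obj`, or assign
 * the next free one. *is_new tells the caller whether the object itself
 * (e.g. the file pointer state) still has to be written to the image.
 */
int objref_lookup_add(struct objtab *t, const void *obj, int *is_new)
{
	assert(t != NULL && is_new != NULL);
	for (int i = 0; i < t->count; i++) {
		if (t->ptr[i] == obj) {
			*is_new = 0;
			return i + 1;
		}
	}
	if (t->count == MAX_OBJS)
		return -1;
	t->ptr[t->count++] = obj;
	*is_new = 1;
	return t->count;
}

/* Restart side: remember the newly created object under its saved objref. */
int objref_insert(struct objtab *t, int objref, const void *obj)
{
	if (objref < 1 || objref > MAX_OBJS)
		return -1;
	t->ptr[objref - 1] = obj;
	if (objref > t->count)
		t->count = objref;
	return 0;
}

/* Restart side: find the new object for an objref seen in, say, a vma. */
const void *objref_resolve(const struct objtab *t, int objref)
{
	if (objref < 1 || objref > t->count)
		return NULL;
	return t->ptr[objref - 1];
}
```

With this shape, a file that is both open as a descriptor and mapped by a vma is written once, and both users refer to it by the same objref at restart.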

Oren.


>> If we detect anything which really is not supported (for instance
>> inotify for now) then we fail and leave a log message explaining the
>> failure.
>>   
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]               ` <4BA6914D.8040007-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-21 23:31                 ` xing lin
  2010-03-22  8:40                 ` Daniel Lezcano
  1 sibling, 0 replies; 88+ messages in thread
From: xing lin @ 2010-03-21 23:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Andreas Dilger, Jamie Lokier

Hi, I am Xing, a PhD candidate at the University of Utah. I am also quite
interested in container-based virtualization, which is why I subscribed to
this mailing list. :)

Now I am working on container migration for OpenVZ in Emulab. I have just
begun to hack on the OpenVZ kernel.

On Sun, Mar 21, 2010 at 3:36 PM, Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> wrote:

>
> Let's separate the issues of file system snapshot and deleted files.
>
> 1) File system snapshot:
> ------------------------
> The requirement is to preserve the file system state between the time
> of the checkpoint and the time of the restart, because userspace will
> expect it to remain the same.
>
> The alternatives are:
>
> a) Use capable file system, like brfs, or (modified) nilfs.
>

Do you mean btrfs? Both of these file systems support snapshots. Sounds
great.

> b) Userspace saves the state e.g. w/ tar or rsync (maybe incremental)
>
> c) Assume/expect that the file system isn't modified between checkpoint
> and restart (e.g. if we use c/r to suspend a user's session)
>

This is what OpenVZ does. OpenVZ assumes the underlying file system stays
consistent between checkpoint and restart. If a file does not exist when
restoring the container, the restore fails (with a message showing which
file could not be found). OpenVZ also does not support NFS: if an NFS
file system is mounted inside the container, the container cannot be
suspended. Since we want to enable container migration in Emulab, we are
trying to solve these issues. NFS is the big one, since almost all users
store their files in home directories that are mounted from the NFS
server. We are still discussing how to deal with this.

-- 
Regards,
Xing
School of Computing, University of Utah
http://www.cs.utah.edu/~xinglin/

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]         ` <20100321172703.GC4174-yetKDKU6eevNLxjTenLetw@public.gmane.org>
  2010-03-21 19:40           ` Serge E. Hallyn
@ 2010-03-22  1:06           ` Matt Helsley
  1 sibling, 0 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-22  1:06 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andreas Dilger,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Sun, Mar 21, 2010 at 05:27:03PM +0000, Jamie Lokier wrote:
> Matt Helsley wrote:
> > > That said, if the intent is to allow the restore to be done on
> > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > image), instead of having a coherent distributed filesystem on all
> > > of the nodes then the filename makes sense.
> > 
> > Yes, this is the intent.
> 
> I would worry about programs which are using files which have been
> deleted, renamed, or (very common) renamed-over by another process
> after being opened, as there's a good chance they will successfully
> open the wrong file after c/r, and corrupt state from then on.

The code in the patches does check for unlinked files and refuses
to checkpoint if an unlinked file is open. Yes, this limits the usefulness
of the code somewhat but it's a problem we can solve and c/r is still quite
useful without the solution.

My favorite solution for unlinked files is keeping the contents of the file
in the checkpoint image. Another solution is relinking it to a new "safe"
location in the filesystem. Determining the "safe" location is not very clean
because we need one "safe" location per filesystem being backed-up. Hence I
tend to favor the first approach. Neither solution is implemented
and thoroughly tested yet though.

These solutions are needed because the data is not available via a normal
filesystem backup. Renames are dealt with by requiring userspace to freeze
and/or safely take a snapshot of the filesystem as with any backup.

> This can be avoided by ensuring every checkpointed application is
> specially "c/r aware", but that makes the feature a lot less
> attractive, as well as uncomfortably unsafe to use on arbitrary

We avoided using that solution for the very flaws you point out.
In fact, so far we've managed to avoid requiring cooperation with
the tasks being checkpointed.

> processes.  Ideally, c/r would fail on some types of process
> (e.g. using sockets), but at least fail in a safe way that does not
> lead to quiet data corruption.

We've done our best to try and reach that ideal. You're welcome to have a
look at the code to see if you can find any ways in which we haven't.
Here's the code that refuses to checkpoint unsupported files. I think
it's pretty easy to read:

int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
{
        struct file *file = (struct file *) ptr;
        int ret;

        if (!file->f_op || !file->f_op->checkpoint) {
                ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
                               file, file->f_op);
                return -EBADF;
        }

        if (is_dnotify_attached(file)) {
                ckpt_err(ctx, -EBADF, "%(T)%(P)dnotify unsupported\n", file);
                return -EBADF;
        }

        ret = file->f_op->checkpoint(ctx, file);
        if (ret < 0)
                ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
        return ret;
}

(As Serge noted, we don't support inotify. inotify and fanotify require
an fd to register the fsnotify marks, and the struct file associated with
that fd lacks the f_op->checkpoint operation. Hence that will cause
checkpoint to fail too and, again, there will be no silent corruption.)
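
The refusal pattern in checkpoint_file() can be sketched in userspace as
follows. This is only an illustration under stated assumptions: the demo_*
names are hypothetical stand-ins, not the kernel's struct file or f_op, and
the "generic" operation merely sets a flag instead of writing an image.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's struct file or f_op. */
struct demo_file;

struct demo_file_ops {
	/* NULL means "this kind of file cannot be checkpointed". */
	int (*checkpoint)(struct demo_file *file);
};

struct demo_file {
	const struct demo_file_ops *ops;
	int saved;	/* set once state has been recorded */
};

static int demo_generic_checkpoint(struct demo_file *file)
{
	file->saved = 1;
	return 0;
}

/* An ops table that opted in to checkpointing. */
static const struct demo_file_ops demo_regular_ops = {
	.checkpoint = demo_generic_checkpoint,
};

/* An ops table that never opted in, like the notify fds mentioned above. */
static const struct demo_file_ops demo_notify_ops = {
	.checkpoint = NULL,
};

/* Refuse up front rather than emit an image that cannot be restored. */
static int demo_checkpoint_file(struct demo_file *file)
{
	assert(file != NULL);
	if (!file->ops || !file->ops->checkpoint)
		return -EBADF;
	return file->ops->checkpoint(file);
}
```

Calling demo_checkpoint_file() on a file that uses demo_notify_ops returns
-EBADF and leaves the file untouched, which is the "fail loudly, never
corrupt silently" behaviour described above.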

Negative return values cause sys_checkpoint() to stop checkpointing and
return the given errno. The f_op->checkpoint is often a generic operation
which ensures that the file is not unlinked before it saves things like
the position of the file (checkpoint_file_common()) and the path to the file
(checkpoint_fname()):

int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
{
        struct ckpt_hdr_file_generic *h;
        int ret;

        /*
         * FIXME: when we'll add support for unlinked files/dirs, we'll
         * need to distinguish between unlinked files and unlinked dirs.
         */
        if (d_unlinked(file->f_dentry)) {
                ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
                         file);
                return -EBADF;
        }

        h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
        if (!h)
                return -ENOMEM;

        h->common.f_type = CKPT_FILE_GENERIC;

        ret = checkpoint_file_common(ctx, file, &h->common);
        if (ret < 0)
                goto out;
        ret = ckpt_write_obj(ctx, &h->common.h);
        if (ret < 0)
                goto out;
        ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
 out:
        ckpt_hdr_put(ctx, h);
        return ret;
}
EXPORT_SYMBOL(generic_file_checkpoint);

I wrote a simple script to look for missing operations in things like
file_operations. It can output counts in directories/files or show the
spot in the files where the struct is defined and a little context.
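
The actual script lives on the wiki and is not reproduced here. As a rough
idea of what such a check involves, here is a minimal sketch in C, assuming a
purely textual scan: count_missing_checkpoint() is a hypothetical helper that
counts "struct file_operations ... = { ... };" initializers in a source
buffer that never mention ".checkpoint". It does no preprocessing and would
be fooled by, for example, a function taking a struct file_operations
parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count file_operations initializers in src that lack ".checkpoint".
 * Assumes each initializer ends at the first "};" after its "{".
 */
static int count_missing_checkpoint(const char *src)
{
	static const char tag[] = "struct file_operations";
	const char *p = src;
	int missing = 0;

	assert(src != NULL);
	while ((p = strstr(p, tag)) != NULL) {
		const char *open, *close, *semi, *hit;

		p += sizeof(tag) - 1;
		open = strchr(p, '{');
		semi = strchr(p, ';');
		/* A ';' before any '{' is a declaration, not a definition. */
		if (!open || (semi && semi < open))
			continue;
		close = strstr(open, "};");
		if (!close)
			break;
		hit = strstr(open, ".checkpoint");
		if (!hit || hit > close)
			missing++;
		p = close;
	}
	return missing;
}
```

Feeding it the contents of one source file yields that file's count; summing
per directory would give a histogram of the kind shown further down.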

I used that script to check which files and protocols aren't supported
(for 2.6.33-rc8). I placed a histogram of the output in the wiki, and I've
tried to keep it up-to-date.

https://ckpt.wiki.kernel.org/index.php/UncheckpointableFilesystems
https://ckpt.wiki.kernel.org/index.php/UncheckpointableProtocols

The script is also there for anyone who wants to use it on newer kernels.
Here's the output which is of interest to folks on linux-fsdevel for anyone
who doesn't wish to follow a link -- the number of file_operations
structures missing the .checkpoint operation:

    162 arch
      3 block
      1 crypto
      1 Documentation
    718 drivers
    178 fs
              3 9p
              8 afs
              1 autofs
              3 autofs4
              1 bad_inode.c
              3 binfmt_misc.c
              1 block_dev.c
              2 cachefiles
              1 char_dev.c
             15 cifs
              4 coda
              2 configfs
              3 debugfs
              8 dlm
              1 ext4
              1 fifo.c
              1 filesystems.c
              3 fscache
              9 fuse
              5 gfs2
              1 hugetlbfs
              1 jbd2
              6 jfs
              1 libfs.c
              1 locks.c
              2 ncpfs
              2 nfs
              5 nfsd
              1 no-block.c
              1 notify
              1 ntfs
             15 ocfs2
             55 proc
              1 reiserfs
              1 signalfd.c
              2 smbfs
              3 sysfs
              1 timerfd.c
              3 xfs
      1 include
      4 ipc
     88 kernel
      3 lib
     12 mm
    164 net
      1 samples
     35 security
     29 sound
      4 virt

  Notes:
   1. The missing checkpoint file operation in fs/fifo.c is only an artifact of
	the unusual way fifo file ops are assigned. FIFOs are supported.
   2. The ext4 missing file operation is for the multiblock groups file in /proc.
	IMHO trying to checkpoint the contents of /proc files is usually a bad
	idea. Thankfully, most programs don't hold these files open for very
	long.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]             ` <4BA68884.3080003-GANU6spQydw@public.gmane.org>
  2010-03-21 21:36               ` Oren Laadan
@ 2010-03-22  2:12               ` Matt Helsley
  1 sibling, 0 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-22  2:12 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Jamie Lokier, Andreas Dilger

On Sun, Mar 21, 2010 at 09:58:44PM +0100, Daniel Lezcano wrote:
> Serge E. Hallyn wrote:
> > Quoting Jamie Lokier (jamie-yetKDKU6eevNLxjTenLetw@public.gmane.org):
> >   
> >> Matt Helsley wrote:
> >>     
> >>>> That said, if the intent is to allow the restore to be done on
> >>>> another node with a "similar" filesystem (e.g. created by rsync/node
> >>>> image), instead of having a coherent distributed filesystem on all
> >>>> of the nodes then the filename makes sense.
> >>>>         
> >>> Yes, this is the intent.
> >>>       
> >> I would worry about programs which are using files which have been
> >> deleted, renamed, or (very common) renamed-over by another process
> >> after being opened, as there's a good chance they will successfully
> >> open the wrong file after c/r, and corrupt state from then on.
> >>     
> >
> > Userspace is expected to back up and restore the filesystem, for
> > instance using a btrfs snapshot or a simple rsync or tar.
> >
> >   
> That does not solve the problem Jamie is talking about.
> A rsync or a tar will not see a deleted file and using a btrfs to have 
> the CR to work with the deleted files is a bit overkill, no ?

These are the same kinds of problems encountered during backup. You
can play fast and loose -- like taking a backup while everything is
running -- or you can play it conservative and freeze things.

I think btrfs snapshots are just one possible solution and it's not
overkill.

For some filesystems it might make sense to use the filesystem freezer to
ensure that no files are deleted while the backup takes place. Combined
with tools like rsync or rdiff-backup these operations could be low bandwidth
and low latency if well-known live-migration techniques are used.

Or use dm snapshots.

I imagine fanotify could also be useful so long as userspace has marked
things correctly prior to checkpoint. My high-level understanding of
fanotify is that we'd be able to delay (or deny) deletion until checkpoint
is complete.

Or if using fanotify is unacceptable, at the very least we could use
inotify to know when a file needed for restart has been deleted. It might
go something like:

start watching files/dirs needed (fanotify or inotify)
	Delay/deny changes (fanotify ONLY)
freeze tasks for checkpoint
freeze filesystem contents:
	take btrfs snapshots OR
	take dm snapshots OR
	use filesystem freezer OR
backup filesystem contents
sys_checkpoint
check for changes to the filesystem contents and report failure if they
	interfere with restart (inotify ONLY)
thaw filesystem contents
thaw tasks

So there are lots of possible solutions and they don't all involve trying to
stop the whole VFS or the whole machine. They also don't require anything
more in-kernel than what's already being pushed (our patchset, Eric Paris'
patchset for the optional fanotify idea).
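
The inotify variant of the sequence above (watch, checkpoint, then check for
interfering changes) might look roughly like this from userspace. It is a
sketch under assumptions, not part of the patchset: watch_dir(),
name_disturbed() and demo_watch() are hypothetical names, the temporary
directory and "state.db" file are made up for the demo, and the freeze,
snapshot and sys_checkpoint steps are reduced to a comment.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Watch dir for deletions and renames; returns an inotify fd or -1. */
static int watch_dir(const char *dir)
{
	int fd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);

	if (fd < 0)
		return -1;
	if (inotify_add_watch(fd, dir,
			      IN_DELETE | IN_MOVED_FROM | IN_MOVED_TO) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Drain queued events. Returns 1 if an entry called name was deleted,
 * renamed away, or renamed over since watch_dir(); 0 if not; -1 on error.
 */
static int name_disturbed(int wfd, const char *name)
{
	char buf[4096]
		__attribute__((aligned(__alignof__(struct inotify_event))));
	int found = 0;
	ssize_t len;

	assert(name != NULL);
	while ((len = read(wfd, buf, sizeof(buf))) > 0) {
		for (char *p = buf; p < buf + len; ) {
			const struct inotify_event *ev = (const void *)p;

			if (ev->len > 0 && strcmp(ev->name, name) == 0)
				found = 1;
			p += sizeof(*ev) + ev->len;
		}
	}
	if (len < 0 && errno != EAGAIN)
		return -1;
	return found;
}

/* Hypothetical scenario: optionally delete a needed file mid-"checkpoint". */
static int demo_watch(int delete_it)
{
	char dir[] = "/tmp/cr-watch-XXXXXX";
	char path[64];
	int fd, wfd, ret;

	if (!mkdtemp(dir))
		return -1;
	snprintf(path, sizeof(path), "%s/state.db", dir);
	fd = open(path, O_CREAT | O_WRONLY, 0600);
	wfd = watch_dir(dir);
	if (fd < 0 || wfd < 0)
		return -1;
	close(fd);
	/* Freezing tasks, snapshotting and sys_checkpoint would go here. */
	if (delete_it)
		unlink(path);
	ret = name_disturbed(wfd, "state.db");
	close(wfd);
	unlink(path);
	rmdir(dir);
	return ret;
}
```

With these definitions demo_watch(1) reports the deletion and demo_watch(0)
reports a quiet directory. Note that inotify can only report the change after
the fact; delaying or denying it is the part that would need fanotify.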

> I have another question about the deleted files. How is handled the case 
> when a process has a deleted mapped file but without an associated file 
> descriptor ?

The VMA for a mapped file holds a reference to its struct file. When
checkpoint walks the VMAs, that struct file is visited just like the struct
files reached from file descriptors.
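
The situation Daniel describes is easy to reproduce from userspace. The
sketch below (hypothetical helper names, a made-up temporary file, and the
assumption that /proc is mounted) maps a file, closes the descriptor, unlinks
the name, and then looks for the mapping in /proc/self/maps, where Linux
marks such mappings with "(deleted)".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if /proc/self/maps lists a mapping whose path contains name
 * and is marked "(deleted)", 0 if not, -1 if maps cannot be read.
 */
static int mapping_is_deleted(const char *name)
{
	char line[4096];
	int found = 0;
	FILE *maps;

	assert(name != NULL);
	maps = fopen("/proc/self/maps", "r");
	if (!maps)
		return -1;
	while (fgets(line, sizeof(line), maps)) {
		if (strstr(line, name) && strstr(line, "(deleted)"))
			found = 1;
	}
	fclose(maps);
	return found;
}

/* Map a temporary file, drop the fd, unlink the name, inspect the maps. */
static int demo_mapped_then_unlinked(void)
{
	char path[] = "/tmp/cr-map-XXXXXX";
	const char *name;
	void *map;
	int fd, ret;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	name = strrchr(path, '/') + 1;
	if (ftruncate(fd, 4096) < 0) {
		close(fd);
		unlink(path);
		return -1;
	}
	map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);	/* from here on only the mapping pins the file */
	unlink(path);
	if (map == MAP_FAILED)
		return -1;
	ret = mapping_is_deleted(name);
	munmap(map, 4096);
	return ret;
}
```

No file descriptor refers to the file when mapping_is_deleted() runs, yet the
mapping is still listed, which matches the point that the file is reachable
through the VMA alone.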

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]           ` <20100322010606.GG2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-22  2:20             ` Jamie Lokier
  2010-03-22  2:55             ` Serge E. Hallyn
  1 sibling, 0 replies; 88+ messages in thread
From: Jamie Lokier @ 2010-03-22  2:20 UTC (permalink / raw)
  To: Matt Helsley
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andreas Dilger,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Matt Helsley wrote:
> On Sun, Mar 21, 2010 at 05:27:03PM +0000, Jamie Lokier wrote:
> > Matt Helsley wrote:
> > > > That said, if the intent is to allow the restore to be done on
> > > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > > image), instead of having a coherent distributed filesystem on all
> > > > of the nodes then the filename makes sense.
> > > 
> > > Yes, this is the intent.
> > 
> > I would worry about programs which are using files which have been
> > deleted, renamed, or (very common) renamed-over by another process
> > after being opened, as there's a good chance they will successfully
> > open the wrong file after c/r, and corrupt state from then on.
> 
> The code in the patches does check for unlinked files and refuses
> to checkpoint if an unlinked file is open. Yes, this limits the usefulness
> of the code somewhat but it's a problem we can solve and c/r is still quite
> useful without the solution.
> 
> We've done our best to try and reach that ideal. You're welcome to have a
> look at the code to see if you can find any ways in which we haven't.
> Here's the code that refuses to checkpoint unsupported files. I think
> it's pretty easy to read:

From a very quick read, 

>         if (d_unlinked(file->f_dentry)) {
>                 ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
>                          file);

Hmm.

I wonder if d_unlinked() is always true for a file which is opened,
unlinked or renamed over, but has a hard link to it from elsewhere so
the on-disk file hasn't gone away.

I guess it probably is.  That's kinda neat!  I'd hoped there would be a
good reason for f_dentry eventually ;-)

What about files opened through /proc/self/fd/N before or after the
original file was unlinked/renamed-over.  Where does the dentry point?

-- Jamie

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]           ` <20100322010606.GG2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2010-03-22  2:20             ` Jamie Lokier
@ 2010-03-22  2:55             ` Serge E. Hallyn
  1 sibling, 0 replies; 88+ messages in thread
From: Serge E. Hallyn @ 2010-03-22  2:55 UTC (permalink / raw)
  To: Matt Helsley
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andreas Dilger,
	Jamie Lokier,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> On Sun, Mar 21, 2010 at 05:27:03PM +0000, Jamie Lokier wrote:
> > Matt Helsley wrote:
> > > > That said, if the intent is to allow the restore to be done on
> > > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > > image), instead of having a coherent distributed filesystem on all
> > > > of the nodes then the filename makes sense.
> > > 
> > > Yes, this is the intent.
> > 
> > I would worry about programs which are using files which have been
> > deleted, renamed, or (very common) renamed-over by another process
> > after being opened, as there's a good chance they will successfully
> > open the wrong file after c/r, and corrupt state from then on.
> 
> The code in the patches does check for unlinked files and refuses
> to checkpoint if an unlinked file is open. Yes, this limits the usefulness

Oh, haha - open/mapped unlinked files.  Sorry  :)

-serge

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]             ` <20100322022003.GA16462-yetKDKU6eevNLxjTenLetw@public.gmane.org>
@ 2010-03-22  3:37               ` Matt Helsley
  0 siblings, 0 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-22  3:37 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andreas Dilger,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Mon, Mar 22, 2010 at 02:20:03AM +0000, Jamie Lokier wrote:
> Matt Helsley wrote:
> > On Sun, Mar 21, 2010 at 05:27:03PM +0000, Jamie Lokier wrote:
> > > Matt Helsley wrote:
> > > > > That said, if the intent is to allow the restore to be done on
> > > > > another node with a "similar" filesystem (e.g. created by rsync/node
> > > > > image), instead of having a coherent distributed filesystem on all
> > > > > of the nodes then the filename makes sense.
> > > > 
> > > > Yes, this is the intent.
> > > 
> > > I would worry about programs which are using files which have been
> > > deleted, renamed, or (very common) renamed-over by another process
> > > after being opened, as there's a good chance they will successfully
> > > open the wrong file after c/r, and corrupt state from then on.
> > 
> > The code in the patches does check for unlinked files and refuses
> > to checkpoint if an unlinked file is open. Yes, this limits the usefulness
> > of the code somewhat but it's a problem we can solve and c/r is still quite
> > useful without the solution.
> > 
> > We've done our best to try and reach that ideal. You're welcome to have a
> > look at the code to see if you can find any ways in which we haven't.
> > Here's the code that refuses to checkpoint unsupported files. I think
> > it's pretty easy to read:
> 
> From a very quick read, 
> 
> >         if (d_unlinked(file->f_dentry)) {
> >                 ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
> >                          file);
> 
> Hmm.
> 
> I wonder if d_unlinked() is always true for a file which is opened,
> unlinked or renamed over, but has a hard link to it from elsewhere so
> the on-disk file hasn't gone away.

Well, if the on-disk file hasn't gone away due to a hardlink then we
won't need to save the file in the checkpoint image -- the filesystem
content backup done during checkpoint should also get the file contents.

> 
> I guess it probably is.  That's kinda neat!  I'd hoped there would be a
> good reason for f_dentry eventually ;-)
> 
> What about files opened through /proc/self/fd/N before or after the
> original file was unlinked/renamed-over.  Where does the dentry point?

Before the unlink it will result in the same file being opened. If it's
opened by a task being checkpointed then we'll be in the same situation
as the "self" task. If it's opened by a task not being checkpointed then
the "leak detection" code will notice that there's an unaccounted reference
to the file and checkpoint will fail.

That code is in checkpoint/objhash.c. It works by doing two passes:
	1. Collect references
	2. Checkpoint referenced objects

We only do the second pass if the ref count matches the number of times
we've "collected" the file (I added comments to the .ref_foo = ops so you
don't need to see them to get the idea):

static struct ckpt_obj_ops ckpt_obj_ops[] = {
	...
        /* file object */
        {
                .obj_name = "FILE",
                .obj_type = CKPT_OBJ_FILE,
                .ref_drop = obj_file_drop, /* aka fput */
                .ref_grab = obj_file_grab, /* aka get_file */
                .ref_users = obj_file_users, /* does atomic read of f_count */
                .checkpoint = checkpoint_file,
                .restore = restore_file,
        },
	...
};
...
/**
 * ckpt_obj_contained - test if shared objects are contained in checkpoint
 * @ctx: checkpoint context
 *
 * Loops through all objects in the table and compares the number of
 * references accumulated during checkpoint, with the reference count
 * reported by the kernel.
 *
 * Return 1 if respective counts match for all objects, 0 otherwise.
 */
int ckpt_obj_contained(struct ckpt_ctx *ctx)
{
        struct ckpt_obj *obj;
        struct hlist_node *node;

        /* account for ctx->{file,logfile} (if in the table already) */
        ckpt_obj_users_inc(ctx, ctx->file, 1);
        if (ctx->logfile)
                ckpt_obj_users_inc(ctx, ctx->logfile, 1);
        /* account for ctx->root_nsproxy (if in the table already) */
        ckpt_obj_users_inc(ctx, ctx->root_nsproxy, 1);

        hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
                if (!obj->ops->ref_users)
                        continue;

                if (obj->ops->obj_type == CKPT_OBJ_SOCK)
                        obj_sock_adjust_users(obj);

                if (obj->ops->ref_users(obj->ptr) != obj->users) {
                        ckpt_err(ctx, -EBUSY,
                                 "%(O)%(P)%(S)Usage leak (%d != %d)\n",
                                 obj->objref, obj->ptr, obj->ops->obj_name,
                                 obj->ops->ref_users(obj->ptr), obj->users);
                        return 0;
                }
        }

        return 1;
}
...

So that hopefully addresses your questions regarding the use of the symlinks
before the unlink.
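
Stripped of the kernel details, the containment check boils down to the
comparison below. This is only a simplified userspace sketch: struct obj,
collect() and objs_contained() are made-up names, and the real code
walks a hash table and goes through the per-type ref_users op rather
than reading a plain field.

```c
#include <stddef.h>

/* Hypothetical stand-in for a shared kernel object such as struct file. */
struct obj {
	int refcount;	/* total references currently held on the object */
	int collected;	/* references found while walking checkpointed tasks */
};

/* Pass 1: called once for every reference found inside the container. */
static void collect(struct obj *o)
{
	o->collected++;
}

/*
 * Gate for pass 2: returns 1 if every object is referenced only from
 * inside the checkpointed set, 0 if some object has a reference we never
 * collected (a "leak" to a task outside the checkpoint).
 */
static int objs_contained(const struct obj *objs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (objs[i].refcount != objs[i].collected)
			return 0;
	return 1;
}
```

A file opened by an outside task via /proc/<pid>/fd/N bumps the count
on the left side of that comparison without ever being collected, which
is why the checkpoint is refused in that case.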

After the unlink those symlinks are broken since they have "(deleted)"
appended. Making sure they are broken after restart is one detail I've
thought about. To make it perfect I think we could:

	1. Move any existing file at the original symlinked path to a
		temporary location.
	2. Restore the "unlinked" file to that location.
		(in quotes since it's not unlinked yet)
	3. Open the "unlinked" file.
	4. Unlink the file again.
	5. Move the existing file back from the temporary location.

As with relinking, we need a good way to do the "temporary location".
That is complicated because we need to choose a location that we have
permission to write to, always exists during restart, and is guaranteed
not to have files in it. Relinking the file shifts these problems from
restart to checkpoint.
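
The five steps above can be sketched from userspace roughly as follows.
This is an illustration under assumptions, not what the patches do:
restore_unlinked() is a made-up name, the caller supplies the temporary
name (which sidesteps the hard part), the saved contents are assumed to
be available in memory, and I read step 2 as restoring the file at the
original path.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Re-create an open-but-unlinked file at `path` from saved contents.
 * `tmp` is a caller-chosen scratch name on the same filesystem.
 * Returns an fd referring to the (again unlinked) file, or -1 on error.
 */
static int restore_unlinked(const char *path, const char *tmp,
			    const void *data, size_t len)
{
	int moved = 0;
	int fd;

	/* 1. Move any file now living at the original path out of the way. */
	if (access(path, F_OK) == 0) {
		if (rename(path, tmp) < 0)
			return -1;
		moved = 1;
	}

	/* 2 + 3. Restore the contents at the original path, keep it open. */
	fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd >= 0) {
		int ok = write(fd, data, len) == (ssize_t)len;

		/* 4. Unlink it again, whether or not the write worked. */
		if (unlink(path) < 0)
			ok = 0;
		if (!ok) {
			close(fd);
			fd = -1;
		}
	}

	/* 5. Move the displaced file back to its original name. */
	if (moved && rename(tmp, path) < 0) {
		if (fd >= 0)
			close(fd);
		return -1;
	}
	return fd;
}
```

Note the sketch is racy against other writers of `path` and leaves the
file offset at the end of the restored data; a real restart would also
have to put f_pos and the file's metadata back.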

In case you're bored, before Oren posted these patches I wrote:
	https://ckpt.wiki.kernel.org/index.php/Checklist/UnlinkedFiles

and there's lots of info related to what we do and don't support,
much of it related to files in one way or another, in the table at:
	https://ckpt.wiki.kernel.org/index.php/Checklist

I'll update that page with some of my responses above. Getting your
thoughts on my ideas outlined above would be excellent. If you've
got some counter proposals I'd be happy to hear them too. I'll add
a reference to this thread and an edited collection of my rambling
responses to that page if you like.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public
       [not found]   ` <1268960401-16680-2-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-22  6:31     ` Nick Piggin
  0 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2010-03-22  6:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-fsdevel, containers, Andreas Dilger

On Thu, Mar 18, 2010 at 08:59:45PM -0400, Oren Laadan wrote:
> These two are used in the next patch when calling vfs_read/write()

Said next patch didn't seem to make it to fsdevel.

Should it at least go to fs/internal.h?

> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> Acked-by: Serge E. Hallyn <serue@us.ibm.com>
> ---
>  fs/read_write.c    |   10 ----------
>  include/linux/fs.h |   10 ++++++++++
>  2 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index b7f4a1f..e258301 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
>  
>  EXPORT_SYMBOL(vfs_write);
>  
> -static inline loff_t file_pos_read(struct file *file)
> -{
> -	return file->f_pos;
> -}
> -
> -static inline void file_pos_write(struct file *file, loff_t pos)
> -{
> -	file->f_pos = pos;
> -}
> -
>  SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
>  {
>  	struct file *file;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ebb1cd5..6c08df2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1543,6 +1543,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
>  				struct iovec *fast_pointer,
>  				struct iovec **ret_pointer);
>  
> +static inline loff_t file_pos_read(struct file *file)
> +{
> +	return file->f_pos;
> +}
> +
> +static inline void file_pos_write(struct file *file, loff_t pos)
> +{
> +	file->f_pos = pos;
> +}
> +
>  extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
>  extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
>  extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> -- 
> 1.6.3.3

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect()
       [not found]   ` <1268960401-16680-3-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-22  6:34     ` Nick Piggin
  0 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2010-03-22  6:34 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-fsdevel, containers, Andreas Dilger

On Thu, Mar 18, 2010 at 08:59:46PM -0400, Oren Laadan wrote:
> While we assume all normal files and directories can be checkpointed,
> there are, as usual in the VFS, specialized places that will always
> need an ability to override these defaults. Although we could do this
> completely in the checkpoint code, that would bitrot quickly.
> 
> This adds a new 'file_operations' function for checkpointing a file.
> It is assumed that there should be a dirt-simple way to make something
> (un)checkpointable that fits in with current code.
> 
> As you can see in the ext[234] patches down the road, all that we have
> to do to make something simple be supported is add a single "generic"
> f_op entry.
> 
> Also adds a new 'file_operations' function for 'collecting' a file for
> leak-detection during full-container checkpoint. This is useful for
> those files that hold references to other "collectable" objects. Two
> examples are pty files that point to corresponding tty objects, and
> eventpoll files that refer to the files they are monitoring.
> 
> Finally, this patch introduces vfs_fcntl() so that it can be called
> from restart (see patch adding restart of files).
> 
> Changelog[v17]
>   - Introduce 'collect' method
> Changelog[v17]
>   - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> Acked-by: Serge E. Hallyn <serue@us.ibm.com>
> Tested-by: Serge E. Hallyn <serue@us.ibm.com>
> ---
>  fs/fcntl.c         |   21 +++++++++++++--------
>  include/linux/fs.h |    7 +++++++
>  2 files changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 97e01dc..e1f02ca 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -418,6 +418,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>  	return err;
>  }
>  
> +int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
> +{
> +	int err;
> +
> +	err = security_file_fcntl(filp, cmd, arg);
> +	if (err)
> +		goto out;
> +	err = do_fcntl(fd, cmd, arg, filp);
> + out:
> +	return err;
> +}
> +
>  SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
>  {	
>  	struct file *filp;
> @@ -427,14 +439,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
>  	if (!filp)
>  		goto out;
>  
> -	err = security_file_fcntl(filp, cmd, arg);
> -	if (err) {
> -		fput(filp);
> -		return err;
> -	}
> -
> -	err = do_fcntl(fd, cmd, arg, filp);
> -
> +	err = vfs_fcntl(fd, cmd, arg, filp);
>   	fput(filp);
>  out:
>  	return err;

There is no point combining these two logically distinct patches.


> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6c08df2..65ebec5 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -394,6 +394,7 @@ struct kstatfs;
>  struct vm_area_struct;
>  struct vfsmount;
>  struct cred;
> +struct ckpt_ctx;
>  
>  extern void __init inode_init(void);
>  extern void __init inode_init_early(void);
> @@ -1093,6 +1094,8 @@ struct file_lock {
>  
>  #include <linux/fcntl.h>
>  
> +extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
> +
>  extern void send_sigio(struct fown_struct *fown, int fd, int band);
>  
>  #ifdef CONFIG_FILE_LOCKING
> @@ -1504,6 +1507,8 @@ struct file_operations {
>  	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
>  	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
>  	int (*setlease)(struct file *, long, struct file_lock **);
> +	int (*checkpoint)(struct ckpt_ctx *, struct file *);
> +	int (*collect)(struct ckpt_ctx *, struct file *);
>  };
>  
>  struct inode_operations {

You didn't add any documentation for this (unless it is in a following
patch, which it shouldn't be).

> @@ -2313,6 +2318,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
>  loff_t inode_get_bytes(struct inode *inode);
>  void inode_set_bytes(struct inode *inode, loff_t bytes);
>  
> +#define generic_file_checkpoint NULL
> +
>  extern int vfs_readdir(struct file *, filldir_t, void *);
>  
>  extern int vfs_stat(char __user *, struct kstat *);

Hmm, what does generic_file_checkpoint mean? A NULL checkpoint op means
that checkpointing is allowed, and no action is required? Shouldn't it
be an opt-in operation, where NULL means not allowed?

Either way, I don't know if you need to have this #define, provided you
have sufficient documentation.
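
To make the opt-in alternative concrete, here is a small userspace
sketch. Everything in it is hypothetical (the *_sketch and *_optin
names, the cut-down structs, and the choice of -EOPNOTSUPP); it only
shows the shape of a dispatch where a NULL ->checkpoint means "not
allowed" and a filesystem has to name a real function to be supported:

```c
#include <errno.h>
#include <stddef.h>

struct ckpt_ctx;	/* opaque in this sketch */
struct file;

/* Cut-down stand-in for struct file_operations. */
struct file_ops_sketch {
	int (*checkpoint)(struct ckpt_ctx *, struct file *);
};

/* Cut-down stand-in for the kernel's struct file. */
struct file {
	const struct file_ops_sketch *f_op;
};

/* What a filesystem would put in its f_op table to opt in. */
static int generic_file_checkpoint_sketch(struct ckpt_ctx *ctx,
					   struct file *file)
{
	(void)ctx;
	(void)file;
	return 0;	/* nothing beyond the generic file state to save */
}

/* Opt-in dispatch: a missing op refuses the checkpoint. */
static int checkpoint_file_optin(struct ckpt_ctx *ctx, struct file *file)
{
	if (!file->f_op || !file->f_op->checkpoint)
		return -EOPNOTSUPP;	/* errno picked for illustration */
	return file->f_op->checkpoint(ctx, file);
}
```

With that shape, a file type nobody has looked at fails the checkpoint
by default instead of being silently treated as supported.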

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]               ` <4BA6914D.8040007-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2010-03-21 23:31                 ` xing lin
@ 2010-03-22  8:40                 ` Daniel Lezcano
  1 sibling, 0 replies; 88+ messages in thread
From: Daniel Lezcano @ 2010-03-22  8:40 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-fsdevel, containers, Jamie Lokier, Andreas Dilger

Oren Laadan wrote:
>
>
> Daniel Lezcano wrote:
>> Serge E. Hallyn wrote:
>>> Quoting Jamie Lokier (jamie@shareable.org):
>>>  
>>>> Matt Helsley wrote:
>>>>    
>>>>>> That said, if the intent is to allow the restore to be done on
>>>>>> another node with a "similar" filesystem (e.g. created by rsync/node
>>>>>> image), instead of having a coherent distributed filesystem on all
>>>>>> of the nodes then the filename makes sense.
>>>>>>         
>>>>> Yes, this is the intent.
>>>>>       
>>>> I would worry about programs which are using files which have been
>>>> deleted, renamed, or (very common) renamed-over by another process
>>>> after being opened, as there's a good chance they will successfully
>>>> open the wrong file after c/r, and corrupt state from then on.
>>>>     
>>> Userspace is expected to back up and restore the filesystem, for
>>> instance using a btrfs snapshot or a simple rsync or tar.
>>>
>>>   
>> That does not solve the problem Jamie is talking about.
>> A rsync or a tar will not see a deleted file and using a btrfs to 
>> have the CR to work with the deleted files is a bit overkill, no ?
>
> Let's separate the issues of file system snapshot and deleted files.
>
> 1) File system snapshot:
> ------------------------
> The requirement is to preserve the file system state between the time
> of the checkpoint and the time of the restart, because userspace will
> expect it to remain the same.
>
> The alternatives are:
>
> a) Use a capable file system, like btrfs, or (modified) nilfs.
>
> b) Userspace saves the state e.g. w/ tar or rsync (maybe incremental)
>
> c) Assume/expect that the file system isn't modified between checkpoint
> and restart (e.g. if we use c/r to suspend a user's session)
>
> d) Expect userspace to adapt to changes if they occur, e.g. by having
> the application be aware of the possibility, or by providing a wrapper
> that will do some magic prior to restart (by looking at the checkpoint
> image).
>
> Options a,b,c are all transparent to the application, while option
> d requires that applications become aware of c/r. That's ok, but our
> primary goal is to be generic enough to support unmodified applications.
>
> 2) Deleted files:
> -----------------
> The requirement is that at restart we'll be able to restore the file
> pointer in the kernel to a deleted file with the same properties and
> contents as it had at the time of the checkpoint.
>
> The alternatives we considered are:
>
> e) For each deleted file, save the contents of that file as part of
> the checkpoint image;
> At restart - create a new file, populate with the contents, open it
> (to get an active file pointer), and finally unlink it, so it is -
> again - deleted.
>
> f) At checkpoint time, create a file (from scratch) in a dedicated
> area of the file system (userspace configurable?), and copy the
> contents of the deleted file to this file. Only save the file system
> state after this is done.
> At restart, open the alternative file instead, and then immediately
> delete it.
>
> g) At checkpoint time, re-link the file to a dedicated area of the
> file system. This requires support from the underlying file system,
> of course. For instance, it's trivial for ext2,3 but IIRC will need
> help for ext4. Re-linking is essentially attaching a new filename
> to an existing inode that is still referenced but is otherwise not
> reachable - and make it reachable again.
> At restart, open the re-linked file and then immediately delete it.
>
>> I have another question about the deleted files. How is the case
>> handled where a process has a deleted mapped file but no associated
>> file descriptor?
>>
>
> It works the same as with non-deleted files (assuming that we know
> how to handle deleted files in general, e.g. options e,f,g above):
>
> To checkpoint a task's mm we loop through the vma's and checkpoint
> them. For a vma that corresponds to a mapped file, we first save
> the vma->vm_file. In turn, for a file pointer we save the filename,
> properties, credentials. A file pointer is saved as an independent
> object - and is assigned a unique id - objref. The state of the vma
> will indicate this objref.
>
> At restart, we will first see the file pointer object, and will
> open the file to create a corresponding file pointer. Later when
> we restore the vma, we'll locate the (new) file pointer using the
> objref and use it in mmap.
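
The objref scheme described above can be sketched roughly as follows. This
is a simplified userspace model, not the objhash code from the series: the
fixed-size table and the names objtab, obj_lookup_add(), obj_insert() and
obj_fetch() are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 64	/* hypothetical fixed capacity, enough for a sketch */

struct objtab {
	const void *ptr[MAX_OBJS];	/* slot i holds the object with objref i+1 */
	int n;
};

/*
 * Checkpoint side: return the objref for 'ptr', assigning a new one the
 * first time the pointer is seen. '*first' tells the caller whether the
 * object's state still has to be written to the image.
 */
int obj_lookup_add(struct objtab *t, const void *ptr, int *first)
{
	int i;

	assert(t && first);
	for (i = 0; i < t->n; i++) {
		if (t->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (t->n == MAX_OBJS)
		return -1;
	t->ptr[t->n++] = ptr;
	*first = 1;
	return t->n;
}

/* Restart side: remember the newly created object under the saved objref. */
int obj_insert(struct objtab *t, int objref, const void *newptr)
{
	if (objref < 1 || objref > MAX_OBJS)
		return -1;
	t->ptr[objref - 1] = newptr;
	if (objref > t->n)
		t->n = objref;
	return 0;
}

/* Restart side: find the new object for an objref read from the image. */
const void *obj_fetch(const struct objtab *t, int objref)
{
	if (objref < 1 || objref > MAX_OBJS)
		return NULL;
	return t->ptr[objref - 1];
}
```

At checkpoint, a vma that maps a file would store only the value returned
by obj_lookup_add(). At restart, the file object is read first and
registered with obj_insert(), so the later vma record can resolve its
objref with obj_fetch() and hand the new file pointer to mmap.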
>
> Oren.
>

Thanks Oren for the detailed answer.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect()
  2010-03-22  6:34   ` Nick Piggin
  2010-03-22 10:16     ` Matt Helsley
@ 2010-03-22 10:16     ` Matt Helsley
  1 sibling, 0 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-22 10:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Andreas Dilger

On Mon, Mar 22, 2010 at 05:34:28PM +1100, Nick Piggin wrote:
> On Thu, Mar 18, 2010 at 08:59:46PM -0400, Oren Laadan wrote:
> > While we assume all normal files and directories can be checkpointed,
> > there are, as usual in the VFS, specialized places that will always
> > need an ability to override these defaults. Although we could do this
> > completely in the checkpoint code, that would bitrot quickly.
> > 
> > This adds a new 'file_operations' function for checkpointing a file.
> > It is assumed that there should be a dirt-simple way to make something
> > (un)checkpointable that fits in with current code.
> > 
> > As you can see in the ext[234] patches down the road, all that we have
> > to do to make something simple be supported is add a single "generic"
> > f_op entry.
> > 
> > Also adds a new 'file_operations' function for 'collecting' a file for
> > leak-detection during full-container checkpoint. This is useful for
> > those files that hold references to other "collectable" objects. Two
> > examples are pty files that point to corresponding tty objects, and
> > eventpoll files that refer to the files they are monitoring.
> > 
> > Finally, this patch introduces vfs_fcntl() so that it can be called
> > from restart (see patch adding restart of files).
> > 
> > Changelog[v17]
> >   - Introduce 'collect' method
> > Changelog[v17]
> >   - Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h
> > 
> > Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> > Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> > Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> > ---
> >  fs/fcntl.c         |   21 +++++++++++++--------
> >  include/linux/fs.h |    7 +++++++
> >  2 files changed, 20 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > index 97e01dc..e1f02ca 100644
> > --- a/fs/fcntl.c
> > +++ b/fs/fcntl.c
> > @@ -418,6 +418,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> >  	return err;
> >  }
> >  
> > +int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
> > +{
> > +	int err;
> > +
> > +	err = security_file_fcntl(filp, cmd, arg);
> > +	if (err)
> > +		goto out;
> > +	err = do_fcntl(fd, cmd, arg, filp);
> > + out:
> > +	return err;
> > +}
> > +
> >  SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
> >  {	
> >  	struct file *filp;
> > @@ -427,14 +439,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
> >  	if (!filp)
> >  		goto out;
> >  
> > -	err = security_file_fcntl(filp, cmd, arg);
> > -	if (err) {
> > -		fput(filp);
> > -		return err;
> > -	}
> > -
> > -	err = do_fcntl(fd, cmd, arg, filp);
> > -
> > +	err = vfs_fcntl(fd, cmd, arg, filp);
> >   	fput(filp);
> >  out:
> >  	return err;
> 
> There is no point combining these two logically distinct patches.

Good point.

> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 6c08df2..65ebec5 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -394,6 +394,7 @@ struct kstatfs;
> >  struct vm_area_struct;
> >  struct vfsmount;
> >  struct cred;
> > +struct ckpt_ctx;
> >  
> >  extern void __init inode_init(void);
> >  extern void __init inode_init_early(void);
> > @@ -1093,6 +1094,8 @@ struct file_lock {
> >  
> >  #include <linux/fcntl.h>
> >  
> > +extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
> > +
> >  extern void send_sigio(struct fown_struct *fown, int fd, int band);
> >  
> >  #ifdef CONFIG_FILE_LOCKING
> > @@ -1504,6 +1507,8 @@ struct file_operations {
> >  	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
> >  	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
> >  	int (*setlease)(struct file *, long, struct file_lock **);
> > +	int (*checkpoint)(struct ckpt_ctx *, struct file *);
> > +	int (*collect)(struct ckpt_ctx *, struct file *);
> >  };
> >  
> >  struct inode_operations {
> 
> You didn't add any documentation for this (unless it is in a following
> patch, which it shouldn't be).

Another good point -- we should have added that to
Documentation/filesystems/vfs.txt

> 
> > @@ -2313,6 +2318,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
> >  loff_t inode_get_bytes(struct inode *inode);
> >  void inode_set_bytes(struct inode *inode, loff_t bytes);
> >  
> > +#define generic_file_checkpoint NULL
> > +
> >  extern int vfs_readdir(struct file *, filldir_t, void *);
> >  
> >  extern int vfs_stat(char __user *, struct kstat *);
> 
> Hmm, what does generic_file_checkpoint mean? A NULL checkpoint op means
> that checkpointing is allowed, and no action is required? Shouldn't it
> be an opt-in operation, where NULL means not allowed?

generic_file_checkpoint is for files that have a seek operation and can be
backed up or restored with a simple copy.

A NULL checkpoint op means "not allowed" as you thought it should. What
gave you the impression it was otherwise? Here's the relevant snippet
from checkpoint/files.c:

/* checkpoint callback for file pointer */
int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
{
        struct file *file = (struct file *) ptr;
        int ret;

        if (!file->f_op || !file->f_op->checkpoint) {
                ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
                               file, file->f_op);
                return -EBADF;
        }

> Either way, I don't know if you need to have this #define, provided you
> have sufficient documentation.

We need it (or a suitable replacement) to avoid adding #ifdef around
assignments to the operation in every filesystem. It's used if
CONFIG_CHECKPOINT is not defined.
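
To make the two points above concrete, here is a stripped-down userspace
sketch. The struct layouts are reduced to the one member that matters, the
body of generic_file_checkpoint() is a stand-in, and checkpoint_file() is
simplified relative to the snippet quoted above, so treat it as an
illustration under those assumptions rather than the code in the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ckpt_ctx;	/* opaque in this sketch */
struct file;

struct file_operations {
	int (*checkpoint)(struct ckpt_ctx *, struct file *);
};

struct file {
	const struct file_operations *f_op;
};

#ifdef CONFIG_CHECKPOINT
/* stand-in body: pretend the generic file state was written out */
static int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
{
	(void)ctx;
	(void)file;
	return 0;
}
#else
/* with c/r compiled out, the same initializer below yields NULL */
#define generic_file_checkpoint NULL
#endif

/* A filesystem opts in with one line and needs no #ifdef of its own. */
const struct file_operations simple_fops = {
	.checkpoint = generic_file_checkpoint,
};

/* A file type that never opted in: its checkpoint op stays NULL. */
const struct file_operations opaque_fops = {
	.checkpoint = NULL,
};

/* NULL means "not allowed": refuse instead of silently skipping. */
int checkpoint_file(struct ckpt_ctx *ctx, struct file *file)
{
	if (!file->f_op || !file->f_op->checkpoint)
		return -EBADF;
	return file->f_op->checkpoint(ctx, file);
}
```

Built without -DCONFIG_CHECKPOINT, simple_fops ends up with a NULL op and
checkpoint_file() refuses it, the same as for opaque_fops. Built with it,
the one-line initializer picks up the stand-in function.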

Thanks for the review.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]   ` <1268960401-16680-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2010-03-19 23:19     ` Andreas Dilger
@ 2010-03-22 10:30     ` Nick Piggin
  2010-03-22 13:22       ` Matt Helsley
  2010-03-22 13:22       ` Matt Helsley
  1 sibling, 2 replies; 88+ messages in thread
From: Nick Piggin @ 2010-03-22 10:30 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Andreas Dilger

On Thu, Mar 18, 2010 at 08:59:47PM -0400, Oren Laadan wrote:
> @@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
>  		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
>  	}
>  
> +	/* root vfs (FIX: WILL CHANGE with mnt-ns etc */
> +	task_lock(ctx->root_task);
> +	fs = ctx->root_task->fs;
> +	read_lock(&fs->lock);
> +	ctx->root_fs_path = fs->root;
> +	path_get(&ctx->root_fs_path);
> +	read_unlock(&fs->lock);
> +	task_unlock(ctx->root_task);
> +
>  	return 0;
>  }
>  
> diff --git a/checkpoint/files.c b/checkpoint/files.c
> new file mode 100644
> index 0000000..7a57b24
> --- /dev/null
> +++ b/checkpoint/files.c
> +char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
> +{
> +	struct path tmp = *root;
> +	char *fname;
> +
> +	BUG_ON(!buf);
> +	spin_lock(&dcache_lock);
> +	fname = __d_path(path, &tmp, buf, *len);
> +	spin_unlock(&dcache_lock);
> +	if (IS_ERR(fname))
> +		return fname;
> +	*len = (buf + (*len) - fname);
> +	/*
> +	 * FIX: if __d_path() changed these, it must have stepped out of
> +	 * init's namespace. Since currently we require a unified namespace
> +	 * within the container: simply fail.
> +	 */
> +	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
> +		fname = ERR_PTR(-EBADF);

Maybe something like this is better in fs/?
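
On the length arithmetic in the hunk above: __d_path() assembles the name
at the tail of the buffer and returns a pointer to its first byte, which is
why the new length is computed as buf + *len - fname. A userspace sketch of
that pattern follows. The helpers prepend_comp() and fill_fname() are
hypothetical names for this example, not kernel functions.

```c
#include <assert.h>
#include <string.h>

/* Place "/name" immediately in front of 'pos'; NULL if it does not fit. */
static char *prepend_comp(char *buf, char *pos, const char *name)
{
	size_t n = strlen(name);

	if ((size_t)(pos - buf) < n + 1)
		return NULL;
	pos -= n;
	memcpy(pos, name, n);
	*--pos = '/';
	return pos;
}

/*
 * Build a path from components listed leaf first, writing backwards from
 * the end of 'buf'. On success returns the start of the string and sets
 * '*len' to the bytes used, counting the terminating NUL.
 */
char *fill_fname(const char *const *leaf_first, int ncomp, char *buf, int *len)
{
	char *pos;
	int i;

	assert(buf && len);
	if (*len < 1)
		return NULL;
	pos = buf + *len;
	*--pos = '\0';
	for (i = 0; i < ncomp; i++) {
		pos = prepend_comp(buf, pos, leaf_first[i]);
		if (!pos)
			return NULL;
	}
	*len = (int)(buf + *len - pos);	/* same arithmetic as in the hunk */
	return pos;
}
```

Because the string ends at the end of the buffer, the distance from the
returned pointer to the end is all the caller needs to know how many bytes
to copy into the image.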


> +static int scan_fds(struct files_struct *files, int **fdtable)
> +{
> +	struct fdtable *fdt;
> +	int *fds = NULL;
> +	int i = 0, n = 0;
> +	int tot = CKPT_DEFAULT_FDTABLE;
> +
> +	/*
> +	 * We assume that all tasks possibly sharing the file table are
> +	 * frozen (or we are a single process and we checkpoint ourselves).
> +	 * Therefore, we can safely proceed after krealloc() from where we
> +	 * left off. Otherwise the file table may be modified by another
> +	 * task after we scan it. The behavior in this case is undefined,
> +	 * and either checkpoint or restart will likely fail.
> +	 */
> + retry:
> +	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
> +	if (!fds)
> +		return -ENOMEM;
> +
> +	rcu_read_lock();
> +	fdt = files_fdtable(files);
> +	for (/**/; i < fdt->max_fds; i++) {
> +		if (!fcheck_files(files, i))
> +			continue;
> +		if (n == tot) {
> +			rcu_read_unlock();
> +			tot *= 2;	/* won't overflow: kmalloc will fail */
> +			goto retry;
> +		}
> +		fds[n++] = i;
> +	}
> +	rcu_read_unlock();

...

> +static int checkpoint_file_desc(struct ckpt_ctx *ctx,
> +				struct files_struct *files, int fd)
> +{
> +	struct ckpt_hdr_file_desc *h;
> +	struct file *file = NULL;
> +	struct fdtable *fdt;
> +	int objref, ret;
> +	int coe = 0;	/* avoid gcc warning */
> +	pid_t pid;
> +
> +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
> +	if (!h)
> +		return -ENOMEM;
> +
> +	rcu_read_lock();
> +	fdt = files_fdtable(files);
> +	file = fcheck_files(files, fd);
> +	if (file) {
> +		coe = FD_ISSET(fd, fdt->close_on_exec);
> +		get_file(file);
> +	}
> +	rcu_read_unlock();
> +
> +	ret = find_locks_with_owner(file, files);
> +	/*
> +	 * find_locks_with_owner() returns an error when there
> +	 * are no locks found, so we *want* it to return an error
> +	 * code.  Its success means we have to fail the checkpoint.
> +	 */
> +	if (!ret) {
> +		ret = -EBADF;
> +		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
> +		goto out;
> +	}
> +
> +	/* sanity check (although this shouldn't happen) */
> +	ret = -EBADF;
> +	if (!file) {
> +		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
> +		goto out;
> +	}
> +
> +	/*
> +	 * TODO: Implement c/r of fowner and f_sigio.  Should be
> +	 * trivial, but for now we just refuse its checkpoint
> +	 */
> +	pid = f_getown(file);
> +	if (pid) {
> +		ret = -EBUSY;
> +		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd);
> +		goto out;
> +	}
> +
> +	/*
> +	 * if seen first time, this will add 'file' to the objhash, keep
> +	 * a reference to it, dump its state while at it.
> +	 */

All these kinds of things (including above hunks) IMO are nasty to put
outside fs/. It would be nice to see higher level functionality
implemented in fs and exported to your checkpoint stuff.

Apparently it's hard because checkpointing is so incestuous with
everything, but that's why it's important to structure the code well.
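
As a reference for what a higher level helper would have to encapsulate,
the grow-and-resume loop in the scan_fds() hunk above can be modelled in
userspace like this. It is a sketch under assumptions: the byte array
'used' stands in for the fd table, scan_used() is a hypothetical name, and
there is no RCU or freezing here, so it only shows the retry structure.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Collect the indices of all non-zero entries of 'used' into a growing
 * array. Returns the number of indices found, or -1 if allocation fails.
 */
int scan_used(const unsigned char *used, int max, int **out)
{
	int *idx = NULL, *tmp;
	int i = 0, n = 0;
	int tot = 4;	/* deliberately small so the retry path is exercised */

	assert(used && out);
retry:
	tmp = realloc(idx, (size_t)tot * sizeof(*idx));
	if (!tmp) {
		free(idx);	/* realloc() leaves the old block intact on failure */
		return -1;
	}
	idx = tmp;

	for (; i < max; i++) {	/* after a retry, resumes at the same 'i' */
		if (!used[i])
			continue;
		if (n == tot) {
			tot *= 2;
			goto retry;
		}
		idx[n++] = i;
	}
	*out = idx;
	return n;
}
```

Resuming at the old index is only safe if the table cannot change while
the array is being reallocated, which is the frozen-tasks assumption stated
in the comment of the quoted hunk.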

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect()
       [not found]       ` <20100322101635.GC20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-22 11:00         ` Nick Piggin
  0 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2010-03-22 11:00 UTC (permalink / raw)
  To: Matt Helsley
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Andreas Dilger

On Mon, Mar 22, 2010 at 03:16:35AM -0700, Matt Helsley wrote:
> On Mon, Mar 22, 2010 at 05:34:28PM +1100, Nick Piggin wrote:
> > On Thu, Mar 18, 2010 at 08:59:46PM -0400, Oren Laadan wrote:
> > Hmm, what does generic_file_checkpoint mean? A NULL checkpoint op means
> > that checkpointing is allowed, and no action is required? Shouldn't it
> > be an opt-in operation, where NULL means not allowed?
> 
> generic_file_checkpoint is for files that have a seek operation and can be
> backed up or restored with a simple copy.
> 
> A NULL checkpoint op means "not allowed" as you thought it should. What
> gave you the impression it was otherwise? Here's the relevant snippet
> from checkpoint/files.c:

Right, I didn't check that far. It's just a bit strange to make it look
like filling in an aop function when it is actually still NULL.


> 
> /* checkpoint callback for file pointer */
> int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
> {
>         struct file *file = (struct file *) ptr;
>         int ret;
> 
>         if (!file->f_op || !file->f_op->checkpoint) {
>                 ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
>                                file, file->f_op);
>                 return -EBADF;
>         }
> 
> > Either way, I don't know if you need to have this #define, provided you
> > have sufficient documentation.
> 
> We need it (or a suitable replacement) to avoid adding #ifdef around
> assignments to the operation in every filesystem. It's used if
> CONFIG_CHECKPOINT is not defined.

If !CONFIG_CHECKPOINT, ->checkpoint should not exist and neither
should its callers.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-22 10:30     ` Nick Piggin
  2010-03-22 13:22       ` Matt Helsley
@ 2010-03-22 13:22       ` Matt Helsley
  1 sibling, 0 replies; 88+ messages in thread
From: Matt Helsley @ 2010-03-22 13:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Andreas Dilger

On Mon, Mar 22, 2010 at 09:30:35PM +1100, Nick Piggin wrote:
> On Thu, Mar 18, 2010 at 08:59:47PM -0400, Oren Laadan wrote:
> > @@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
> >  		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
> >  	}
> >  
> > +	/* root vfs (FIX: WILL CHANGE with mnt-ns etc */
> > +	task_lock(ctx->root_task);
> > +	fs = ctx->root_task->fs;
> > +	read_lock(&fs->lock);
> > +	ctx->root_fs_path = fs->root;
> > +	path_get(&ctx->root_fs_path);
> > +	read_unlock(&fs->lock);
> > +	task_unlock(ctx->root_task);
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/checkpoint/files.c b/checkpoint/files.c
> > new file mode 100644
> > index 0000000..7a57b24
> > --- /dev/null
> > +++ b/checkpoint/files.c
> > +char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
> > +{
> > +	struct path tmp = *root;
> > +	char *fname;
> > +
> > +	BUG_ON(!buf);
> > +	spin_lock(&dcache_lock);
> > +	fname = __d_path(path, &tmp, buf, *len);
> > +	spin_unlock(&dcache_lock);
> > +	if (IS_ERR(fname))
> > +		return fname;
> > +	*len = (buf + (*len) - fname);
> > +	/*
> > +	 * FIX: if __d_path() changed these, it must have stepped out of
> > +	 * init's namespace. Since currently we require a unified namespace
> > +	 * within the container: simply fail.
> > +	 */
> > +	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
> > +		fname = ERR_PTR(-EBADF);
> 
> Maybe something like this is better in fs/?
> 
> 
> > +static int scan_fds(struct files_struct *files, int **fdtable)
> > +{
> > +	struct fdtable *fdt;
> > +	int *fds = NULL;
> > +	int i = 0, n = 0;
> > +	int tot = CKPT_DEFAULT_FDTABLE;
> > +
> > +	/*
> > +	 * We assume that all tasks possibly sharing the file table are
> > +	 * frozen (or we are a single process and we checkpoint ourselves).
> > +	 * Therefore, we can safely proceed after krealloc() from where we
> > +	 * left off. Otherwise the file table may be modified by another
> > +	 * task after we scan it. The behavior in this case is undefined,
> > +	 * and either checkpoint or restart will likely fail.
> > +	 */
> > + retry:
> > +	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
> > +	if (!fds)
> > +		return -ENOMEM;
> > +
> > +	rcu_read_lock();
> > +	fdt = files_fdtable(files);
> > +	for (/**/; i < fdt->max_fds; i++) {
> > +		if (!fcheck_files(files, i))
> > +			continue;
> > +		if (n == tot) {
> > +			rcu_read_unlock();
> > +			tot *= 2;	/* won't overflow: kmalloc will fail */
> > +			goto retry;
> > +		}
> > +		fds[n++] = i;
> > +	}
> > +	rcu_read_unlock();
> 
> ...
> 
> > +static int checkpoint_file_desc(struct ckpt_ctx *ctx,
> > +				struct files_struct *files, int fd)
> > +{
> > +	struct ckpt_hdr_file_desc *h;
> > +	struct file *file = NULL;
> > +	struct fdtable *fdt;
> > +	int objref, ret;
> > +	int coe = 0;	/* avoid gcc warning */
> > +	pid_t pid;
> > +
> > +	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
> > +	if (!h)
> > +		return -ENOMEM;
> > +
> > +	rcu_read_lock();
> > +	fdt = files_fdtable(files);
> > +	file = fcheck_files(files, fd);
> > +	if (file) {
> > +		coe = FD_ISSET(fd, fdt->close_on_exec);
> > +		get_file(file);
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	ret = find_locks_with_owner(file, files);
> > +	/*
> > +	 * find_locks_with_owner() returns an error when there
> > +	 * are no locks found, so we *want* it to return an error
> > +	 * code.  Its success means we have to fail the checkpoint.
> > +	 */
> > +	if (!ret) {
> > +		ret = -EBADF;
> > +		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
> > +		goto out;
> > +	}
> > +
> > +	/* sanity check (although this shouldn't happen) */
> > +	ret = -EBADF;
> > +	if (!file) {
> > +		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * TODO: Implement c/r of fowner and f_sigio.  Should be
> > +	 * trivial, but for now we just refuse its checkpoint
> > +	 */
> > +	pid = f_getown(file);
> > +	if (pid) {
> > +		ret = -EBUSY;
> > +		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd);
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * if seen first time, this will add 'file' to the objhash, keep
> > +	 * a reference to it, dump its state while at it.
> > +	 */
> 
> All these kinds of things (including above hunks) IMO are nasty to put
> outside fs/. It would be nice to see higher level functionality
> implemented in fs and exported to your checkpoint stuff.

Agreed. I posted a series of patches that reorganized the non-filesystem
checkpoint/restart code by distributing it to more appropriate places.
If you can stomach web interfaces have a look at:

	http://thread.gmane.org/gmane.linux.kernel.containers/16617

It'll take some effort to reorganize and retest ckpt-v20 as I did for
v19. Then I've got to do the same for the filesystem portions. I think
that would complete the reorganization.

> Apparently it's hard because checkpointing is so incestuous with
> everything, but that's why it's important to structure the code well.

You're saying it's difficult to organize because it's got to work with
quite a few disparate VFS structures? My impression is the code breaks
down pretty well along existing lines (fds, fd table, struct files...).
The main problems are resolving the effects of CONFIG_CHECKPOINT=n and
header inclusion messes.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-22 13:22       ` Matt Helsley
       [not found]         ` <20100322132232.GD20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-22 13:38         ` Nick Piggin
  1 sibling, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2010-03-22 13:38 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Oren Laadan, linux-fsdevel, containers, Andreas Dilger

On Mon, Mar 22, 2010 at 06:22:32AM -0700, Matt Helsley wrote:
> On Mon, Mar 22, 2010 at 09:30:35PM +1100, Nick Piggin wrote:
> > On Thu, Mar 18, 2010 at 08:59:47PM -0400, Oren Laadan wrote:
> > > +	/*
> > > +	 * if seen first time, this will add 'file' to the objhash, keep
> > > +	 * a reference to it, dump its state while at it.
> > > +	 */
> > 
> > All these kinds of things (including above hunks) IMO are nasty to put
> > outside fs/. It would be nice to see higher level functionality
> > implemented in fs and exported to your checkpoint stuff.
> 
> Agreed. I posted a series of patches that reorganized the non-filesystem
> checkpoint/restart code by distributing it to more appropriate places.
> If you can stomach web interfaces have a look at:
> 
> 	http://thread.gmane.org/gmane.linux.kernel.containers/16617
> 
> It'll take some effort to reorganize and retest ckpt-v20 as I did for
> v19. Then I've got to do the same for the filesystem portions. I think
> that would complete the reorganization.

It may get easier for fs people to review because they won't have
to wade through as much checkpoint code.

 
> > Apparently it's hard because checkpointing is so incestuous with
> > everything, but that's why it's important to structure the code well.
> 
> You're saying it's difficult to organize because it's got to work with
> quite a few disparate VFS structures? My impression is the code breaks
> down pretty well along existing lines (fds, fd table, struct files...).

The problem is that you are poking inside the internals of the vfs from
your module. That is disliked because any change to the vfs then has to
be made with an eye to the checkpoint/ code. If you instead implement
the required higher-level functionality in fs, it is easier to
ensure that it is correct and that the interfaces are used correctly.


> The main problems are resolving the effects of CONFIG_CHECKPOINT=n and
> header inclusion messes.

That's your main problem?

#ifdefs are discouraged, but not if avoiding them makes the whole structure
of the code more convoluted. Using ifdefs in _ops structure init, for example,
isn't a big problem.
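
For illustration only (this is not code from the patch set): a minimal
userspace sketch of the two points above, assuming a made-up ops table.
The "fs" side owns one exported helper, demo_file_checkpoint(), and the
only #ifdefs sit in the ops structure definition and its initializer.
Every demo_* name is hypothetical, and CONFIG_CHECKPOINT is defined by
hand here where a real build would get it from Kconfig.

```c
#include <errno.h>
#include <stddef.h>

/* Assumption: set by hand for the demo; remove to build the =n variant. */
#define CONFIG_CHECKPOINT 1

struct demo_file;

/* Hypothetical ops table, loosely modelled on struct file_operations. */
struct demo_file_ops {
	int (*read)(struct demo_file *f);
#ifdef CONFIG_CHECKPOINT
	int (*checkpoint)(struct demo_file *f);
#endif
};

struct demo_file {
	const struct demo_file_ops *ops;
	int pos;
};

static int demo_read(struct demo_file *f)
{
	return f->pos;
}

#ifdef CONFIG_CHECKPOINT
static int demo_checkpoint(struct demo_file *f)
{
	return f->pos;	/* pretend the saved state is just the position */
}
#endif

/* The #ifdef lives in the initializer, not in every caller. */
static const struct demo_file_ops demo_ops = {
	.read = demo_read,
#ifdef CONFIG_CHECKPOINT
	.checkpoint = demo_checkpoint,
#endif
};

#ifdef CONFIG_CHECKPOINT
/* Higher-level helper owned by the "fs" side; callers never touch ops. */
int demo_file_checkpoint(struct demo_file *f)
{
	if (!f->ops || !f->ops->checkpoint)
		return -ENOSYS;
	return f->ops->checkpoint(f);
}
#else
/* =n stub so that callers compile unchanged. */
static inline int demo_file_checkpoint(struct demo_file *f)
{
	(void)f;
	return -ENOSYS;
}
#endif
```

A caller outside "fs" would only ever call demo_file_checkpoint() and
check for -ENOSYS; it never needs to know how the ops table is laid out.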


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-22  2:12             ` Matt Helsley
@ 2010-03-22 13:51               ` Jamie Lokier
       [not found]               ` <20100322021242.GI2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2010-03-22 23:18               ` Andreas Dilger
  2 siblings, 0 replies; 88+ messages in thread
From: Jamie Lokier @ 2010-03-22 13:51 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Daniel Lezcano, Serge E. Hallyn, linux-fsdevel, containers,
	Andreas Dilger

Matt Helsley wrote:
> These are the same kinds of problems encountered during backup. You
> can play fast and loose -- like taking a backup while everything is
> running -- or you can play it conservative and freeze things.

Not really.  The issue isn't files getting deleted during the checkpoint,
it's files deleted or renamed over _prior_ to beginning checkpoint.

That's a common situation.  For example if someone did a software
package update, you can easily have processes which reference deleted
files running for months.  Same if a program keeps open a data file
which is edited by a text editor, which renames when saving.  Etc,
etc.

> I think btrfs snapshots are just one possible solution and it's not
> overkill.

I don't think btrfs snapshots solve the problem anyway, unless you
also have a way to look up a file by inode number or equivalent, or
the other ideas discussed such as making a link to a deleted file.

Note that it isn't _just_ deleted files.  The name in question may be
deleted but there may still be other links to the file.  Or it could
be opened via different link names, some or all of which have been
deleted or renamed over.

In those cases it would be a bug to make a copy of the deleted file
in the checkpoint state, or in the filesystem, as were mentioned
earlier...
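
For illustration only (not from the thread): a small userspace sketch of
the two cases above, assuming a typical local Linux filesystem and a
writable directory. It opens a file, optionally adds a second hard link,
removes the name it was opened through, and reports st_nlink as seen
through the still-open descriptor. The function name and the file names
are made up for the demo.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns the link count of the file behind an open descriptor after the
 * name it was opened through has been removed, or -1 on error.
 * If keep_link is non-zero, a second hard link is created first.
 */
int nlink_after_unlink(const char *dir, int keep_link)
{
	char name[4096], other[4096];
	struct stat st;
	int fd, ret = -1;

	snprintf(name, sizeof(name), "%s/cr-demo-%ld", dir, (long)getpid());
	snprintf(other, sizeof(other), "%s/cr-demo-%ld.link", dir,
		 (long)getpid());

	fd = open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
	if (fd < 0)
		return -1;
	if (keep_link && link(name, other) < 0)
		goto out;
	if (unlink(name) < 0)
		goto out;
	if (fstat(fd, &st) < 0)
		goto out;
	ret = (int)st.st_nlink;
out:
	close(fd);
	unlink(name);	/* harmless if the name is already gone */
	unlink(other);	/* harmless if the link was never created */
	return ret;
}
```

With keep_link == 0 the descriptor refers to a file with no names left;
with keep_link == 1 the opened name is gone but the file is still linked
elsewhere, which is the case where copying it would be wrong.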

> I imagine fanotify could also be useful so long as userspace has marked
> things correctly prior to checkpoint. My high level understanding of
> fanotify was we'd be able to delay (or deny) deletion until checkpoint
> is complete.

Yes, that might be a way to block filesystem changes during
checkpoint, although fanotify's capabilities weren't complete enough
for this, last time I looked.  (It didn't give sufficient information
about directory operations.)

-- Jamie

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-22  3:37             ` Matt Helsley
       [not found]               ` <20100322033724.GA20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-22 14:13               ` Jamie Lokier
  1 sibling, 0 replies; 88+ messages in thread
From: Jamie Lokier @ 2010-03-22 14:13 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Andreas Dilger, Oren Laadan, linux-fsdevel, containers

Matt Helsley wrote:
> > I wonder if d_unlinked() is always true for a file which is opened,
> > unlinked or renamed over, but has a hard link to it from elsewhere so
> > the on-disk file hasn't gone away.
> 
> Well, if the on-disk file hasn't gone away due to a hardlink then we
> won't need to save the file in the checkpoint image -- the filesystem
> content backup done during checkpoint should also get the file contents.

When that happens, how do you open the correct file on restart?  You
don't know the other link names unless you scan the entire filesystem.
Is that done?

> > I guess it probably is.  That's kinda neat!  I'd hoped there would be a
> > good reason for f_dentry eventually ;-)
> > 
> > What about files opened through /proc/self/fd/N before or after the
> > original file was unlinked/renamed-over.  Where does the dentry point?
> 
> Before the unlink it will result in the same file being opened. If it's
> opened by a task being checkpointed then we'll be in the same situation
> as the "self" task. If it's opened by a task not being checkpointed then
> the "leak detection" code will notice that there's an unaccounted reference
> to the file and checkpoint will fail.

In a nutshell, is that: If you have a filp (open file pointer
(i.e. including seek position)) which is shared between a task which
is checkpointed and a task which isn't checkpointed, that is the
unaccounted reference and will fail?  E.g. as you might get with
dup+fork or AF_LOCAL descriptor passing?

Assuming yes, that has nothing specific to do with /proc.  My question
about /proc was just about whether the newly open file shares the
dentry or gets a new one, I suppose.

Note that...

> So that hopefully addresses your questions regarding the use of the symlinks
> before the unlink.
> 
> After the unlink those symlinks are broken since they have "(deleted)"
> appended.

...the "links" in /proc/N/fd/ are *not* real symlinks, and opening them
does not use the text returned by readlink().

The "(deleted)" text doesn't stop them from being opened after they
are unlinked or renamed over (and it certainly doesn't try to open a
file with " (deleted)" in the name :-).
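
For illustration only (not from the thread): a minimal userspace sketch
of that behaviour, assuming Linux with /proc mounted and a writable
directory. It writes a message to a new file, unlinks it, and then opens
the /proc/self/fd/N entry to read the message back. The function name
and the file name are made up for the demo.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a file in dir, write msg, unlink it, then open it again through
 * /proc/self/fd/N and copy what was read into out (NUL-terminated).
 * Returns 0 on success, -1 on error.
 */
int reopen_deleted(const char *dir, const char *msg, char *out, size_t outsz)
{
	char name[4096], procpath[64], target[4200];
	size_t len = strlen(msg);
	ssize_t n;
	int fd, fd2, ret = -1;

	snprintf(name, sizeof(name), "%s/cr-proc-demo-%ld", dir,
		 (long)getpid());
	fd = open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
	if (fd < 0)
		return -1;
	if (write(fd, msg, len) != (ssize_t)len || unlink(name) < 0)
		goto out;

	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);

	/* readlink() typically reports the old name plus " (deleted)" ... */
	n = readlink(procpath, target, sizeof(target) - 1);
	if (n < 0)
		goto out;
	target[n] = '\0';
	printf("readlink: %s\n", target);

	/* ... yet open() on the same /proc entry still reaches the file. */
	fd2 = open(procpath, O_RDONLY);
	if (fd2 < 0)
		goto out;
	n = read(fd2, out, outsz - 1);
	close(fd2);
	if (n < 0)
		goto out;
	out[n] = '\0';
	ret = 0;
out:
	close(fd);
	unlink(name);	/* harmless if the name is already gone */
	return ret;
}
```

The second open() yields a new open file with its own position, so the
read starts at offset 0 even though the first descriptor is at the end.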

> As with relinking, we need a good way to do the "temporary location".
> That is complicated because we need to choose a location that we have
> permission to write to, always exists during restart, and is guaranteed
> not to have files in it. Relinking the file shifts these problems from
> restart to checkpoint.

It also breaks programs which expect fstat() to always return the same
st_ino while a file is open.  Even FAT guarantees that, I think :-)
Can't win them all :-)

-- Jamie

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-22  2:12             ` Matt Helsley
  2010-03-22 13:51               ` Jamie Lokier
       [not found]               ` <20100322021242.GI2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-03-22 23:18               ` Andreas Dilger
  2 siblings, 0 replies; 88+ messages in thread
From: Andreas Dilger @ 2010-03-22 23:18 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Daniel Lezcano, Serge E. Hallyn, linux-fsdevel, containers, Jamie Lokier

On 2010-03-21, at 20:12, Matt Helsley wrote:
> These are the same kinds of problems encountered during backup. You
> can play fast and loose -- like taking a backup while everything is
> running -- or you can play it conservative and freeze things.
>
> I think btrfs snapshots are just one possible solution and it's not
> overkill.
>
> For some filesystems it might make sense to use the filesystem freezer to
> ensure that no files are deleted while the backup takes place. Combined
> with tools like rsync or rdiff backup these operations could be low bandwidth
> and low latency if well-known live-migration techniques are used.
>
> Or use dm snapshots.


If you are using snapshots, then even an open-unlinked file will not
be deleted from the filesystem until it is closed, because the inode
will still be available on disk even without the filename.  That would
be a good reason to also store the file handle (e.g. inode+generation
for simple filesystems) in the checkpoint file, so that you can
re-open this file by the file handle after the process is restarted.

Since Aneesh is starting to add an interface for this to the kernel  
anyway, I don't think it would be very hard to dump/restore a handful  
of extra bytes with each file.  Conversely, now is the time for  
getting the open-by-handle APIs correct for this code.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public
  2010-03-22  6:31   ` Nick Piggin
@ 2010-03-23  0:12     ` Oren Laadan
  2010-03-23  0:43       ` Nick Piggin
       [not found]       ` <Pine.LNX.4.64.1003221959450.1520-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
  2010-03-23  0:12     ` Oren Laadan
  1 sibling, 2 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-23  0:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, containers, Matt Helsley, Andreas Dilger

On Mon, 22 Mar 2010, Nick Piggin wrote:

> On Thu, Mar 18, 2010 at 08:59:45PM -0400, Oren Laadan wrote:
> > These two are used in the next patch when calling vfs_read/write()
> 
> Said next patch didn't seem to make it to fsdevel.

Thanks for reviewing, and sorry about this glitch - see below.

> 
> Should it at least go to fs/internal.h?

Sure.

So here is the relevant hunk from said patch (the entire
patch is: https://patchwork.kernel.org/patch/86389/):

+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   ckpt_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		uaddr += nwrite;
+	}
+	return 0;
+}
+
+int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kwrite(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+static inline int _ckpt_kread(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE;		/* unexpected EOF */
+			return nread;
+		}
+		uaddr += nread;
+	}
+	return 0;
+}
+
+int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _ckpt_kread(ctx->file , addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}

Oren.
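
For readers following along outside the kernel, the retry loops in
_ckpt_kwrite() and _ckpt_kread() can be approximated in user space.
The sketch below is not part of the patch: write_full() and read_full()
are hypothetical names, they use write(2)/read(2) rather than
vfs_write()/vfs_read(), and the EINTR retry is an addition that the
kernel hunk does not have.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical user-space analogue of _ckpt_kwrite(): keep writing
 * until the whole buffer is out, retrying on transient errors.
 * Returns 0 on success or a negative errno value.
 */
static int write_full(int fd, const void *buf, size_t count)
{
	const char *p = buf;

	assert(buf != NULL || count == 0);
	while (count > 0) {
		ssize_t n = write(fd, p, count);

		if (n < 0) {
			/* EAGAIN as in the patch; EINTR is added here */
			if (errno == EAGAIN || errno == EINTR)
				continue;
			return -errno;
		}
		p += n;
		count -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical analogue of _ckpt_kread(): a short read is retried,
 * and EOF before 'count' bytes arrive is reported as -EPIPE.
 */
static int read_full(int fd, void *buf, size_t count)
{
	char *p = buf;

	assert(buf != NULL || count == 0);
	while (count > 0) {
		ssize_t n = read(fd, p, count);

		if (n < 0) {
			if (errno == EAGAIN || errno == EINTR)
				continue;
			return -errno;
		}
		if (n == 0)
			return -EPIPE;	/* unexpected EOF */
		p += n;
		count -= (size_t)n;
	}
	return 0;
}
```

As in the patch, both helpers return 0 only once every byte has been
transferred, so a caller never has to deal with partial I/O. Note that
retrying on EAGAIN busy-waits if the descriptor is non-blocking.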

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public
  2010-03-23  0:12     ` Oren Laadan
@ 2010-03-23  0:43       ` Nick Piggin
  2010-03-23  0:56         ` Oren Laadan
  2010-03-23  0:56         ` Oren Laadan
       [not found]       ` <Pine.LNX.4.64.1003221959450.1520-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
  1 sibling, 2 replies; 88+ messages in thread
From: Nick Piggin @ 2010-03-23  0:43 UTC (permalink / raw)
  To: Oren Laadan; +Cc: linux-fsdevel, containers, Matt Helsley, Andreas Dilger

On Mon, Mar 22, 2010 at 08:12:45PM -0400, Oren Laadan wrote:
> On Mon, 22 Mar 2010, Nick Piggin wrote:
> 
> > On Thu, Mar 18, 2010 at 08:59:45PM -0400, Oren Laadan wrote:
> > > These two are used in the next patch when calling vfs_read/write()
> > 
> > Said next patch didn't seem to make it to fsdevel.
> 
> Thanks for reviewing, and sorry about this glitch - see below.
> 
> > 
> > Should it at least go to fs/internal.h?
> 
> Sure.
> 
> So Here is the relevant hunk from said patch (the entire
> patch is: https://patchwork.kernel.org/patch/86389/):
> 
> +/*
> + * Helpers to write(read) from(to) kernel space to(from) the checkpoint
> + * image file descriptor (similar to how a core-dump is performed).
> + *
> + *   ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
> + *   ckpt_kread() - read from the checkpoint image to a kernel-space buffer

Hmm, OK. Slightly-more-write(2) type of write.

fs/splice.c code also has a kernel_write and readv. Not sure if there is
any other common code. But maybe it would be better to put together some
useful helpers under fs/ rather than a ckpt specific thing.

> + */
> +
> +static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
> +{
> +	void __user *uaddr = (__force void __user *) addr;
> +	ssize_t nwrite;
> +	int nleft;
> +
> +	for (nleft = count; nleft; nleft -= nwrite) {
> +		loff_t pos = file_pos_read(file);
> +		nwrite = vfs_write(file, uaddr, nleft, &pos);
> +		file_pos_write(file, pos);
> +		if (nwrite < 0) {
> +			if (nwrite == -EAGAIN)
> +				nwrite = 0;
> +			else
> +				return nwrite;
> +		}
> +		uaddr += nwrite;
> +	}
> +	return 0;
> +}
> +
> +int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
> +{
> +	mm_segment_t fs;
> +	int ret;
> +
> +	fs = get_fs();
> +	set_fs(KERNEL_DS);
> +	ret = _ckpt_kwrite(ctx->file, addr, count);
> +	set_fs(fs);
> +
> +	ctx->total += count;
> +	return ret;
> +}
> +
> +static inline int _ckpt_kread(struct file *file, void *addr, int count)
> +{
> +	void __user *uaddr = (__force void __user *) addr;
> +	ssize_t nread;
> +	int nleft;
> +
> +	for (nleft = count; nleft; nleft -= nread) {
> +		loff_t pos = file_pos_read(file);
> +		nread = vfs_read(file, uaddr, nleft, &pos);
> +		file_pos_write(file, pos);
> +		if (nread <= 0) {
> +			if (nread == -EAGAIN) {
> +				nread = 0;
> +				continue;
> +			} else if (nread == 0)
> +				nread = -EPIPE;		/* unexpected EOF */
> +			return nread;
> +		}
> +		uaddr += nread;
> +	}
> +	return 0;
> +}
> +
> +int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
> +{
> +	mm_segment_t fs;
> +	int ret;
> +
> +	fs = get_fs();
> +	set_fs(KERNEL_DS);
> +	ret = _ckpt_kread(ctx->file , addr, count);
> +	set_fs(fs);
> +
> +	ctx->total += count;
> +	return ret;
> +}
> 
> Oren.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public
  2010-03-23  0:43       ` Nick Piggin
  2010-03-23  0:56         ` Oren Laadan
@ 2010-03-23  0:56         ` Oren Laadan
  1 sibling, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-23  0:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, containers, Matt Helsley, Andreas Dilger



Nick Piggin wrote:
> On Mon, Mar 22, 2010 at 08:12:45PM -0400, Oren Laadan wrote:
>> On Mon, 22 Mar 2010, Nick Piggin wrote:
>>
>>> On Thu, Mar 18, 2010 at 08:59:45PM -0400, Oren Laadan wrote:
>>>> These two are used in the next patch when calling vfs_read/write()
>>> Said next patch didn't seem to make it to fsdevel.
>> Thanks for reviewing, and sorry about this glitch - see below.
>>
>>> Should it at least go to fs/internal.h?
>> Sure.
>>
>> So Here is the relevant hunk from said patch (the entire
>> patch is: https://patchwork.kernel.org/patch/86389/):
>>
>> +/*
>> + * Helpers to write(read) from(to) kernel space to(from) the checkpoint
>> + * image file descriptor (similar to how a core-dump is performed).
>> + *
>> + *   ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
>> + *   ckpt_kread() - read from the checkpoint image to a kernel-space buffer
> 
> Hmm, OK. Slightly-more-write(2) type of write.
> 
> fs/splice.c code also has a kernel_write and readv. Not sure if there is
> any other common code. But maybe it would be better to put together some
> useful helpers under fs/ rather than a ckpt specific thing.

Right. Another place is fs/exec.c that provides kernel_read().
I'll put the common code in fs/read_write.c then.

Oren.

> 
>> + */
>> +
>> +static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
>> +{
>> +	void __user *uaddr = (__force void __user *) addr;
>> +	ssize_t nwrite;
>> +	int nleft;
>> +
>> +	for (nleft = count; nleft; nleft -= nwrite) {
>> +		loff_t pos = file_pos_read(file);
>> +		nwrite = vfs_write(file, uaddr, nleft, &pos);
>> +		file_pos_write(file, pos);
>> +		if (nwrite < 0) {
>> +			if (nwrite == -EAGAIN)
>> +				nwrite = 0;
>> +			else
>> +				return nwrite;
>> +		}
>> +		uaddr += nwrite;
>> +	}
>> +	return 0;
>> +}
>> +
>> +int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
>> +{
>> +	mm_segment_t fs;
>> +	int ret;
>> +
>> +	fs = get_fs();
>> +	set_fs(KERNEL_DS);
>> +	ret = _ckpt_kwrite(ctx->file, addr, count);
>> +	set_fs(fs);
>> +
>> +	ctx->total += count;
>> +	return ret;
>> +}
>> +
>> +static inline int _ckpt_kread(struct file *file, void *addr, int count)
>> +{
>> +	void __user *uaddr = (__force void __user *) addr;
>> +	ssize_t nread;
>> +	int nleft;
>> +
>> +	for (nleft = count; nleft; nleft -= nread) {
>> +		loff_t pos = file_pos_read(file);
>> +		nread = vfs_read(file, uaddr, nleft, &pos);
>> +		file_pos_write(file, pos);
>> +		if (nread <= 0) {
>> +			if (nread == -EAGAIN) {
>> +				nread = 0;
>> +				continue;
>> +			} else if (nread == 0)
>> +				nread = -EPIPE;		/* unexpected EOF */
>> +			return nread;
>> +		}
>> +		uaddr += nread;
>> +	}
>> +	return 0;
>> +}
>> +
>> +int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
>> +{
>> +	mm_segment_t fs;
>> +	int ret;
>> +
>> +	fs = get_fs();
>> +	set_fs(KERNEL_DS);
>> +	ret = _ckpt_kread(ctx->file , addr, count);
>> +	set_fs(fs);
>> +
>> +	ctx->total += count;
>> +	return ret;
>> +}
>>
>> Oren.
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [C/R v20][PATCH 38/96] c/r: dump open file descriptors
       [not found]                                                                           ` <1268842164-5590-38-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-03-17 16:08                                                                             ` Oren Laadan
  0 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ingo Molnar

Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, they are assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).
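
The objref idea can be illustrated with a small user-space sketch. It
is not the objhash code from this series: struct obj_table and
obj_lookup_add() are hypothetical names, and a linear scan over a
fixed-size array stands in for the real hash. The point is only that a
shared pointer is assigned one objref the first time it is seen and
the same objref on every later lookup.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_MAX 64	/* arbitrary capacity chosen for this sketch */

/* Hypothetical stand-in for the object hash: pointer -> objref */
struct obj_table {
	const void *ptr[OBJ_MAX];
	int count;
};

/*
 * Return the objref (1-based) for 'ptr'. '*first' is set to 1 when
 * the pointer was not registered before, which is the moment a real
 * checkpoint would dump the object's state, and to 0 otherwise.
 * Returns -1 if the table is full.
 */
static int obj_lookup_add(struct obj_table *t, const void *ptr, int *first)
{
	int i;

	assert(t != NULL && ptr != NULL && first != NULL);
	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (t->count == OBJ_MAX)
		return -1;
	t->ptr[t->count++] = ptr;
	*first = 1;
	return t->count;
}
```

Two file descriptors that share one 'struct file' would therefore both
record the same objref, while the file state itself is written once.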

Also provide generic_file_checkpoint() and generic_restore_file(),
which are suitable for regular files and directories. They do not yet
support unlinked files or directories.
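
As a rough user-space illustration of the unlinked-file case that is
refused here, the sketch below checks whether an open descriptor still
has a name. It is only an approximation and not what the patch does:
the kernel code tests the dentry with d_unlinked(), whereas this sketch
looks at st_nlink from fstat(2), which stays non-zero if another hard
link to the same inode remains. fd_is_unlinked() and unlinked_demo()
are hypothetical names, and the /tmp path is an assumption.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Return 1 if 'fd' refers to a file with no remaining links,
 * 0 if it is still linked, and -1 if fstat() fails.
 */
static int fd_is_unlinked(int fd)
{
	struct stat st;

	assert(fd >= 0);
	if (fstat(fd, &st) < 0)
		return -1;
	return st.st_nlink == 0;
}

/*
 * Create a temporary file, test it before and after unlink().
 * Returns 0 if the file is reported as linked first and as
 * unlinked afterwards, -1 otherwise.
 */
static int unlinked_demo(void)
{
	char path[] = "/tmp/ckpt-sketch-XXXXXX";
	int fd = mkstemp(path);
	int before, after;

	if (fd < 0)
		return -1;
	before = fd_is_unlinked(fd);
	unlink(path);
	after = fd_is_unlinked(fd);
	close(fd);
	return (before == 0 && after == 1) ? 0 : -1;
}
```

A file in this state has no pathname for a checkpoint to record, which
is why it cannot be handled by saving the filename alone.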

Changelog[v19]:
  - Fix false negative of test for unlinked files at checkpoint
Changelog[v19-rc3]:
  - [Serge Hallyn] Rename fs_mnt to root_fs_path
  - [Dave Hansen] Error out on file locks and leases
  - [Serge Hallyn] Refuse checkpoint of file with f_owner
Changelog[v19-rc1]:
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
  - Add a few more ckpt_write_err()s
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Introduce ckpt_collect_file() that also uses file->collect method
  - In collect_file_table() use retval from ckpt_obj_collect() to
    test for first-time-object
Changelog[v17]:
  - Only collect sub-objects of files_struct once
  - Better file error debugging
  - Use (new) d_unlinked()
Changelog[v16]:
  - Fix compile warning in checkpoint_bad()
Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  ckpt_write_files() => checkpoint_fd_table()
  - Rename:  ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   11 +
 checkpoint/files.c               |  444 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   52 +++++
 checkpoint/process.c             |   33 +++-
 checkpoint/sys.c                 |    8 +
 fs/locks.c                       |   35 +++
 include/linux/checkpoint.h       |   19 ++
 include/linux/checkpoint_hdr.h   |   59 +++++
 include/linux/checkpoint_types.h |    5 +
 include/linux/fs.h               |   10 +
 11 files changed, 677 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/files.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	objhash.o \
 	checkpoint.o \
 	restart.o \
-	process.o
+	process.o \
+	files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c016a2d..2bc2495 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -18,6 +18,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fs_struct.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -490,6 +491,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
+	struct fs_struct *fs;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
+	/* root vfs (FIX: WILL CHANGE with mnt-ns etc.) */
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
+	read_lock(&fs->lock);
+	ctx->root_fs_path = fs->root;
+	path_get(&ctx->root_fs_path);
+	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
+
 	return 0;
 }
 
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..7a57b24
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,444 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * ckpt_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *len);
+	spin_unlock(&dcache_lock);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = ckpt_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		ret = ckpt_write_obj_type(ctx, fname, flen,
+					  CKPT_HDR_FILE_NAME);
+	} else {
+		ret = PTR_ERR(fname);
+		ckpt_err(ctx, ret, "%(T)%(S)Obtain filename\n",
+			 path->dentry->d_name.name);
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	ckpt_debug("file %s credref %d", file->f_dentry->d_name.name,
+		h->f_credref);
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	if (d_unlinked(file->f_dentry)) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
+			 file);
+		return -EBADF;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+	int ret;
+
+	if (!file->f_op || !file->f_op->checkpoint) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
+			       file, file->f_op);
+		return -EBADF;
+	}
+
+	ret = file->f_op->checkpoint(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
+	return ret;
+}
+
+/**
+ * checkpoint_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls checkpoint_file() to dump the file pointer too.
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+	pid_t pid;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	ret = -EBADF;
+	if (!file) {
+		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
+		goto out;
+	}
+
+	ret = find_locks_with_owner(file, files);
+	/*
+	 * find_locks_with_owner() returns an error when there
+	 * are no locks found, so we *want* it to return an error
+	 * code.  Its success means we have to fail the checkpoint.
+	 */
+	if (!ret) {
+		ret = -EBADF;
+		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
+		goto out;
+	}
+
+	/*
+	 * TODO: Implement c/r of fowner and f_sigio.  Should be
+	 * trivial, but for now we just refuse its checkpoint
+	 */
+	pid = f_getown(file);
+	if (pid) {
+		ret = -EBUSY;
+		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+out:
+	ckpt_hdr_put(ctx, h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+static int do_checkpoint_file_table(struct ckpt_ctx *ctx,
+				    struct files_struct *files)
+{
+	struct ckpt_hdr_file_table *h;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (!h)
+		return -ENOMEM;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	kfree(fdtable);
+	return ret;
+}
+
+/* checkpoint callback for file table */
+int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_file_table(ctx, (struct files_struct *) ptr);
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int objref;
+
+	files = get_files_struct(t);
+	if (!files)
+		return -EBUSY;
+	objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+	put_files_struct(files);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file)
+{
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+	if (ret <= 0)
+		return ret;
+	/* if first time for this file (ret > 0), invoke ->collect() */
+	if (file->f_op->collect)
+		ret = file->f_op->collect(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)File collect\n", file);
+	return ret;
+}
+
+static int collect_file_desc(struct ckpt_ctx *ctx,
+			     struct files_struct *files, int fd)
+{
+	struct fdtable *fdt;
+	struct file *file;
+	int ret;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file)
+		get_file(file);
+	rcu_read_unlock();
+
+	if (!file) {
+		ckpt_err(ctx, -EBUSY, "%(T)%(P)File removed\n", file);
+		return -EBUSY;
+	}
+
+	ret = ckpt_collect_file(ctx, file);
+	fput(file);
+
+	return ret;
+}
+
+static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files)
+{
+	int *fdtable;
+	int nfds, n;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this file table (ret > 0), proceed inside */
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+
+	for (n = 0; n < nfds; n++) {
+		ret = collect_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+	kfree(fdtable);
+	return ret;
+}
+
+int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int ret;
+
+	files = get_files_struct(t);
+	if (!files) {
+		ckpt_err(ctx, -EBUSY, "%(T)files_struct missing\n");
+		return -EBUSY;
+	}
+	ret = collect_file_table(ctx, files);
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 22b1601..f25d130 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,8 @@
 
 #include <linux/kernel.h>
 #include <linux/hash.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_file_table_grab(void *ptr)
+{
+	atomic_inc(&((struct files_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_file_table_drop(void *ptr, int lastref)
+{
+	put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_table_users(void *ptr)
+{
+	return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr, int lastref)
+{
+	fput((struct file *) ptr);
+}
+
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* files_struct object */
+	{
+		.obj_name = "FILE_TABLE",
+		.obj_type = CKPT_OBJ_FILE_TABLE,
+		.ref_drop = obj_file_table_drop,
+		.ref_grab = obj_file_table_grab,
+		.ref_users = obj_file_table_users,
+		.checkpoint = checkpoint_file_table,
+	},
+	/* file object */
+	{
+		.obj_name = "FILE",
+		.obj_type = CKPT_OBJ_FILE,
+		.ref_drop = obj_file_drop,
+		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
+		.checkpoint = checkpoint_file,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index ef394a5..adc34a2 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int files_objref;
+	int ret;
+
+	files_objref = checkpoint_obj_file_table(ctx, t);
+	ckpt_debug("files: objref %d\n", files_objref);
+	if (files_objref < 0) {
+		ckpt_err(ctx, files_objref, "%(T)files_struct\n");
+		return files_objref;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+	h->files_objref = files_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 /* dump the task_struct of a given task */
 int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -240,6 +263,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs %d\n", ret);
  out:
 	ctx->tsk = NULL;
 	return ret;
@@ -247,7 +274,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
-	return 0;
+	int ret;
+
+	ret = ckpt_collect_file_table(ctx, t);
+
+	return ret;
 }
 
 /***********************************************************************
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 926c937..30b8004 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -206,12 +206,16 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->kflags & CKPT_CTX_RESTART)
 		restore_debug_free(ctx);
 
+	if (ctx->files_deferq)
+		deferqueue_destroy(ctx->files_deferq);
+
 	if (ctx->file)
 		fput(ctx->file);
 	if (ctx->logfile)
 		fput(ctx->logfile);
 
 	ckpt_obj_hash_free(ctx);
+	path_put(&ctx->root_fs_path);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -270,6 +274,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
+	ctx->files_deferq = deferqueue_create();
+	if (!ctx->files_deferq)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/fs/locks.c b/fs/locks.c
index a8794f2..721481a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1994,6 +1994,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner)
 
 EXPORT_SYMBOL(locks_remove_posix);
 
+int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct file_lock **inode_fl;
+	int ret = -EEXIST;
+
+	lock_kernel();
+	for_each_lock(inode, inode_fl) {
+		struct file_lock *fl = *inode_fl;
+		/*
+		 * We could use posix_same_owner() along with a 'fake'
+		 * file_lock.  But, the fake file will never have the
+		 * same fl_lmops as the fl that we are looking for and
+		 * posix_same_owner() would just fall back to this
+		 * check anyway.
+		 */
+		if (IS_POSIX(fl)) {
+			if (fl->fl_owner == owner) {
+				ret = 0;
+				break;
+			}
+		} else if (IS_FLOCK(fl) || IS_LEASE(fl)) {
+			if (fl->fl_file == filp) {
+				ret = 0;
+				break;
+			}
+		} else {
+			WARN(1, "unknown file lock type, fl_flags: %x",
+				fl->fl_flags);
+		}
+	}
+	unlock_kernel();
+	return ret;
+}
+
 /*
  * This function is called on the last close of an open file.
  */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 50ce8f9..d74a890 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -80,6 +80,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
@@ -156,6 +159,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -166,6 +184,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
+#define CKPT_DFILE	0x10		/* files and filesystem */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cdca9e4..3222545 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -71,6 +71,8 @@ enum {
 #define CKPT_HDR_TREE CKPT_HDR_TREE
 	CKPT_HDR_TASK,
 #define CKPT_HDR_TASK CKPT_HDR_TASK
+	CKPT_HDR_TASK_OBJS,
+#define CKPT_HDR_TASK_OBJS CKPT_HDR_TASK_OBJS
 	CKPT_HDR_RESTART_BLOCK,
 #define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK
 	CKPT_HDR_THREAD,
@@ -80,6 +82,15 @@ enum {
 
 	/* 201-299: reserved for arch-dependent */
 
+	CKPT_HDR_FILE_TABLE = 301,
+#define CKPT_HDR_FILE_TABLE CKPT_HDR_FILE_TABLE
+	CKPT_HDR_FILE_DESC,
+#define CKPT_HDR_FILE_DESC CKPT_HDR_FILE_DESC
+	CKPT_HDR_FILE_NAME,
+#define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME
+	CKPT_HDR_FILE,
+#define CKPT_HDR_FILE CKPT_HDR_FILE
+
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
 
@@ -106,6 +117,10 @@ struct ckpt_hdr_objref {
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 #define CKPT_OBJ_IGNORE CKPT_OBJ_IGNORE
+	CKPT_OBJ_FILE_TABLE,
+#define CKPT_OBJ_FILE_TABLE CKPT_OBJ_FILE_TABLE
+	CKPT_OBJ_FILE,
+#define CKPT_OBJ_FILE CKPT_OBJ_FILE
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -188,6 +203,12 @@ struct ckpt_hdr_task {
 	__u64 clear_child_tid;
 } __attribute__((aligned(8)));
 
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+} __attribute__((aligned(8)));
+
 /* restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
@@ -220,4 +241,42 @@ enum restart_block_type {
 #define CKPT_RESTART_BLOCK_FUTEX CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+#define CKPT_FILE_IGNORE CKPT_FILE_IGNORE
+	CKPT_FILE_GENERIC,
+#define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
+	CKPT_FILE_MAX
+#define CKPT_FILE_MAX CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 90bbb16..aae6755 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -14,6 +14,8 @@
 
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/list.h>
+#include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
 #include <linux/wait.h>
@@ -40,6 +42,9 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *files_deferq;	/* deferred file-table work */
+
+	struct path root_fs_path;     /* container root (FIXME) */
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 65ebec5..7902a51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1120,6 +1120,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_flock(struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
+extern int find_locks_with_owner(struct file *filp, fl_owner_t owner);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
 extern int posix_lock_file_wait(struct file *, struct file_lock *);
 extern int posix_unblock_lock(struct file *, struct file_lock *);
@@ -1188,6 +1189,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
 	return;
 }
 
+static inline int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	return -ENOENT;
+}
+
 static inline void locks_remove_flock(struct file *filp)
 {
 	return;
@@ -2318,7 +2324,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
 #define generic_file_checkpoint NULL
+#endif
 
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
-- 
1.6.3.3


* [C/R v20][PATCH 38/96] c/r: dump open file descriptors
  2010-03-17 16:08                                                                         ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
@ 2010-03-17 16:08                                                                             ` Oren Laadan
  2010-03-17 16:08                                                                             ` Oren Laadan
  1 sibling, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar,
	containers, Oren Laadan

Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, they are assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).
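
As an illustration of the sharing described above, here is a minimal
userspace sketch. It is not the kernel's objhash code: the names
obj_table and obj_lookup_or_add(), the fixed-size linear table, and the
choice to start objrefs at 1 are all assumptions made for the example.
The first time a pointer is seen it gets a new objref (and the caller
would dump its state); later lookups only return the existing objref.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 64	/* hypothetical capacity, for the sketch only */

struct obj_table {
	const void *ptr[MAX_OBJS];	/* registered object pointers */
	int count;			/* number of registered objects */
};

/*
 * Return the objref for @ptr, registering it on first sight.
 * *first is set to 1 if the object was just added (its state should
 * be dumped now) and to 0 if it was already known (reference only).
 * Returns -1 if the table is full.
 */
static int obj_lookup_or_add(struct obj_table *t, const void *ptr, int *first)
{
	int i;

	assert(t != NULL && first != NULL);

	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*first = 0;
			return i + 1;	/* objrefs start at 1 in this sketch */
		}
	}
	if (t->count == MAX_OBJS)
		return -1;

	t->ptr[t->count++] = ptr;
	*first = 1;
	return t->count;
}
```

Two file descriptors that refer to the same object (for example after
dup()) would therefore yield the same objref, with only the first
lookup reporting first == 1.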

Also provide generic_file_checkpoint() and generic_restore_file(),
which are suitable for regular files and directories. Unlinked files
and directories are not yet supported.
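
The generic handler records each file's pathname relative to the
container root (ctx->root_fs_path) and fails if the file lies outside
that root. The following userspace sketch shows the same idea on plain
strings. It is only an analogy: the kernel code works on struct path
via __d_path(), and path_under_root() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the part of @path that lies under @root (always starting
 * with '/'), or NULL if @path is not inside @root. Both arguments are
 * assumed to be absolute, normalized pathnames.
 */
static const char *path_under_root(const char *path, const char *root)
{
	size_t rlen;

	assert(path != NULL && root != NULL);

	/* ignore trailing slashes, so "/" becomes the empty prefix */
	rlen = strlen(root);
	while (rlen > 0 && root[rlen - 1] == '/')
		rlen--;

	if (strncmp(path, root, rlen) != 0)
		return NULL;		/* different prefix: outside root */
	if (path[rlen] == '\0')
		return "/";		/* path is the root itself */
	if (path[rlen] != '/')
		return NULL;		/* e.g. "/srv/ct10" vs "/srv/ct1" */
	return path + rlen;
}
```

With root "/srv/ct1", the path "/srv/ct1/etc/passwd" maps to
"/etc/passwd", while "/srv/ct10/x" is rejected because the match does
not end on a component boundary.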

Changelog[v19]:
  - Fix false negative of test for unlinked files at checkpoint
Changelog[v19-rc3]:
  - [Serge Hallyn] Rename fs_mnt to root_fs_path
  - [Dave Hansen] Error out on file locks and leases
  - [Serge Hallyn] Refuse checkpoint of file with f_owner
Changelog[v19-rc1]:
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
  - Add a few more ckpt_write_err()s
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Introduce ckpt_collect_file() that also uses file->collect method
  - In collect_file_table() use retval from ckpt_obj_collect() to
    test for first-time-object
Changelog[v17]:
  - Only collect sub-objects of files_struct once
  - Better file error debugging
  - Use (new) d_unlinked()
Changelog[v16]:
  - Fix compile warning in checkpoint_bad()
Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  ckpt_write_files() => checkpoint_fd_table()
  - Rename:  ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)
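
For reference, the grow-and-resume scan that scan_fds() in this patch
performs can be sketched in userspace as follows. This is a simplified
illustration under assumptions: scan_open(), the byte map standing in
for the fd table, and the initial size of 4 are all hypothetical, and
the sketch frees the partial array if growing it fails.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Collect every index i in [0, max) with open_map[i] != 0 into a
 * heap-allocated array returned via *out. Returns the number of
 * indices found, or -1 on allocation failure. The caller frees *out.
 */
static int scan_open(const unsigned char *open_map, int max, int **out)
{
	int *fds = NULL, *tmp;
	int i = 0, n = 0;
	int tot = 4;		/* initial guess, doubled when exhausted */

	assert(open_map != NULL && out != NULL);
retry:
	tmp = realloc(fds, tot * sizeof(*fds));
	if (!tmp) {
		free(fds);
		return -1;
	}
	fds = tmp;

	/* 'i' is not reset: after growing, resume where we left off */
	for (; i < max; i++) {
		if (!open_map[i])
			continue;
		if (n == tot) {
			tot *= 2;
			goto retry;
		}
		fds[n++] = i;
	}

	*out = fds;
	return n;
}
```

Resuming from the same index is only safe if the table cannot change
during the scan, which is why the patch assumes that all tasks sharing
the file table are frozen.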

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   11 +
 checkpoint/files.c               |  444 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   52 +++++
 checkpoint/process.c             |   33 +++-
 checkpoint/sys.c                 |    8 +
 fs/locks.c                       |   35 +++
 include/linux/checkpoint.h       |   19 ++
 include/linux/checkpoint_hdr.h   |   59 +++++
 include/linux/checkpoint_types.h |    5 +
 include/linux/fs.h               |   10 +
 11 files changed, 677 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/files.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	objhash.o \
 	checkpoint.o \
 	restart.o \
-	process.o
+	process.o \
+	files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c016a2d..2bc2495 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -18,6 +18,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fs_struct.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -490,6 +491,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
+	struct fs_struct *fs;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
+	/* root vfs (FIX: WILL CHANGE with mnt-ns etc) */
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
+	read_lock(&fs->lock);
+	ctx->root_fs_path = fs->root;
+	path_get(&ctx->root_fs_path);
+	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
+
 	return 0;
 }
 
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..7a57b24
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,444 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * ckpt_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *len);
+	spin_unlock(&dcache_lock);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = ckpt_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		ret = ckpt_write_obj_type(ctx, fname, flen,
+					  CKPT_HDR_FILE_NAME);
+	} else {
+		ret = PTR_ERR(fname);
+		ckpt_err(ctx, ret, "%(T)%(S)Obtain filename\n",
+			 path->dentry->d_name.name);
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	ckpt_debug("file %s\n",
+		   file->f_dentry->d_name.name);
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	if (d_unlinked(file->f_dentry)) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
+			 file);
+		return -EBADF;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+	int ret;
+
+	if (!file->f_op || !file->f_op->checkpoint) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
+			       file, file->f_op);
+		return -EBADF;
+	}
+
+	ret = file->f_op->checkpoint(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
+	return ret;
+}
+
+/**
+ * checkpoint_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls checkpoint_file() to dump the file pointer too.
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+	pid_t pid;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	ret = -EBADF;
+	if (!file) {
+		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
+		goto out;
+	}
+
+	ret = find_locks_with_owner(file, files);
+	/*
+	 * find_locks_with_owner() returns an error when there
+	 * are no locks found, so we *want* it to return an error
+	 * code.  Its success means we have to fail the checkpoint.
+	 */
+	if (!ret) {
+		ret = -EBADF;
+		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
+		goto out;
+	}
+
+	/*
+	 * TODO: Implement c/r of fowner and f_sigio.  Should be
+	 * trivial, but for now we just refuse its checkpoint
+	 */
+	pid = f_getown(file);
+	if (pid) {
+		ret = -EBUSY;
+		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+out:
+	ckpt_hdr_put(ctx, h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+static int do_checkpoint_file_table(struct ckpt_ctx *ctx,
+				    struct files_struct *files)
+{
+	struct ckpt_hdr_file_table *h;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (!h)
+		return -ENOMEM;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ckpt_hdr_put(ctx, h);
+		return nfds;
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	kfree(fdtable);
+	return ret;
+}
+
+/* checkpoint callback for file table */
+int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_file_table(ctx, (struct files_struct *) ptr);
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int objref;
+
+	files = get_files_struct(t);
+	if (!files)
+		return -EBUSY;
+	objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+	put_files_struct(files);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file)
+{
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+	if (ret <= 0)
+		return ret;
+	/* if first time for this file (ret > 0), invoke ->collect() */
+	if (file->f_op->collect)
+		ret = file->f_op->collect(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)File collect\n", file);
+	return ret;
+}
+
+static int collect_file_desc(struct ckpt_ctx *ctx,
+			     struct files_struct *files, int fd)
+{
+	struct fdtable *fdt;
+	struct file *file;
+	int ret;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file)
+		get_file(file);
+	rcu_read_unlock();
+
+	if (!file) {
+		ckpt_err(ctx, -EBUSY, "%(T)%(P)File removed\n", file);
+		return -EBUSY;
+	}
+
+	ret = ckpt_collect_file(ctx, file);
+	fput(file);
+
+	return ret;
+}
+
+static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files)
+{
+	int *fdtable;
+	int nfds, n;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this file table (ret > 0), proceed inside */
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+
+	for (n = 0; n < nfds; n++) {
+		ret = collect_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+	kfree(fdtable);
+	return ret;
+}
+
+int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int ret;
+
+	files = get_files_struct(t);
+	if (!files) {
+		ckpt_err(ctx, -EBUSY, "%(T)files_struct missing\n");
+		return -EBUSY;
+	}
+	ret = collect_file_table(ctx, files);
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 22b1601..f25d130 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,8 @@
 
 #include <linux/kernel.h>
 #include <linux/hash.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_file_table_grab(void *ptr)
+{
+	atomic_inc(&((struct files_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_file_table_drop(void *ptr, int lastref)
+{
+	put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_table_users(void *ptr)
+{
+	return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr, int lastref)
+{
+	fput((struct file *) ptr);
+}
+
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* files_struct object */
+	{
+		.obj_name = "FILE_TABLE",
+		.obj_type = CKPT_OBJ_FILE_TABLE,
+		.ref_drop = obj_file_table_drop,
+		.ref_grab = obj_file_table_grab,
+		.ref_users = obj_file_table_users,
+		.checkpoint = checkpoint_file_table,
+	},
+	/* file object */
+	{
+		.obj_name = "FILE",
+		.obj_type = CKPT_OBJ_FILE,
+		.ref_drop = obj_file_drop,
+		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
+		.checkpoint = checkpoint_file,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index ef394a5..adc34a2 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int files_objref;
+	int ret;
+
+	files_objref = checkpoint_obj_file_table(ctx, t);
+	ckpt_debug("files: objref %d\n", files_objref);
+	if (files_objref < 0) {
+		ckpt_err(ctx, files_objref, "%(T)files_struct\n");
+		return files_objref;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+	h->files_objref = files_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 /* dump the task_struct of a given task */
 int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -240,6 +263,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs %d\n", ret);
  out:
 	ctx->tsk = NULL;
 	return ret;
@@ -247,7 +274,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
-	return 0;
+	int ret;
+
+	ret = ckpt_collect_file_table(ctx, t);
+
+	return ret;
 }
 
 /***********************************************************************
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 926c937..30b8004 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -206,12 +206,16 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->kflags & CKPT_CTX_RESTART)
 		restore_debug_free(ctx);
 
+	if (ctx->files_deferq)
+		deferqueue_destroy(ctx->files_deferq);
+
 	if (ctx->file)
 		fput(ctx->file);
 	if (ctx->logfile)
 		fput(ctx->logfile);
 
 	ckpt_obj_hash_free(ctx);
+	path_put(&ctx->root_fs_path);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -270,6 +274,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
+	ctx->files_deferq = deferqueue_create();
+	if (!ctx->files_deferq)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/fs/locks.c b/fs/locks.c
index a8794f2..721481a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1994,6 +1994,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner)
 
 EXPORT_SYMBOL(locks_remove_posix);
 
+int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct file_lock **inode_fl;
+	int ret = -EEXIST;
+
+	lock_kernel();
+	for_each_lock(inode, inode_fl) {
+		struct file_lock *fl = *inode_fl;
+		/*
+		 * We could use posix_same_owner() along with a 'fake'
+		 * file_lock.  But, the fake file will never have the
+		 * same fl_lmops as the fl that we are looking for and
+		 * posix_same_owner() would just fall back to this
+		 * check anyway.
+		 */
+		if (IS_POSIX(fl)) {
+			if (fl->fl_owner == owner) {
+				ret = 0;
+				break;
+			}
+		} else if (IS_FLOCK(fl) || IS_LEASE(fl)) {
+			if (fl->fl_file == filp) {
+				ret = 0;
+				break;
+			}
+		} else {
+			WARN(1, "unknown file lock type, fl_flags: %x",
+				fl->fl_flags);
+		}
+	}
+	unlock_kernel();
+	return ret;
+}
+
 /*
  * This function is called on the last close of an open file.
  */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 50ce8f9..d74a890 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -80,6 +80,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
@@ -156,6 +159,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -166,6 +184,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
+#define CKPT_DFILE	0x10		/* files and filesystem */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cdca9e4..3222545 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -71,6 +71,8 @@ enum {
 #define CKPT_HDR_TREE CKPT_HDR_TREE
 	CKPT_HDR_TASK,
 #define CKPT_HDR_TASK CKPT_HDR_TASK
+	CKPT_HDR_TASK_OBJS,
+#define CKPT_HDR_TASK_OBJS CKPT_HDR_TASK_OBJS
 	CKPT_HDR_RESTART_BLOCK,
 #define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK
 	CKPT_HDR_THREAD,
@@ -80,6 +82,15 @@ enum {
 
 	/* 201-299: reserved for arch-dependent */
 
+	CKPT_HDR_FILE_TABLE = 301,
+#define CKPT_HDR_FILE_TABLE CKPT_HDR_FILE_TABLE
+	CKPT_HDR_FILE_DESC,
+#define CKPT_HDR_FILE_DESC CKPT_HDR_FILE_DESC
+	CKPT_HDR_FILE_NAME,
+#define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME
+	CKPT_HDR_FILE,
+#define CKPT_HDR_FILE CKPT_HDR_FILE
+
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
 
@@ -106,6 +117,10 @@ struct ckpt_hdr_objref {
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 #define CKPT_OBJ_IGNORE CKPT_OBJ_IGNORE
+	CKPT_OBJ_FILE_TABLE,
+#define CKPT_OBJ_FILE_TABLE CKPT_OBJ_FILE_TABLE
+	CKPT_OBJ_FILE,
+#define CKPT_OBJ_FILE CKPT_OBJ_FILE
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -188,6 +203,12 @@ struct ckpt_hdr_task {
 	__u64 clear_child_tid;
 } __attribute__((aligned(8)));
 
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+} __attribute__((aligned(8)));
+
 /* restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
@@ -220,4 +241,42 @@ enum restart_block_type {
 #define CKPT_RESTART_BLOCK_FUTEX CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+#define CKPT_FILE_IGNORE CKPT_FILE_IGNORE
+	CKPT_FILE_GENERIC,
+#define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
+	CKPT_FILE_MAX
+#define CKPT_FILE_MAX CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 90bbb16..aae6755 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -14,6 +14,8 @@
 
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/list.h>
+#include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
 #include <linux/wait.h>
@@ -40,6 +42,9 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *files_deferq;	/* deferred file-table work */
+
+	struct path root_fs_path;     /* container root (FIXME) */
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 65ebec5..7902a51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1120,6 +1120,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_flock(struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
+extern int find_locks_with_owner(struct file *filp, fl_owner_t owner);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
 extern int posix_lock_file_wait(struct file *, struct file_lock *);
 extern int posix_unblock_lock(struct file *, struct file_lock *);
@@ -1188,6 +1189,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
 	return;
 }
 
+static inline int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	return -ENOENT;
+}
+
 static inline void locks_remove_flock(struct file *filp)
 {
 	return;
@@ -2318,7 +2324,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
 #define generic_file_checkpoint NULL
+#endif
 
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
-- 
1.6.3.3



* [C/R v20][PATCH 38/96] c/r: dump open file descriptors
@ 2010-03-17 16:08                                                                             ` Oren Laadan
  0 siblings, 0 replies; 88+ messages in thread
From: Oren Laadan @ 2010-03-17 16:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, linux-api, Serge Hallyn, Ingo Molnar,
	containers, Oren Laadan

Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, each one is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).
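
To illustrate the objref scheme described above, here is a minimal
userspace sketch (not code from this series). A small linear table
stands in for the kernel's object hash, and the names 'struct objtab'
and objtab_lookup_add() are hypothetical. It only shows the idea that
a shared pointer is dumped once and referenced by number afterwards.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 64

/* Hypothetical stand-in for the checkpoint object hash. */
struct objtab {
	const void *ptr[MAX_OBJS];
	int count;
};

/*
 * Return the objref for @ptr, adding it to the table if needed.
 * *first is set to 1 when @ptr was not seen before (this is where a
 * caller would dump the object's state), and to 0 otherwise.
 * Objrefs start at 1; returns -1 if the table is full.
 */
static int objtab_lookup_add(struct objtab *t, const void *ptr, int *first)
{
	int i;

	assert(t != NULL && first != NULL);

	/* linear scan for brevity; a real implementation would hash */
	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (t->count == MAX_OBJS)
		return -1;
	t->ptr[t->count++] = ptr;
	*first = 1;
	return t->count;
}
```

With three descriptors where fd 2 is a dup() of fd 0, successive
lookups return objrefs 1, 2 and 1, and only the first two report a
first-time object whose state needs to be written.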

Also provide generic_file_checkpoint() and generic_restore_file(),
which are good for normal files and directories. They do not yet
support unlinked files or directories.

Changelog[v19]:
  - Fix false negative of test for unlinked files at checkpoint
Changelog[v19-rc3]:
  - [Serge Hallyn] Rename fs_mnt to root_fs_path
  - [Dave Hansen] Error out on file locks and leases
  - [Serge Hallyn] Refuse checkpoint of file with f_owner
Changelog[v19-rc1]:
  - [Matt Helsley] Add cpp definitions for enums
Changelog[v18]:
  - Add a few more ckpt_write_err()s
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Introduce ckpt_collect_file() that also uses file->collect method
  - In collect_file_table() use retval from ckpt_obj_collect() to
    test for first-time-object
Changelog[v17]:
  - Only collect sub-objects of files_struct once
  - Better file error debugging
  - Use (new) d_unlinked()
Changelog[v16]:
  - Fix compile warning in checkpoint_bad()
Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  ckpt_write_files() => checkpoint_fd_table()
  - Rename:  ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   11 +
 checkpoint/files.c               |  444 ++++++++++++++++++++++++++++++++++++++
 checkpoint/objhash.c             |   52 +++++
 checkpoint/process.c             |   33 +++-
 checkpoint/sys.c                 |    8 +
 fs/locks.c                       |   35 +++
 include/linux/checkpoint.h       |   19 ++
 include/linux/checkpoint_hdr.h   |   59 +++++
 include/linux/checkpoint_types.h |    5 +
 include/linux/fs.h               |   10 +
 11 files changed, 677 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/files.c
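
Note (not part of the patch): the fd scan in checkpoint/files.c grows
its result array by doubling and resumes the scan where it stopped.
Below is a userspace sketch of that pattern, assuming a plain byte
array in place of the kernel fdtable. The name scan_open_slots() is
hypothetical, and the sketch keeps the old pointer in a temporary so
that it can be freed if realloc() fails.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Collect the indices of all non-zero slots in @used[0..max) into a
 * heap array returned through @out.  Returns the number of indices
 * found, or -1 on allocation failure.  The caller frees *out.
 */
static int scan_open_slots(const unsigned char *used, int max, int **out)
{
	int *idx = NULL, *tmp;
	int i = 0, n = 0;
	int tot = 4;	/* deliberately small initial guess */

	assert(used != NULL && out != NULL);
retry:
	tmp = realloc(idx, tot * sizeof(*idx));
	if (!tmp) {
		free(idx);	/* realloc() leaves the old block intact */
		return -1;
	}
	idx = tmp;

	for (; i < max; i++) {
		if (!used[i])
			continue;
		if (n == tot) {
			tot *= 2;
			goto retry;	/* resume at slot i, larger array */
		}
		idx[n++] = i;
	}

	*out = idx;
	return n;
}
```

Resuming at the same index is only safe if the table cannot change
between passes, which is why the kernel code assumes that all tasks
sharing the file table are frozen.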

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
 	objhash.o \
 	checkpoint.o \
 	restart.o \
-	process.o
+	process.o \
+	files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c016a2d..2bc2495 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -18,6 +18,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fs_struct.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -490,6 +491,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task;
 	struct nsproxy *nsproxy;
+	struct fs_struct *fs;
 
 	/*
 	 * No need for explicit cleanup here, because if an error
@@ -531,6 +533,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
 		return -EINVAL;  /* cleanup by ckpt_ctx_free() */
 	}
 
+	/* root vfs (FIX: WILL CHANGE with mnt-ns etc) */
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
+	read_lock(&fs->lock);
+	ctx->root_fs_path = fs->root;
+	path_get(&ctx->root_fs_path);
+	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
+
 	return 0;
 }
 
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..7a57b24
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,444 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/deferqueue.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * ckpt_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+char *ckpt_fill_fname(struct path *path, struct path *root, char *buf, int *len)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *len);
+	spin_unlock(&dcache_lock);
+	if (IS_ERR(fname))
+		return fname;
+	*len = (buf + (*len) - fname);
+	/*
+	 * FIX: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+	char *buf, *fname;
+	int ret, flen;
+
+	/*
+	 * FIXME: we can optimize and save memory (and storage) if we
+	 * share strings (through objhash) and reference them instead
+	 */
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = ckpt_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		ret = ckpt_write_obj_type(ctx, fname, flen,
+					  CKPT_HDR_FILE_NAME);
+	} else {
+		ret = PTR_ERR(fname);
+		ckpt_err(ctx, ret, "%(T)%(S)Obtain filename\n",
+			 path->dentry->d_name.name);
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i = 0, n = 0;
+	int tot = CKPT_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely proceed after krealloc() from where we
+	 * left off. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (/**/; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+
+	*fdtable = fds;
+	return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+			   struct ckpt_hdr_file *h)
+{
+	h->f_flags = file->f_flags;
+	h->f_mode = file->f_mode;
+	h->f_pos = file->f_pos;
+	h->f_version = file->f_version;
+
+	ckpt_debug("file %s flags %#x\n", file->f_dentry->d_name.name,
+		h->f_flags);
+
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+	struct ckpt_hdr_file_generic *h;
+	int ret;
+
+	/*
+	 * FIXME: when we'll add support for unlinked files/dirs, we'll
+	 * need to distinguish between unlinked files and unlinked dirs.
+	 */
+	if (d_unlinked(file->f_dentry)) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)Unlinked files unsupported\n",
+			 file);
+		return -EBADF;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+	if (!h)
+		return -ENOMEM;
+
+	h->common.f_type = CKPT_FILE_GENERIC;
+
+	ret = checkpoint_file_common(ctx, file, &h->common);
+	if (ret < 0)
+		goto out;
+	ret = ckpt_write_obj(ctx, &h->common.h);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
+ out:
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+	struct file *file = (struct file *) ptr;
+	int ret;
+
+	if (!file->f_op || !file->f_op->checkpoint) {
+		ckpt_err(ctx, -EBADF, "%(T)%(P)%(V)f_op lacks checkpoint\n",
+			       file, file->f_op);
+		return -EBADF;
+	}
+
+	ret = file->f_op->checkpoint(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)file checkpoint failed\n", file);
+	return ret;
+}
+
+/**
+ * checkpoint_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls checkpoint_file() to dump the file pointer too.
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+				struct files_struct *files, int fd)
+{
+	struct ckpt_hdr_file_desc *h;
+	struct file *file = NULL;
+	struct fdtable *fdt;
+	int objref, ret;
+	int coe = 0;	/* avoid gcc warning */
+	pid_t pid;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+	if (!h)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	ret = -EBADF;
+	if (!file) {
+		ckpt_err(ctx, ret, "%(T)fd %d gone?\n", fd);
+		goto out;
+	}
+
+	ret = find_locks_with_owner(file, files);
+	/*
+	 * find_locks_with_owner() returns an error when there
+	 * are no locks found, so we *want* it to return an error
+	 * code.  Its success means we have to fail the checkpoint.
+	 */
+	if (!ret) {
+		ret = -EBADF;
+		ckpt_err(ctx, ret, "%(T)fd %d has file lock or lease\n", fd);
+		goto out;
+	}
+
+	/*
+	 * TODO: Implement c/r of fowner and f_sigio.  Should be
+	 * trivial, but for now we just refuse its checkpoint
+	 */
+	pid = f_getown(file);
+	if (pid) {
+		ret = -EBUSY;
+		ckpt_err(ctx, ret, "%(T)fd %d has an owner (%d)\n", fd, pid);
+		goto out;
+	}
+
+	/*
+	 * if seen first time, this will add 'file' to the objhash, keep
+	 * a reference to it, dump its state while at it.
+	 */
+	objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+	ckpt_debug("fd %d objref %d file %p coe %d\n", fd, objref, file, coe);
+	if (objref < 0) {
+		ret = objref;
+		goto out;
+	}
+
+	h->fd_objref = objref;
+	h->fd_descriptor = fd;
+	h->fd_close_on_exec = coe;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+out:
+	ckpt_hdr_put(ctx, h);
+	if (file)
+		fput(file);
+	return ret;
+}
+
+static int do_checkpoint_file_table(struct ckpt_ctx *ctx,
+				    struct files_struct *files)
+{
+	struct ckpt_hdr_file_table *h;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+	if (!h) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	h->fdt_nfds = nfds;
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ckpt_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = deferqueue_run(ctx->files_deferq);
+	ckpt_debug("files_deferq ran %d entries\n", ret);
+	if (ret > 0)
+		ret = 0;
+ out:
+	kfree(fdtable);
+	return ret;
+}
+
+/* checkpoint callback for file table */
+int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_file_table(ctx, (struct files_struct *) ptr);
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int objref;
+
+	files = get_files_struct(t);
+	if (!files)
+		return -EBUSY;
+	objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+	put_files_struct(files);
+
+	return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file)
+{
+	int ret;
+
+	ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+	if (ret <= 0)
+		return ret;
+	/* if first time for this file (ret > 0), invoke ->collect() */
+	if (file->f_op->collect)
+		ret = file->f_op->collect(ctx, file);
+	if (ret < 0)
+		ckpt_err(ctx, ret, "%(T)%(P)File collect\n", file);
+	return ret;
+}
+
+static int collect_file_desc(struct ckpt_ctx *ctx,
+			     struct files_struct *files, int fd)
+{
+	struct fdtable *fdt;
+	struct file *file;
+	int ret;
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file)
+		get_file(file);
+	rcu_read_unlock();
+
+	if (!file) {
+		ckpt_err(ctx, -EBUSY, "%(T)%(P)File removed\n", file);
+		return -EBUSY;
+	}
+
+	ret = ckpt_collect_file(ctx, file);
+	fput(file);
+
+	return ret;
+}
+
+static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files)
+{
+	int *fdtable;
+	int nfds, n;
+	int ret;
+
+	/* if already exists (ret == 0), nothing to do */
+	ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE);
+	if (ret <= 0)
+		return ret;
+
+	/* if first time for this file table (ret > 0), proceed inside */
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+
+	for (n = 0; n < nfds; n++) {
+		ret = collect_file_desc(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+	kfree(fdtable);
+	return ret;
+}
+
+int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct files_struct *files;
+	int ret;
+
+	files = get_files_struct(t);
+	if (!files) {
+		ckpt_err(ctx, -EBUSY, "%(T)files_struct missing\n");
+		return -EBUSY;
+	}
+	ret = collect_file_table(ctx, files);
+	put_files_struct(files);
+
+	return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 22b1601..f25d130 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,8 @@
 
 #include <linux/kernel.h>
 #include <linux/hash.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -62,6 +64,38 @@ static int obj_no_grab(void *ptr)
 	return 0;
 }
 
+static int obj_file_table_grab(void *ptr)
+{
+	atomic_inc(&((struct files_struct *) ptr)->count);
+	return 0;
+}
+
+static void obj_file_table_drop(void *ptr, int lastref)
+{
+	put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_table_users(void *ptr)
+{
+	return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
+static int obj_file_grab(void *ptr)
+{
+	get_file((struct file *) ptr);
+	return 0;
+}
+
+static void obj_file_drop(void *ptr, int lastref)
+{
+	fput((struct file *) ptr);
+}
+
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -70,6 +104,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.ref_drop = obj_no_drop,
 		.ref_grab = obj_no_grab,
 	},
+	/* files_struct object */
+	{
+		.obj_name = "FILE_TABLE",
+		.obj_type = CKPT_OBJ_FILE_TABLE,
+		.ref_drop = obj_file_table_drop,
+		.ref_grab = obj_file_table_grab,
+		.ref_users = obj_file_table_users,
+		.checkpoint = checkpoint_file_table,
+	},
+	/* file object */
+	{
+		.obj_name = "FILE",
+		.obj_type = CKPT_OBJ_FILE,
+		.ref_drop = obj_file_drop,
+		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
+		.checkpoint = checkpoint_file,
+	},
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index ef394a5..adc34a2 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -104,6 +104,29 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_objs *h;
+	int files_objref;
+	int ret;
+
+	files_objref = checkpoint_obj_file_table(ctx, t);
+	ckpt_debug("files: objref %d\n", files_objref);
+	if (files_objref < 0) {
+		ckpt_err(ctx, files_objref, "%(T)files_struct\n");
+		return files_objref;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+	if (!h)
+		return -ENOMEM;
+	h->files_objref = files_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
 /* dump the task_struct of a given task */
 int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
 {
@@ -240,6 +263,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		goto out;
 	ret = checkpoint_cpu(ctx, t);
 	ckpt_debug("cpu %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = checkpoint_task_objs(ctx, t);
+	ckpt_debug("objs %d\n", ret);
  out:
 	ctx->tsk = NULL;
 	return ret;
@@ -247,7 +274,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 
 int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
-	return 0;
+	int ret;
+
+	ret = ckpt_collect_file_table(ctx, t);
+
+	return ret;
 }
 
 /***********************************************************************
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 926c937..30b8004 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -206,12 +206,16 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 	if (ctx->kflags & CKPT_CTX_RESTART)
 		restore_debug_free(ctx);
 
+	if (ctx->files_deferq)
+		deferqueue_destroy(ctx->files_deferq);
+
 	if (ctx->file)
 		fput(ctx->file);
 	if (ctx->logfile)
 		fput(ctx->logfile);
 
 	ckpt_obj_hash_free(ctx);
+	path_put(&ctx->root_fs_path);
 
 	if (ctx->tasks_arr)
 		task_arr_free(ctx);
@@ -270,6 +274,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	if (ckpt_obj_hash_alloc(ctx) < 0)
 		goto err;
 
+	ctx->files_deferq = deferqueue_create();
+	if (!ctx->files_deferq)
+		goto err;
+
 	atomic_inc(&ctx->refcount);
 	return ctx;
  err:
diff --git a/fs/locks.c b/fs/locks.c
index a8794f2..721481a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1994,6 +1994,41 @@ void locks_remove_posix(struct file *filp, fl_owner_t owner)
 
 EXPORT_SYMBOL(locks_remove_posix);
 
+int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct file_lock **inode_fl;
+	int ret = -EEXIST;
+
+	lock_kernel();
+	for_each_lock(inode, inode_fl) {
+		struct file_lock *fl = *inode_fl;
+		/*
+		 * We could use posix_same_owner() along with a 'fake'
+		 * file_lock.  But, the fake file will never have the
+		 * same fl_lmops as the fl that we are looking for and
+		 * posix_same_owner() would just fall back to this
+		 * check anyway.
+		 */
+		if (IS_POSIX(fl)) {
+			if (fl->fl_owner == owner) {
+				ret = 0;
+				break;
+			}
+		} else if (IS_FLOCK(fl) || IS_LEASE(fl)) {
+			if (fl->fl_file == filp) {
+				ret = 0;
+				break;
+			}
+		} else {
+			WARN(1, "unknown file lock type, fl_flags: %x",
+				fl->fl_flags);
+		}
+	}
+	unlock_kernel();
+	return ret;
+}
+
 /*
  * This function is called on the last close of an open file.
  */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 50ce8f9..d74a890 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -80,6 +80,9 @@ extern int ckpt_read_payload(struct ckpt_ctx *ctx,
 extern char *ckpt_read_string(struct ckpt_ctx *ctx, int max);
 extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);
 
+extern char *ckpt_fill_fname(struct path *path, struct path *root,
+			     char *buf, int *len);
+
 /* ckpt kflags */
 #define ckpt_set_ctx_kflag(__ctx, __kflag)  \
 	set_bit(__kflag##_BIT, &(__ctx)->kflags)
@@ -156,6 +159,21 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
 				    struct task_struct *t);
 extern int restore_restart_block(struct ckpt_ctx *ctx);
 
+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+				     struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+			    struct path *path, struct path *root);
+extern int ckpt_collect_file(struct ckpt_ctx *ctx, struct file *file);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+				  struct ckpt_hdr_file *h);
+
 static inline int ckpt_validate_errno(int errno)
 {
 	return (errno >= 0) && (errno < MAX_ERRNO);
@@ -166,6 +184,7 @@ static inline int ckpt_validate_errno(int errno)
 #define CKPT_DSYS	0x2		/* generic (system) */
 #define CKPT_DRW	0x4		/* image read/write */
 #define CKPT_DOBJ	0x8		/* shared objects */
+#define CKPT_DFILE	0x10		/* files and filesystem */
 
 #define CKPT_DDEFAULT	0xffff		/* default debug level */
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cdca9e4..3222545 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -71,6 +71,8 @@ enum {
 #define CKPT_HDR_TREE CKPT_HDR_TREE
 	CKPT_HDR_TASK,
 #define CKPT_HDR_TASK CKPT_HDR_TASK
+	CKPT_HDR_TASK_OBJS,
+#define CKPT_HDR_TASK_OBJS CKPT_HDR_TASK_OBJS
 	CKPT_HDR_RESTART_BLOCK,
 #define CKPT_HDR_RESTART_BLOCK CKPT_HDR_RESTART_BLOCK
 	CKPT_HDR_THREAD,
@@ -80,6 +82,15 @@ enum {
 
 	/* 201-299: reserved for arch-dependent */
 
+	CKPT_HDR_FILE_TABLE = 301,
+#define CKPT_HDR_FILE_TABLE CKPT_HDR_FILE_TABLE
+	CKPT_HDR_FILE_DESC,
+#define CKPT_HDR_FILE_DESC CKPT_HDR_FILE_DESC
+	CKPT_HDR_FILE_NAME,
+#define CKPT_HDR_FILE_NAME CKPT_HDR_FILE_NAME
+	CKPT_HDR_FILE,
+#define CKPT_HDR_FILE CKPT_HDR_FILE
+
 	CKPT_HDR_TAIL = 9001,
 #define CKPT_HDR_TAIL CKPT_HDR_TAIL
 
@@ -106,6 +117,10 @@ struct ckpt_hdr_objref {
 enum obj_type {
 	CKPT_OBJ_IGNORE = 0,
 #define CKPT_OBJ_IGNORE CKPT_OBJ_IGNORE
+	CKPT_OBJ_FILE_TABLE,
+#define CKPT_OBJ_FILE_TABLE CKPT_OBJ_FILE_TABLE
+	CKPT_OBJ_FILE,
+#define CKPT_OBJ_FILE CKPT_OBJ_FILE
 	CKPT_OBJ_MAX
 #define CKPT_OBJ_MAX CKPT_OBJ_MAX
 };
@@ -188,6 +203,12 @@ struct ckpt_hdr_task {
 	__u64 clear_child_tid;
 } __attribute__((aligned(8)));
 
+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+	struct ckpt_hdr h;
+	__s32 files_objref;
+} __attribute__((aligned(8)));
+
 /* restart blocks */
 struct ckpt_hdr_restart_block {
 	struct ckpt_hdr h;
@@ -220,4 +241,42 @@ enum restart_block_type {
 #define CKPT_RESTART_BLOCK_FUTEX CKPT_RESTART_BLOCK_FUTEX
 };
 
+/* file system */
+struct ckpt_hdr_file_table {
+	struct ckpt_hdr h;
+	__s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+	struct ckpt_hdr h;
+	__s32 fd_objref;
+	__s32 fd_descriptor;
+	__u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+	CKPT_FILE_IGNORE = 0,
+#define CKPT_FILE_IGNORE CKPT_FILE_IGNORE
+	CKPT_FILE_GENERIC,
+#define CKPT_FILE_GENERIC CKPT_FILE_GENERIC
+	CKPT_FILE_MAX
+#define CKPT_FILE_MAX CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+	struct ckpt_hdr h;
+	__u32 f_type;
+	__u32 f_mode;
+	__u32 f_flags;
+	__u32 _padding;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+	struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 90bbb16..aae6755 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -14,6 +14,8 @@
 
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
+#include <linux/list.h>
+#include <linux/path.h>
 #include <linux/fs.h>
 #include <linux/ktime.h>
 #include <linux/wait.h>
@@ -40,6 +42,9 @@ struct ckpt_ctx {
 	atomic_t refcount;
 
 	struct ckpt_obj_hash *obj_hash;	/* repository for shared objects */
+	struct deferqueue_head *files_deferq;	/* deferred file-table work */
+
+	struct path root_fs_path;     /* container root (FIXME) */
 
 	struct task_struct *tsk;/* checkpoint: current target task */
 	char err_string[256];	/* checkpoint: error string */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 65ebec5..7902a51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1120,6 +1120,7 @@ extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_flock(struct file *);
 extern void locks_release_private(struct file_lock *);
 extern void posix_test_lock(struct file *, struct file_lock *);
+extern int find_locks_with_owner(struct file *filp, fl_owner_t owner);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
 extern int posix_lock_file_wait(struct file *, struct file_lock *);
 extern int posix_unblock_lock(struct file *, struct file_lock *);
@@ -1188,6 +1189,11 @@ static inline void locks_remove_posix(struct file *filp, fl_owner_t owner)
 	return;
 }
 
+static inline int find_locks_with_owner(struct file *filp, fl_owner_t owner)
+{
+	return -ENOENT;
+}
+
 static inline void locks_remove_flock(struct file *filp)
 {
 	return;
@@ -2318,7 +2324,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
 #define generic_file_checkpoint NULL
+#endif
 
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
-- 
1.6.3.3
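
For readers following the record layout added to checkpoint_hdr.h above,
here is a minimal userspace sketch of struct ckpt_hdr_file. It is not part
of the patch. The definition of struct ckpt_hdr does not appear in this
hunk, so the two-field type/len header below is an assumption made only
for illustration, and ckpt_file_type_valid() is a hypothetical helper, not
a function from the series. Under that assumption, the explicit _padding
field keeps both 64-bit members on 8-byte boundaries, so the layout should
not depend on the compiler's default alignment of 64-bit integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: struct ckpt_hdr is not shown in this hunk; a type/len
 * pair of 32-bit fields is used here purely for illustration. */
struct ckpt_hdr {
	uint32_t type;
	uint32_t len;
} __attribute__((aligned(8)));

/* Mirrors enum file_type from the patch. */
enum file_type {
	CKPT_FILE_IGNORE = 0,
	CKPT_FILE_GENERIC,
	CKPT_FILE_MAX
};

/* Userspace mirror of struct ckpt_hdr_file from the patch, with
 * __u32/__u64 replaced by their <stdint.h> equivalents. */
struct ckpt_hdr_file {
	struct ckpt_hdr h;
	uint32_t f_type;
	uint32_t f_mode;
	uint32_t f_flags;
	uint32_t _padding;	/* keeps f_pos on an 8-byte boundary */
	uint64_t f_pos;
	uint64_t f_version;
} __attribute__((aligned(8)));

/* Compile-time layout checks; these hold only under the assumed
 * 8-byte struct ckpt_hdr above. */
static_assert(offsetof(struct ckpt_hdr_file, f_pos) % 8 == 0,
	      "f_pos must be 8-byte aligned");
static_assert(sizeof(struct ckpt_hdr_file) % 8 == 0,
	      "record size must be a multiple of 8");

/* Hypothetical helper (not from the patch): accept only file types
 * strictly between the IGNORE and MAX sentinels. */
static inline int ckpt_file_type_valid(uint32_t f_type)
{
	return f_type > CKPT_FILE_IGNORE && f_type < CKPT_FILE_MAX;
}
```

With this sketch, ckpt_file_type_valid(CKPT_FILE_GENERIC) is the only
value of the enum that passes, since the patch defines a single concrete
file type so far.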



end of thread, other threads:[~2010-03-23  0:56 UTC | newest]

Thread overview: 88+ messages (download: mbox.gz / follow: Atom feed)
2010-03-19  0:59 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
     [not found]   ` <1268960401-16680-2-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-22  6:31     ` Nick Piggin
2010-03-22  6:31   ` Nick Piggin
2010-03-23  0:12     ` Oren Laadan
2010-03-23  0:43       ` Nick Piggin
2010-03-23  0:56         ` Oren Laadan
2010-03-23  0:56         ` Oren Laadan
     [not found]       ` <Pine.LNX.4.64.1003221959450.1520-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
2010-03-23  0:43         ` Nick Piggin
2010-03-23  0:12     ` Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
2010-03-22  6:34   ` Nick Piggin
2010-03-22 10:16     ` Matt Helsley
     [not found]       ` <20100322101635.GC20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-22 11:00         ` Nick Piggin
2010-03-22 11:00       ` Nick Piggin
2010-03-22 10:16     ` Matt Helsley
     [not found]   ` <1268960401-16680-3-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-22  6:34     ` Nick Piggin
2010-03-19  0:59 ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
2010-03-19 23:19   ` Andreas Dilger
     [not found]     ` <F18D161D-850B-4C82-83D5-1F19D573E84F-xsfywfwIY+M@public.gmane.org>
2010-03-20  4:43       ` Matt Helsley
2010-03-20  4:43     ` Matt Helsley
2010-03-21 17:27       ` Jamie Lokier
2010-03-21 19:40         ` Serge E. Hallyn
2010-03-21 20:58           ` Daniel Lezcano
2010-03-21 21:36             ` Oren Laadan
     [not found]               ` <4BA6914D.8040007-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-21 23:31                 ` xing lin
2010-03-22  8:40                 ` Daniel Lezcano
2010-03-22  8:40               ` Daniel Lezcano
     [not found]             ` <4BA68884.3080003-GANU6spQydw@public.gmane.org>
2010-03-21 21:36               ` Oren Laadan
2010-03-22  2:12               ` Matt Helsley
2010-03-22  2:12             ` Matt Helsley
2010-03-22 13:51               ` Jamie Lokier
     [not found]               ` <20100322021242.GI2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-22 13:51                 ` Jamie Lokier
2010-03-22 23:18                 ` Andreas Dilger
2010-03-22 23:18               ` Andreas Dilger
     [not found]           ` <20100321194019.GA11714-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
2010-03-21 20:58             ` Daniel Lezcano
     [not found]         ` <20100321172703.GC4174-yetKDKU6eevNLxjTenLetw@public.gmane.org>
2010-03-21 19:40           ` Serge E. Hallyn
2010-03-22  1:06           ` Matt Helsley
2010-03-22  1:06         ` Matt Helsley
2010-03-22  2:20           ` Jamie Lokier
     [not found]             ` <20100322022003.GA16462-yetKDKU6eevNLxjTenLetw@public.gmane.org>
2010-03-22  3:37               ` Matt Helsley
2010-03-22  3:37             ` Matt Helsley
     [not found]               ` <20100322033724.GA20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-22 14:13                 ` Jamie Lokier
2010-03-22 14:13               ` Jamie Lokier
2010-03-22  2:55           ` Serge E. Hallyn
     [not found]           ` <20100322010606.GG2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-22  2:20             ` Jamie Lokier
2010-03-22  2:55             ` Serge E. Hallyn
     [not found]       ` <20100320044310.GC2887-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-21 17:27         ` Jamie Lokier
     [not found]   ` <1268960401-16680-4-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-19 23:19     ` Andreas Dilger
2010-03-22 10:30     ` Nick Piggin
2010-03-22 13:22       ` Matt Helsley
     [not found]         ` <20100322132232.GD20796-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-03-22 13:38           ` Nick Piggin
2010-03-22 13:38         ` Nick Piggin
2010-03-22 13:22       ` Matt Helsley
2010-03-19  0:59 ` [C/R v20][PATCH 39/96] c/r: restore " Oren Laadan
     [not found] ` <1268960401-16680-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-19  0:59   ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 39/96] c/r: restore " Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 51/96] c/r: support for open pipes Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 66/96] c/r: restore file->f_cred Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets Oren Laadan
2010-03-19  0:59   ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan
2010-03-19  1:00   ` [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3) Oren Laadan
2010-03-19  1:00   ` [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 40/96] c/r: introduce method '->checkpoint()' in struct vm_operations_struct Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 44/96] c/r: add generic '->checkpoint' f_op to ext fses Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 45/96] c/r: add generic '->checkpoint()' f_op to simple devices Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 46/96] c/r: add checkpoint operation for opened files of generic filesystems Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 50/96] splice: export pipe/file-to-pipe/file functionality Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 51/96] c/r: support for open pipes Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 52/96] c/r: checkpoint and restore FIFOs Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 53/96] c/r: refuse to checkpoint if monitoring directories with dnotify Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 66/96] c/r: restore file->f_cred Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 82/96] c/r: checkpoint/restart epoll sets Oren Laadan
2010-03-19  0:59 ` [C/R v20][PATCH 83/96] c/r: checkpoint/restart eventfd Oren Laadan
2010-03-19  1:00 ` [C/R v20][PATCH 84/96] c/r: restore task fs_root and pwd (v3) Oren Laadan
2010-03-19  1:00 ` [C/R v20][PATCH 85/96] c/r: preliminary support mounts namespace Oren Laadan
  -- strict thread matches above, loose matches on Subject: below --
2010-03-17 16:07 [C/R v20][PATCH 00/96] Linux Checkpoint-Restart - v20 Oren Laadan
2010-03-17 16:07 ` [C/R v20][PATCH 01/96] eclone (1/11): Factor out code to allocate pidmap page Oren Laadan
2010-03-17 16:07   ` [C/R v20][PATCH 02/96] eclone (2/11): Have alloc_pidmap() return actual error code Oren Laadan
2010-03-17 16:07     ` [C/R v20][PATCH 03/96] eclone (3/11): Define set_pidmap() function Oren Laadan
2010-03-17 16:07       ` [C/R v20][PATCH 04/96] eclone (4/11): Add target_pids parameter to alloc_pid() Oren Laadan
2010-03-17 16:07         ` [C/R v20][PATCH 05/96] eclone (5/11): Add target_pids parameter to copy_process() Oren Laadan
2010-03-17 16:07           ` [C/R v20][PATCH 06/96] eclone (6/11): Check invalid clone flags Oren Laadan
2010-03-17 16:07             ` [C/R v20][PATCH 07/96] eclone (7/11): Define do_fork_with_pids() Oren Laadan
2010-03-17 16:07               ` [C/R v20][PATCH 08/96] eclone (8/11): Implement sys_eclone for x86 (32,64) Oren Laadan
2010-03-17 16:07                 ` [C/R v20][PATCH 09/96] eclone (9/11): Implement sys_eclone for s390 Oren Laadan
2010-03-17 16:07                   ` [C/R v20][PATCH 10/96] eclone (10/11): Implement sys_eclone for powerpc Oren Laadan
2010-03-17 16:07                     ` [C/R v20][PATCH 11/96] eclone (11/11): Document sys_eclone Oren Laadan
2010-03-17 16:08                       ` [C/R v20][PATCH 12/96] c/r: extend arch_setup_additional_pages() Oren Laadan
2010-03-17 16:08                         ` [C/R v20][PATCH 13/96] c/r: break out new_user_ns() Oren Laadan
2010-03-17 16:08                           ` [C/R v20][PATCH 14/96] c/r: split core function out of some set*{u,g}id functions Oren Laadan
2010-03-17 16:08                             ` [C/R v20][PATCH 15/96] cgroup freezer: Fix buggy resume test for tasks frozen with cgroup freezer Oren Laadan
2010-03-17 16:08                               ` [C/R v20][PATCH 16/96] cgroup freezer: Update stale locking comments Oren Laadan
2010-03-17 16:08                                 ` [C/R v20][PATCH 17/96] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint Oren Laadan
2010-03-17 16:08                                   ` [C/R v20][PATCH 18/96] cgroup freezer: interface to freeze a cgroup from within the kernel Oren Laadan
2010-03-17 16:08                                     ` [C/R v20][PATCH 19/96] Namespaces submenu Oren Laadan
2010-03-17 16:08                                       ` [C/R v20][PATCH 20/96] c/r: make file_pos_read/write() public Oren Laadan
2010-03-17 16:08                                         ` [C/R v20][PATCH 21/96] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan
2010-03-17 16:08                                           ` [C/R v20][PATCH 22/96] c/r: documentation Oren Laadan
2010-03-17 16:08                                             ` [C/R v20][PATCH 23/96] c/r: basic infrastructure for checkpoint/restart Oren Laadan
2010-03-17 16:08                                               ` [C/R v20][PATCH 24/96] c/r: x86_32 support " Oren Laadan
2010-03-17 16:08                                                 ` [C/R v20][PATCH 25/96] c/r: x86-64: checkpoint/restart implementation Oren Laadan
2010-03-17 16:08                                                   ` [C/R v20][PATCH 26/96] c/r: external checkpoint of a task other than ourself Oren Laadan
2010-03-17 16:08                                                     ` [C/R v20][PATCH 27/96] c/r: export functionality used in next patch for restart-blocks Oren Laadan
2010-03-17 16:08                                                       ` [C/R v20][PATCH 28/96] c/r: restart-blocks Oren Laadan
2010-03-17 16:08                                                         ` [C/R v20][PATCH 29/96] c/r: checkpoint multiple processes Oren Laadan
2010-03-17 16:08                                                           ` [C/R v20][PATCH 30/96] c/r: restart " Oren Laadan
2010-03-17 16:08                                                             ` [C/R v20][PATCH 31/96] c/r: introduce PF_RESTARTING, and skip notification on exit Oren Laadan
2010-03-17 16:08                                                               ` [C/R v20][PATCH 32/96] c/r: support for zombie processes Oren Laadan
2010-03-17 16:08                                                                 ` [C/R v20][PATCH 33/96] c/r: Save and restore the [compat_]robust_list member of the task struct Oren Laadan
2010-03-17 16:08                                                                   ` [C/R v20][PATCH 34/96] c/r: infrastructure for shared objects Oren Laadan
2010-03-17 16:08                                                                     ` [C/R v20][PATCH 35/96] c/r: detect resource leaks for whole-container checkpoint Oren Laadan
2010-03-17 16:08                                                                       ` [C/R v20][PATCH 36/96] deferqueue: generic queue to defer work Oren Laadan
2010-03-17 16:08                                                                         ` [C/R v20][PATCH 37/96] c/r: introduce new 'file_operations': ->checkpoint, ->collect() Oren Laadan
     [not found]                                                                           ` <1268842164-5590-38-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-03-17 16:08                                                                             ` [C/R v20][PATCH 38/96] c/r: dump open file descriptors Oren Laadan
2010-03-17 16:08                                                                           ` Oren Laadan
2010-03-17 16:08                                                                             ` Oren Laadan
