All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/8] checkpoint/restart: sysvshm fixes and hugetlb support
@ 2010-09-14 20:02 Nathan Lynch
       [not found] ` <1284494530-25946-1-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Nathan Lynch @ 2010-09-14 20:02 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

The following patches implement c/r for huge pages and fix a couple of
minor issues in the SysV shm c/r code I noticed in the process.

Tested on i386, x86_64, and ppc64.

Nathan Lynch (8):
  sysvshm: check for hugetlb before assuming shmem
  sysvshm: report error on failure to reattach, avoid crash
  checkpoint/sysvshm: release rwsem earlier during restore
  checkpoint/ipc: allow shmat callers to specify ipc namespace
  checkpoint/restart of anonymous hugetlb mappings
  remove VM_HUGETLB and VM_RESERVED from CKPT_VMA_NOT_SUPPORTED
  hugetlbfs checkpoint/restart hooks
  checkpoint/restart of SysV SHM_HUGETLB regions

 fs/hugetlbfs/inode.c           |   24 ++++-
 include/linux/checkpoint.h     |    7 +-
 include/linux/checkpoint_hdr.h |   16 +++
 include/linux/hugetlb.h        |   11 ++
 include/linux/shm.h            |    8 +-
 ipc/checkpoint_shm.c           |  172 ++++++++++++++++++++++++---
 ipc/shm.c                      |   17 ++--
 mm/checkpoint.c                |   13 ++
 mm/hugetlb.c                   |  257 ++++++++++++++++++++++++++++++++++++++++
 9 files changed, 491 insertions(+), 34 deletions(-)

-- 
1.7.2.2

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/8] sysvshm: check for hugetlb before assuming shmem
       [not found] ` <1284494530-25946-1-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
@ 2010-09-14 20:02   ` Nathan Lynch
  2010-09-14 20:02   ` [PATCH 2/8] sysvshm: report error on failure to reattach, avoid crash Nathan Lynch
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Nathan Lynch @ 2010-09-14 20:02 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

fill_ipc_shm_hdr should ensure that the region isn't hugetlbfs-backed
before backcasting the inode to shmem_inode_info, not after.

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 ipc/checkpoint_shm.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/ipc/checkpoint_shm.c b/ipc/checkpoint_shm.c
index c8247ce..a929e33 100644
--- a/ipc/checkpoint_shm.c
+++ b/ipc/checkpoint_shm.c
@@ -58,13 +58,17 @@ static int fill_ipc_shm_hdr(struct ckpt_ctx *ctx,
 		h->mlock_uid = (unsigned int) -1;
 
 	h->flags = 0;
-	/* check if shm was setup with SHM_NORESERVE */
-	if (SHMEM_I(shp->shm_file->f_dentry->d_inode)->flags & VM_NORESERVE)
-		h->flags |= SHM_NORESERVE;
+
 	/* check if shm was setup with SHM_HUGETLB (unsupported yet) */
 	if (is_file_hugepages(shp->shm_file)) {
 		pr_warning("c/r: unsupported SHM_HUGETLB\n");
 		ret = -ENOSYS;
+	} else {
+		struct shmem_inode_info *info;
+
+		info = SHMEM_I(shp->shm_file->f_dentry->d_inode);
+		if (info->flags & VM_NORESERVE)
+			h->flags |= SHM_NORESERVE;
 	}
 
 	ipc_unlock(&shp->shm_perm);
-- 
1.7.2.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/8] sysvshm: report error on failure to reattach, avoid crash
       [not found] ` <1284494530-25946-1-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  2010-09-14 20:02   ` [PATCH 1/8] sysvshm: check for hugetlb before assuming shmem Nathan Lynch
@ 2010-09-14 20:02   ` Nathan Lynch
       [not found]     ` <1284494530-25946-3-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  2010-09-14 20:02   ` [PATCH 3/8] checkpoint/sysvshm: release rwsem earlier during restore Nathan Lynch
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Nathan Lynch @ 2010-09-14 20:02 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

If ipcshm_restore fails to look up the file object for the region
being restored, it should return the error to its caller and not
proceed to dereference the file pointer.

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 ipc/shm.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index eed4b9a..1ba9193 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -334,7 +334,7 @@ int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 
 	file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
 	if (IS_ERR(file))
-		PTR_ERR(file);
+		return PTR_ERR(file);
 
 	shmid = file->f_dentry->d_inode->i_ino;
 
-- 
1.7.2.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/8] checkpoint/sysvshm: release rwsem earlier during restore
       [not found] ` <1284494530-25946-1-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  2010-09-14 20:02   ` [PATCH 1/8] sysvshm: check for hugetlb before assuming shmem Nathan Lynch
  2010-09-14 20:02   ` [PATCH 2/8] sysvshm: report error on failure to reattach, avoid crash Nathan Lynch
@ 2010-09-14 20:02   ` Nathan Lynch
  2010-09-14 20:02   ` [PATCH 4/8] checkpoint/ipc: allow shmat callers to specify ipc namespace Nathan Lynch
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Nathan Lynch @ 2010-09-14 20:02 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

shm_ids->rw_mutex protects namespace-wide shm info, but not the
attributes or contents of individual shm regions, so release it as
soon as possible in the restart path.

The SHM_HUGETLB-specific restore code (unfortunately) needs to take
mmap_sem, which cannot be taken while holding shm_ids->rw_mutex (see
sys_shmdt -> do_munmap -> shm_close), so this is a prerequisite for
restoring hugetlb shm regions.

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 ipc/checkpoint_shm.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/ipc/checkpoint_shm.c b/ipc/checkpoint_shm.c
index a929e33..69ba35a 100644
--- a/ipc/checkpoint_shm.c
+++ b/ipc/checkpoint_shm.c
@@ -284,19 +284,19 @@ int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
 
 	/* this is safe because no unauthorized access is possible */
 	ipc_unlock(ipc);
+	up_write(&shm_ids->rw_mutex);
 
 	ret = load_ipc_shm_hdr(ctx, h, shp);
 	if (ret < 0)
-		goto mutex;
+		goto fput;
 
 	/* deposit in objhash and read contents in */
 	ret = ckpt_obj_insert(ctx, file, h->objref, CKPT_OBJ_FILE);
 	if (ret < 0)
-		goto mutex;
+		goto fput;
 	ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
- mutex:
+fput:
 	fput(file);
-	up_write(&shm_ids->rw_mutex);
 
 	/* discard this shmid if error and deferqueue wasn't set */
 	if (ret < 0 && !(h->perms.mode & SHM_DEST)) {
-- 
1.7.2.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/8] checkpoint/ipc: allow shmat callers to specify ipc namespace
       [not found] ` <1284494530-25946-1-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2010-09-14 20:02   ` [PATCH 3/8] checkpoint/sysvshm: release rwsem earlier during restore Nathan Lynch
@ 2010-09-14 20:02   ` Nathan Lynch
  2010-09-14 20:02   ` [PATCH 5/8] checkpoint/restart of anonymous hugetlb mappings Nathan Lynch
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Nathan Lynch @ 2010-09-14 20:02 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Restore of huge page shm regions will require that the restarting
process temporarily shmat-attach to regions which are not in the
current ipc namespace.  Add an ipc_namespace argument to
do_shmat_pgoff; convert callers to pass current->nsproxy->ipc_ns.

This depends on the patch "support ipc shm regions which are partially
mapped" posted to the containers list on 6 August.

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 include/linux/shm.h |    8 +++++---
 ipc/shm.c           |   15 ++++++++-------
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/include/linux/shm.h b/include/linux/shm.h
index 1f6af8c..9ca3196 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -104,10 +104,12 @@ struct shmid_kernel /* private to the kernel */
 #define SHM_NORESERVE   010000  /* don't check for reservations */
 
 #ifdef CONFIG_SYSVIPC
+struct ipc_namespace;
+
 long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr);
-long do_shmat_pgoff(int shmid, char __user *shmaddr,
-		    int shmflg, unsigned long *addr,
-		    unsigned long shmsize, unsigned long shmpgoff);
+long do_shmat_ns_pgoff(struct ipc_namespace *ns, int shmid,
+		       char __user *shmaddr, int shmflg, unsigned long *addr,
+		       unsigned long shmsize, unsigned long shmpgoff);
 extern int is_file_shm_hugepages(struct file *file);
 #else
 static inline long do_shmat(int shmid, char __user *shmaddr,
diff --git a/ipc/shm.c b/ipc/shm.c
index 1ba9193..6768fe4 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -321,6 +321,7 @@ static int ipcshm_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
 int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 		   struct ckpt_hdr_vma *h)
 {
+	struct ipc_namespace *ipcns = current->nsproxy->ipc_ns;
 	struct file *file;
 	int shmid, shmflg = 0;
 	unsigned long start;
@@ -348,8 +349,8 @@ int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 	/* do_shmat() below handles partial and offset mapping */
 	start = (unsigned long) h->vm_start;
 	size = min(h->ino_size, h->vm_end - h->vm_start);
-	ret = do_shmat_pgoff(shmid, (char __user *) start, shmflg,
-			     &addr, size, h->vm_pgoff << PAGE_SHIFT);
+	ret = do_shmat_ns_pgoff(ipcns, shmid, (char __user *) start, shmflg,
+				&addr, size, h->vm_pgoff << PAGE_SHIFT);
 
 	BUG_ON(ret >= 0 && addr != h->vm_start);
 	return ret;
@@ -880,8 +881,9 @@ out:
  * "raddr" thing points to kernel space, and there has to be a wrapper around
  * this.
  */
-long do_shmat_pgoff(int shmid, char __user *shmaddr, int shmflg,
-		    ulong *raddr, ulong shmsize, ulong shmpgoff)
+long do_shmat_ns_pgoff(struct ipc_namespace *ns, int shmid,
+		       char __user *shmaddr, int shmflg, ulong *raddr,
+		       ulong shmsize, ulong shmpgoff)
 {
 	struct shmid_kernel *shp;
 	unsigned long addr;
@@ -893,7 +895,6 @@ long do_shmat_pgoff(int shmid, char __user *shmaddr, int shmflg,
 	unsigned long prot;
 	int acc_mode;
 	unsigned long user_addr;
-	struct ipc_namespace *ns;
 	struct shm_file_data *sfd;
 	struct path path;
 	fmode_t f_mode;
@@ -937,7 +938,6 @@ long do_shmat_pgoff(int shmid, char __user *shmaddr, int shmflg,
 	 * We cannot rely on the fs check since SYSV IPC does have an
 	 * additional creator id...
 	 */
-	ns = current->nsproxy->ipc_ns;
 	shp = shm_lock_check(ns, shmid);
 	if (IS_ERR(shp)) {
 		err = PTR_ERR(shp);
@@ -1045,7 +1045,8 @@ out_put_dentry:
  */
 long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr)
 {
-	return do_shmat_pgoff(shmid, shmaddr, shmflg, raddr, 0, 0);
+	return do_shmat_ns_pgoff(current->nsproxy->ipc_ns, shmid, shmaddr,
+				 shmflg, raddr, 0, 0);
 }
 
 SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
-- 
1.7.2.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/8] checkpoint/restart of anonymous hugetlb mappings
       [not found] ` <1284494530-25946-1-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2010-09-14 20:02   ` [PATCH 4/8] checkpoint/ipc: allow shmat callers to specify ipc namespace Nathan Lynch
@ 2010-09-14 20:02   ` Nathan Lynch
       [not found]     ` <1284494530-25946-6-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  2010-09-14 20:02   ` [PATCH 6/8] remove VM_HUGETLB and VM_RESERVED from CKPT_VMA_NOT_SUPPORTED Nathan Lynch
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Nathan Lynch @ 2010-09-14 20:02 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Support checkpoint and restore of both private and shared
hugepage-backed mappings established via mmap(MAP_HUGETLB).  Introduce
APIs for checkpoint and restart of individual huge pages which are to
be used by the sysv SHM_HUGETLB c/r code.

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 include/linux/checkpoint.h     |    4 +-
 include/linux/checkpoint_hdr.h |   16 +++
 include/linux/hugetlb.h        |   11 ++
 mm/checkpoint.c                |   13 ++
 mm/hugetlb.c                   |  257 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 300 insertions(+), 1 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 4e25042..d9a65a7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -299,12 +299,14 @@ extern unsigned long generic_vma_restore(struct mm_struct *mm,
 extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
 			       struct file *file, struct ckpt_hdr_vma *h);
 
+extern int hugetlb_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+			   struct ckpt_hdr_vma *hdr);
+
 extern int checkpoint_memory_contents(struct ckpt_ctx *ctx,
 				      struct vm_area_struct *vma,
 				      struct inode *inode);
 extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
 
-
 #define CKPT_VMA_NOT_SUPPORTED						\
 	(VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP |		\
 	 VM_RESERVED | VM_HUGETLB | VM_NONLINEAR |	\
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index f4f9577..bda5d74 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -151,6 +151,8 @@ enum {
 #define CKPT_HDR_VMA CKPT_HDR_VMA
 	CKPT_HDR_PGARR,
 #define CKPT_HDR_PGARR CKPT_HDR_PGARR
+	CKPT_HDR_HPAGE,
+#define CKPT_HDR_HPAGE CKPT_HDR_HPAGE
 	CKPT_HDR_MM_CONTEXT,
 #define CKPT_HDR_MM_CONTEXT CKPT_HDR_MM_CONTEXT
 
@@ -881,6 +883,10 @@ enum vma_type {
 #define CKPT_VMA_SHM_IPC CKPT_VMA_SHM_IPC
 	CKPT_VMA_SHM_IPC_SKIP,	/* shared sysvipc (skip contents) */
 #define CKPT_VMA_SHM_IPC_SKIP CKPT_VMA_SHM_IPC_SKIP
+	CKPT_VMA_HUGETLB,
+#define CKPT_VMA_HUGETLB CKPT_VMA_HUGETLB
+	CKPT_VMA_HUGETLB_SKIP,
+#define CKPT_VMA_HUGETLB_SKIP CKPT_VMA_HUGETLB_SKIP
 	CKPT_VMA_MAX,
 #define CKPT_VMA_MAX CKPT_VMA_MAX
 };
@@ -907,6 +913,16 @@ struct ckpt_hdr_pgarr {
 	__u64 nr_pages;		/* number of pages to saved */
 } __attribute__((aligned(8)));
 
+/* huge page */
+struct ckpt_hdr_hpage {
+	struct ckpt_hdr h;
+	union {
+		__u64 vaddr;
+		__u64 index;
+	};
+	__u16 shift;
+} __attribute__((aligned(8)));
+
 /* signals */
 struct ckpt_sigset {
 	__u8 sigset[CKPT_ARCH_NSIG / 8];
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 78b4bc6..3808c04 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -47,6 +47,8 @@ int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
 						int acctflags);
 void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
+int hugetlb_restore_page(struct ckpt_ctx *ctx, struct page *page);
+int hugetlb_checkpoint_page(struct ckpt_ctx *ctx, struct page *page);
 
 extern unsigned long hugepages_treat_as_movable;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
@@ -323,6 +325,15 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
 {
 	return 1;
 }
+
+static inline int hugetlb_restore_page(struct ckpt_ctx *ctx, struct page *page)
+{
+	return -ENOSYS;
+}
+static inline int hugetlb_checkpoint_page(struct ckpt_ctx *ctx, struct page *page)
+{
+	return -ENOSYS;
+}
 #endif
 
 #endif /* _LINUX_HUGETLB_H */
diff --git a/mm/checkpoint.c b/mm/checkpoint.c
index 70300e8..8d9a168 100644
--- a/mm/checkpoint.c
+++ b/mm/checkpoint.c
@@ -1021,6 +1021,8 @@ static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
 		vm_flags |= MAP_PRIVATE;
 	if (orig_vm_flags & VM_NORESERVE)
 		vm_flags |= MAP_NORESERVE;
+	if (orig_vm_flags & VM_HUGETLB)
+		vm_flags |= MAP_HUGETLB;
 
 	return vm_flags;
 }
@@ -1180,6 +1182,17 @@ static struct restore_vma_ops restore_vma_ops[] = {
 		.vma_type = CKPT_VMA_SHM_IPC_SKIP,
 		.restore = ipcshm_restore,
 	},
+	/* hugeltb */
+	{
+		.vma_name = "HUGETLB",
+		.vma_type = CKPT_VMA_HUGETLB,
+		.restore = hugetlb_restore,
+	},
+	{
+		.vma_name = "HUGETLB (SKIP)",
+		.vma_type = CKPT_VMA_HUGETLB_SKIP,
+		.restore = hugetlb_restore,
+	},
 };
 
 /**
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6034dc9..3b5942c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -8,7 +8,10 @@
 #include <linux/mm.h>
 #include <linux/seq_file.h>
 #include <linux/sysctl.h>
+#include <linux/checkpoint.h>
+#include <linux/file.h>
 #include <linux/highmem.h>
+#include <linux/mman.h>
 #include <linux/mmu_notifier.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
@@ -2057,10 +2060,264 @@ static int hugetlb_vm_op_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return 0;
 }
 
+#define ckpt_debug_hpage_hdr(hdr) \
+	ckpt_debug("vaddr=%#llx shift=%hu\n", (hdr)->vaddr, (hdr)->shift)
+
+static void ckpt_hdr_hpage_init(struct ckpt_hdr_hpage *hdr, unsigned long shift)
+{
+	hdr->h.type = CKPT_HDR_HPAGE;
+	hdr->h.len = sizeof(struct ckpt_hdr_hpage);
+	hdr->shift = shift;
+	hdr->vaddr = 0; /* to be filled in by user */
+}
+
+int hugetlb_checkpoint_page(struct ckpt_ctx *ctx, struct page *head)
+{
+	unsigned int nr_pages;
+	struct page *page;
+	int ret = 0;
+	int i;
+
+	nr_pages = pages_per_huge_page(page_hstate(head));
+	page = head;
+
+	for (i = 0; i < nr_pages; i++) {
+		void *ptr;
+
+		cond_resched();
+
+		ptr = kmap_atomic(page, KM_USER1);
+		copy_page(ctx->scratch_page, ptr);
+		kunmap_atomic(ptr, KM_USER1);
+		ret = ckpt_kwrite(ctx, ctx->scratch_page, PAGE_SIZE);
+		if (ret < 0)
+			break;
+
+		page = mem_map_next(page, head, i + 1);
+	}
+
+	return ret;
+}
+
+#define CKPT_HDR_HPAGE_LAST ~(0UL)
+static bool ckpt_hdr_hpage_last(const struct ckpt_hdr_hpage *hdr)
+{
+	return hdr->vaddr == CKPT_HDR_HPAGE_LAST;
+}
+
+static int hugetlb_dump_contents(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct ckpt_hdr_hpage hdr;
+	unsigned long pageshift;
+	unsigned long pagesize;
+	unsigned long addr;
+	int ret;
+
+	pageshift = huge_page_shift(hstate_vma(vma));
+	pagesize = vma_kernel_pagesize(vma);
+
+	ckpt_hdr_hpage_init(&hdr, pageshift);
+
+	for (addr = vma->vm_start; addr < vma->vm_end; addr += pagesize) {
+		struct page *page = NULL;
+
+		down_read(&vma->vm_mm->mmap_sem);
+		ret = __get_user_pages(ctx->tsk, vma->vm_mm,
+				       addr, 1, FOLL_DUMP | FOLL_GET,
+				       &page, NULL);
+		/* FOLL_DUMP gives -EFAULT for holes */
+		if (ret == -EFAULT)
+			ret = 0;
+		up_read(&vma->vm_mm->mmap_sem);
+
+		if (ret < 0)
+			goto release;
+		if (!page)
+			continue;
+
+		hdr.vaddr = addr;
+
+		ckpt_debug_hpage_hdr(&hdr);
+
+		ret = ckpt_write_obj(ctx, &hdr.h);
+		if (ret < 0)
+			goto release;
+
+		ret = hugetlb_checkpoint_page(ctx, page);
+release:
+		if (page)
+			page_cache_release(page);
+		if (ret < 0)
+			break;
+	}
+
+	if (ret < 0)
+		goto err;
+	hdr.vaddr = CKPT_HDR_HPAGE_LAST;
+	ret = ckpt_write_obj(ctx, &hdr.h);
+err:
+	return ret;
+}
+
+static int hugetlb_vm_op_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+	enum vma_type vma_type;
+	int ino_objref;
+	int ret, first;
+
+	BUG_ON(!(vma->vm_flags & VM_HUGETLB));
+	BUG_ON(!vma->vm_file);
+
+	ret = ckpt_obj_visit(ctx, vma->vm_file, CKPT_OBJ_FILE);
+	if (ret < 0)
+		return ret;
+
+	ino_objref = ckpt_obj_lookup_add(ctx, vma->vm_file->f_dentry->d_inode,
+					 CKPT_OBJ_INODE, &first);
+	if (ino_objref < 0)
+		return ino_objref;
+
+	vma_type = first ? CKPT_VMA_HUGETLB : CKPT_VMA_HUGETLB_SKIP;
+
+	ret = generic_vma_checkpoint(ctx, vma, vma_type, 0, ino_objref);
+	if (ret)
+		return ret;
+
+	if (vma_type == CKPT_VMA_HUGETLB)
+		ret = hugetlb_dump_contents(ctx, vma);
+
+	return ret;
+}
+
+int hugetlb_restore_page(struct ckpt_ctx *ctx, struct page *head)
+{
+	unsigned int nr_pages;
+	struct page *page;
+	int ret = 0;
+	int i;
+
+	nr_pages = pages_per_huge_page(page_hstate(head));
+	page = head;
+
+	for (i = 0; i < nr_pages; i++) {
+		void *ptr;
+
+		ret = ckpt_kread(ctx, ctx->scratch_page, PAGE_SIZE);
+		if (ret < 0)
+			break;
+
+		cond_resched();
+
+		ptr = kmap_atomic(page, KM_USER1);
+		copy_page(ptr, ctx->scratch_page);
+		kunmap_atomic(ptr, KM_USER1);
+
+		page = mem_map_next(page, head, i + 1);
+	}
+
+	return ret;
+}
+
+static int hugetlb_restore_contents(struct ckpt_ctx *ctx)
+{
+	int ret = 0;
+
+	while (1) {
+		struct ckpt_hdr_hpage *hdr;
+		unsigned long addr;
+		struct page *page;
+		bool last;
+
+		hdr = ckpt_read_obj_type(ctx, sizeof(*hdr), CKPT_HDR_HPAGE);
+		if (IS_ERR(hdr)) {
+			ret = PTR_ERR(hdr);
+			break;
+		}
+
+		ckpt_debug_hpage_hdr(hdr);
+		last = ckpt_hdr_hpage_last(hdr);
+		addr = (unsigned long)hdr->vaddr;
+
+		ckpt_hdr_put(ctx, hdr);
+
+		if (last)
+			break;
+
+		down_read(&current->mm->mmap_sem);
+		ret = get_user_pages(current, current->mm, addr, 1, 1, 1,
+				     &page, NULL);
+		up_read(&current->mm->mmap_sem);
+
+		if (ret < 0)
+			break;
+
+		ret = hugetlb_restore_page(ctx, page);
+
+		page_cache_release(page);
+
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+
+int hugetlb_restore(struct ckpt_ctx *ctx, struct mm_struct *mm, struct ckpt_hdr_vma *hdr)
+{
+	unsigned long addr;
+	struct file *file;
+	int ret = 0;
+
+	if (!(hdr->vm_flags & (VM_HUGETLB)))
+		return -EINVAL;
+
+	file = ckpt_obj_try_fetch(ctx, hdr->ino_objref, CKPT_OBJ_FILE);
+	if (PTR_ERR(file) == -EINVAL)
+		file = NULL;
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	/* To do: don't assume same default_hstate on source and destinaton */
+	if (!file) {
+		struct user_struct *user = NULL;
+		unsigned long len;
+
+		if (hdr->vma_type != CKPT_VMA_HUGETLB)
+			return -EINVAL;
+
+		/* see sys_mmap_pgoff */
+		len = hdr->vm_end - hdr->vm_start;
+		len = ALIGN(len, huge_page_size(&default_hstate));
+		file = hugetlb_file_setup(HUGETLB_ANON_FILE, len, VM_NORESERVE,
+					  &user, HUGETLB_ANONHUGE_INODE);
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+		ret = ckpt_obj_insert(ctx, file, hdr->ino_objref, CKPT_OBJ_FILE);
+		if (ret < 0)
+			goto out;
+	} else {
+		if (hdr->vma_type != CKPT_VMA_HUGETLB_SKIP)
+			return -EINVAL;
+		get_file(file);
+	}
+
+	addr = generic_vma_restore(mm, file, hdr);
+	if (IS_ERR((void *)addr))
+		ret = PTR_ERR((void *)addr);
+	else if (hdr->vma_type == CKPT_VMA_HUGETLB)
+		ret = hugetlb_restore_contents(ctx);
+out:
+	fput(file);
+	return ret;
+}
+
 const struct vm_operations_struct hugetlb_vm_ops = {
 	.fault = hugetlb_vm_op_fault,
 	.open = hugetlb_vm_op_open,
 	.close = hugetlb_vm_op_close,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint = hugetlb_vm_op_checkpoint,
+#endif
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
-- 
1.7.2.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/8] remove VM_HUGETLB and VM_RESERVED from CKPT_VMA_NOT_SUPPORTED
       [not found] ` <1284494530-25946-1-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2010-09-14 20:02   ` [PATCH 5/8] checkpoint/restart of anonymous hugetlb mappings Nathan Lynch
@ 2010-09-14 20:02   ` Nathan Lynch
       [not found]     ` <1284494530-25946-7-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  2010-09-14 20:02   ` [PATCH 7/8] hugetlbfs checkpoint/restart hooks Nathan Lynch
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Nathan Lynch @ 2010-09-14 20:02 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Now that the hugetlb c/r support code is in place we can remove
VM_HUGETLB from the bitmask of unsupported vma flags.

All huge pages are VM_RESERVED so a less coarse method is needed to
prevent checkpoint of other reserved pages.

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 include/linux/checkpoint.h |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index d9a65a7..e224490 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -308,8 +308,7 @@ extern int checkpoint_memory_contents(struct ckpt_ctx *ctx,
 extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
 
 #define CKPT_VMA_NOT_SUPPORTED						\
-	(VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP |		\
-	 VM_RESERVED | VM_HUGETLB | VM_NONLINEAR |	\
+	(VM_IO | VM_NONLINEAR | VM_PFNMAP | VM_NONLINEAR |		\
 	 VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
 
 /* signals */
-- 
1.7.2.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 7/8] hugetlbfs checkpoint/restart hooks
       [not found] ` <1284494530-25946-1-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
                     ` (5 preceding siblings ...)
  2010-09-14 20:02   ` [PATCH 6/8] remove VM_HUGETLB and VM_RESERVED from CKPT_VMA_NOT_SUPPORTED Nathan Lynch
@ 2010-09-14 20:02   ` Nathan Lynch
  2010-09-14 20:02   ` [PATCH 8/8] checkpoint/restart of SysV SHM_HUGETLB regions Nathan Lynch
  2010-09-17  0:37   ` [PATCH 0/8] checkpoint/restart: sysvshm fixes and hugetlb support Oren Laadan
  8 siblings, 0 replies; 21+ messages in thread
From: Nathan Lynch @ 2010-09-14 20:02 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 fs/hugetlbfs/inode.c |   24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a0bbd3d..575f837 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -21,6 +21,7 @@
 #include <linux/string.h>
 #include <linux/capability.h>
 #include <linux/ctype.h>
+#include <linux/checkpoint.h>
 #include <linux/backing-dev.h>
 #include <linux/hugetlb.h>
 #include <linux/pagevec.h>
@@ -463,6 +464,24 @@ out:
 	return error;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static const struct file_operations hugetlbfs_dir_operations = {
+	/* Just like simple_dir_operations except... */
+	.open		= dcache_dir_open,
+	.release	= dcache_dir_close,
+	.llseek		= dcache_dir_lseek,
+	.read		= generic_read_dir,
+	.readdir	= dcache_readdir,
+	.fsync		= simple_sync_file,
+
+	/* The checkpoint ops are unlike simple_dir_operations */
+	.checkpoint	= generic_file_checkpoint,
+};
+#else
+#define hugetlbfs_dir_operations simple_dir_operations
+#endif
+
+
 static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid, 
 					gid_t gid, int mode, dev_t dev)
 {
@@ -497,7 +516,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
 			break;
 		case S_IFDIR:
 			inode->i_op = &hugetlbfs_dir_inode_operations;
-			inode->i_fop = &simple_dir_operations;
+			inode->i_fop = &hugetlbfs_dir_operations;
 
 			/* directory inodes start off with i_nlink == 2 (for "." entry) */
 			inc_nlink(inode);
@@ -690,6 +709,9 @@ const struct file_operations hugetlbfs_file_operations = {
 	.mmap			= hugetlbfs_file_mmap,
 	.fsync			= simple_sync_file,
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
+#ifdef CONFIG_CHECKPOINT
+	.checkpoint	= generic_file_checkpoint,
+#endif
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
-- 
1.7.2.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 8/8] checkpoint/restart of SysV SHM_HUGETLB regions
       [not found] ` <1284494530-25946-1-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
                     ` (6 preceding siblings ...)
  2010-09-14 20:02   ` [PATCH 7/8] hugetlbfs checkpoint/restart hooks Nathan Lynch
@ 2010-09-14 20:02   ` Nathan Lynch
       [not found]     ` <1284494530-25946-9-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  2010-09-17  0:37   ` [PATCH 0/8] checkpoint/restart: sysvshm fixes and hugetlb support Oren Laadan
  8 siblings, 1 reply; 21+ messages in thread
From: Nathan Lynch @ 2010-09-14 20:02 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Large page-backed shm regions require special handling, especially
during restart.  The association of a large page with a shm region's
inode can occur only in the context of a process causing a fault with
the region mapped into its mm.  In order to restore that association,
temporarily shmat-attach the restored SHM_HUGETLB region to the
restarting process's mm, using the just-restored ipc namespace
instead of the current one (the nsproxy switch hasn't occured yet).

Since the temporary shmat of the region during restart causes some of
the shm attributes to be updated, re-restore them from the ipc_shm
checkpoint header after unmapping.

Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
---
 ipc/checkpoint_shm.c |  154 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 142 insertions(+), 12 deletions(-)

diff --git a/ipc/checkpoint_shm.c b/ipc/checkpoint_shm.c
index 69ba35a..7f9d701 100644
--- a/ipc/checkpoint_shm.c
+++ b/ipc/checkpoint_shm.c
@@ -32,6 +32,69 @@
  * ipc checkpoint
  */
 
+#define CKPT_HDR_HPAGE_LAST ~(0UL)
+static bool ckpt_hdr_hpage_last(const struct ckpt_hdr_hpage *hdr)
+{
+	return hdr->index == CKPT_HDR_HPAGE_LAST;
+}
+
+static void ckpt_hdr_hpage_init(struct ckpt_hdr_hpage *hdr, unsigned long shift)
+{
+	hdr->h.type = CKPT_HDR_HPAGE;
+	hdr->h.len = sizeof(struct ckpt_hdr_hpage);
+	hdr->shift = shift;
+	hdr->index = 0; /* to be filled in by user */
+}
+
+static int shm_hugetlb_checkpoint_contents(struct ckpt_ctx *ctx, struct file *filp)
+{
+	struct hstate *h = hstate_file(filp);
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct ckpt_hdr_hpage hdr;
+	unsigned long end_index;
+	unsigned long index;
+	ssize_t retval = 0;
+	loff_t isize;
+
+	isize = i_size_read(inode);
+	if (isize == 0)
+		goto out;
+
+	end_index = (isize - 1) >> huge_page_shift(h);
+
+	ckpt_hdr_hpage_init(&hdr, huge_page_shift(h));
+
+	for (index = 0; index < end_index + 1; index++) {
+		struct page *page;
+
+		page = find_get_page(mapping, index);
+
+		/* skip holes */
+		if (!page)
+			continue;
+
+		hdr.index = index;
+
+		retval = ckpt_write_obj(ctx, &hdr.h);
+		if (retval < 0)
+			goto release;
+
+		retval = hugetlb_checkpoint_page(ctx, page);
+release:
+		page_cache_release(page);
+		if (retval < 0)
+			break;
+	}
+
+	if (retval < 0)
+		goto out;
+	hdr.index = CKPT_HDR_HPAGE_LAST;
+	retval = ckpt_write_obj(ctx, &hdr.h);
+out:
+	return retval;
+}
+
 /* called with the msgids->rw_mutex is read-held */
 static int fill_ipc_shm_hdr(struct ckpt_ctx *ctx,
 			    struct ckpt_hdr_ipc_shm *h,
@@ -59,10 +122,8 @@ static int fill_ipc_shm_hdr(struct ckpt_ctx *ctx,
 
 	h->flags = 0;
 
-	/* check if shm was setup with SHM_HUGETLB (unsupported yet) */
 	if (is_file_hugepages(shp->shm_file)) {
-		pr_warning("c/r: unsupported SHM_HUGETLB\n");
-		ret = -ENOSYS;
+		h->flags |= SHM_HUGETLB;
 	} else {
 		struct shmem_inode_info *info;
 
@@ -117,7 +178,10 @@ int checkpoint_ipc_shm(int id, void *p, void *data)
 	if (ret < 0)
 		goto out;
 
-	ret = checkpoint_memory_contents(ctx, NULL, inode);
+	if (is_file_hugepages(shp->shm_file))
+		ret = shm_hugetlb_checkpoint_contents(ctx, shp->shm_file);
+	else
+		ret = checkpoint_memory_contents(ctx, NULL, inode);
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
@@ -149,6 +213,75 @@ struct dq_ipcshm_del {
 	int id;
 };
 
+static void __load_ipc_shm_hdr(const struct ckpt_hdr_ipc_shm *h, struct shmid_kernel *shp)
+{
+	shp->shm_atim = h->shm_atim;
+	shp->shm_dtim = h->shm_dtim;
+	shp->shm_ctim = h->shm_ctim;
+	shp->shm_cprid = h->shm_cprid;
+	shp->shm_lprid = h->shm_lprid;
+}
+
+static int shm_hugetlb_restore_contents(struct ckpt_ctx *ctx, struct ipc_namespace *ipcns, struct shmid_kernel *shp, const struct ckpt_hdr_ipc_shm *hdr)
+{
+	unsigned long start;
+	int ret;
+
+	ret = do_shmat_ns_pgoff(ipcns, shp->shm_perm.id, (char __user *)0,
+				0, &start, 0, 0);
+	if (ret != 0)
+		return ret;
+
+	ckpt_debug("temporarily using %#lx for huge shm restore\n", start);
+
+	while (1) {
+		struct ckpt_hdr_hpage *hdr;
+		unsigned long hpagesize;
+		unsigned long index;
+		unsigned long addr;
+		struct page *page;
+		bool last;
+
+		hdr = ckpt_read_obj_type(ctx, sizeof(*hdr), CKPT_HDR_HPAGE);
+		if (IS_ERR(hdr)) {
+			ret = PTR_ERR(hdr);
+			break;
+		}
+
+		last = ckpt_hdr_hpage_last(hdr);
+		index = (unsigned long)hdr->index;
+		hpagesize = 1UL << hdr->shift;
+
+		ckpt_hdr_put(ctx, hdr);
+
+		if (last)
+			break;
+
+		addr = start + (hpagesize * index);
+
+		down_read(&current->mm->mmap_sem);
+		ret = get_user_pages(current, current->mm, addr, 1, 1, 1,
+				     &page, NULL);
+		up_read(&current->mm->mmap_sem);
+
+		if (ret < 0)
+			break;
+
+		ret = hugetlb_restore_page(ctx, page);
+
+		page_cache_release(page);
+
+		if (ret < 0)
+			break;
+	}
+
+	sys_shmdt((void __user *)start);
+
+	__load_ipc_shm_hdr(hdr, shp);
+
+	return ret;
+}
+
 static int _ipc_shm_delete(struct ipc_namespace *ns, int id)
 {
 	mm_segment_t old_fs;
@@ -190,11 +323,7 @@ static int load_ipc_shm_hdr(struct ckpt_ctx *ctx,
 	if (h->shm_cprid < 0 || h->shm_lprid < 0)
 		return -EINVAL;
 
-	shp->shm_atim = h->shm_atim;
-	shp->shm_dtim = h->shm_dtim;
-	shp->shm_ctim = h->shm_ctim;
-	shp->shm_cprid = h->shm_cprid;
-	shp->shm_lprid = h->shm_lprid;
+	__load_ipc_shm_hdr(h, shp);
 
 	return 0;
 }
@@ -224,8 +353,6 @@ int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
 	ret = -ENOSYS;
 	if (h->mlock_uid != (unsigned int) -1)	/* FIXME: support SHM_LOCK */
 		goto out;
-	if (h->flags & SHM_HUGETLB)	/* FIXME: support SHM_HUGETLB */
-		goto out;
 
 	shmflag = h->flags | h->perms.mode | IPC_CREAT | IPC_EXCL;
 	ckpt_debug("shm: do_shmget size %lld flag %#x id %d\n",
@@ -294,7 +421,10 @@ int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
 	ret = ckpt_obj_insert(ctx, file, h->objref, CKPT_OBJ_FILE);
 	if (ret < 0)
 		goto fput;
-	ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
+	if (is_file_hugepages(file))
+		ret = shm_hugetlb_restore_contents(ctx, ns, shp, h);
+	else
+		ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
 fput:
 	fput(file);
 
-- 
1.7.2.2

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/8] sysvshm: report error on failure to reattach, avoid crash
       [not found]     ` <1284494530-25946-3-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
@ 2010-09-15  2:55       ` Matt Helsley
       [not found]         ` <20100915025502.GG8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Matt Helsley @ 2010-09-15  2:55 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Tue, Sep 14, 2010 at 03:02:04PM -0500, Nathan Lynch wrote:
> If ipcshm_restore fails to look up the file object for the region
> being restored, it should return the error to its caller and not
> proceed to dereference the file pointer.
> 
> Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

(Important fix. Adding Oren to Cc)

Reviewed-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

> ---
>  ipc/shm.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/ipc/shm.c b/ipc/shm.c
> index eed4b9a..1ba9193 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -334,7 +334,7 @@ int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
> 
>  	file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
>  	if (IS_ERR(file))
> -		PTR_ERR(file);
> +		return PTR_ERR(file);

Odd that the original code didn't trigger any unused result or must check
warnings. In linux/err.h I already see:

	static inline long __must_check PTR_ERR(const void *ptr)
	...

And in linux/compiler-gcc3.h
	if __GNUC_MINOR__ >= 4
	#define __must_check            __attribute__((warn_unused_result))
	#endif

or for those of us using GCC 4.x (linux/compiler-gcc4.h):
	#define __must_check            __attribute__((warn_unused_result))

and in my .config I have:
	CONFIG_SYSVIPC=y
	CONFIG_SYSVIPC_CHECKPOINT=y
	...
	CONFIG_ENABLE_MUST_CHECK=y

plus a simple test:

	#include <stdlib.h>

	static inline void * __attribute__((warn_unused_result)) foo(void)
	{
		return NULL;
	}

	int main (void)
	{
		if (0)
			foo();
		return 0;
	}

even without -Wall triggers the compiler warning just fine. So I can't
see why the warning is not triggering.

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/8] sysvshm: report error on failure to reattach, avoid crash
       [not found]         ` <20100915025502.GG8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-09-15  3:04           ` Matt Helsley
  2010-09-17  0:17           ` Oren Laadan
  1 sibling, 0 replies; 21+ messages in thread
From: Matt Helsley @ 2010-09-15  3:04 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Nathan Lynch

On Tue, Sep 14, 2010 at 07:55:03PM -0700, Matt Helsley wrote:
> On Tue, Sep 14, 2010 at 03:02:04PM -0500, Nathan Lynch wrote:
> > If ipcshm_restore fails to look up the file object for the region
> > being restored, it should return the error to its caller and not
> > proceed to dereference the file pointer.
> > 
> > Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
> 
> (Important fix. Adding Oren to Cc)

Argh. I was so busy verifying that __must_check didn't warn about the
bug that I neglected to do this. Sorry for the noise! Oren please have
a look at this thread.

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/8] sysvshm: report error on failure to reattach, avoid crash
       [not found]         ` <20100915025502.GG8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2010-09-15  3:04           ` Matt Helsley
@ 2010-09-17  0:17           ` Oren Laadan
  1 sibling, 0 replies; 21+ messages in thread
From: Oren Laadan @ 2010-09-17  0:17 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Nathan Lynch



On 09/14/2010 10:55 PM, Matt Helsley wrote:
> On Tue, Sep 14, 2010 at 03:02:04PM -0500, Nathan Lynch wrote:
>> If ipcshm_restore fails to look up the file object for the region
>> being restored, it should return the error to its caller and not
>> proceed to dereference the file pointer.
>>
>> Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
> 
> (Important fix. Adding Oren to Cc)
> 
> Reviewed-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> 
>> ---
>>  ipc/shm.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/ipc/shm.c b/ipc/shm.c
>> index eed4b9a..1ba9193 100644
>> --- a/ipc/shm.c
>> +++ b/ipc/shm.c
>> @@ -334,7 +334,7 @@ int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
>>
>>  	file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
>>  	if (IS_ERR(file))
>> -		PTR_ERR(file);
>> +		return PTR_ERR(file);

Hrm ... I really don't see what's the problem ;)

Oren.

> 
> Odd that the original code didn't trigger any unused result or must check
> warnings. In linux/err.h I already see:
> 
> 	static inline long __must_check PTR_ERR(const void *ptr)
> 	...
> 
> And in linux/compiler-gcc3.h
> 	if __GNUC_MINOR__ >= 4
> 	#define __must_check            __attribute__((warn_unused_result))
> 	#endif
> 
> or for those of us using GCC 4.x (linux/compiler-gcc4.h):
> 	#define __must_check            __attribute__((warn_unused_result))
> 
> and in my .config I have:
> 	CONFIG_SYSVIPC=y
> 	CONFIG_SYSVIPC_CHECKPOINT=y
> 	...
> 	CONFIG_ENABLE_MUST_CHECK=y
> 
> plus a simple test:
> 
> 	#include <stdlib.h>
> 
> 	static inline void * __attribute__((warn_unused_result)) foo(void)
> 	{
> 		return NULL;
> 	}
> 
> 	int main (void)
> 	{
> 		if (0)
> 			foo();
> 		return 0;
> 	}
> 
> even without -Wall triggers the compiler warning just fine. So I can't
> see why the warning is not triggering.
> 
> Cheers,
> 	-Matt
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/8] checkpoint/restart: sysvshm fixes and hugetlb support
       [not found] ` <1284494530-25946-1-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
                     ` (7 preceding siblings ...)
  2010-09-14 20:02   ` [PATCH 8/8] checkpoint/restart of SysV SHM_HUGETLB regions Nathan Lynch
@ 2010-09-17  0:37   ` Oren Laadan
       [not found]     ` <4C92B831.40400-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  8 siblings, 1 reply; 21+ messages in thread
From: Oren Laadan @ 2010-09-17  0:37 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hi Nathan,

Thanks for the patch series. For starters, I'll pull
patches 1-3 for ckpt-v22-dev. I'll wait for reviews
for the others.

Oren.

On 09/14/2010 04:02 PM, Nathan Lynch wrote:
> The following patches implement c/r for huge pages and fix a couple of
> minor issues in the SysV shm c/r code I noticed in the process.
> 
> Tested on i386, x86_64, and ppc64.
> 
> Nathan Lynch (8):
>   sysvshm: check for hugetlb before assuming shmem
>   sysvshm: report error on failure to reattach, avoid crash
>   checkpoint/sysvshm: release rwsem earlier during restore
>   checkpoint/ipc: allow shmat callers to specify ipc namespace
>   checkpoint/restart of anonymous hugetlb mappings
>   remove VM_HUGETLB and VM_RESERVED from CKPT_VMA_NOT_SUPPORTED
>   hugetlbfs checkpoint/restart hooks
>   checkpoint/restart of SysV SHM_HUGETLB regions
> 
>  fs/hugetlbfs/inode.c           |   24 ++++-
>  include/linux/checkpoint.h     |    7 +-
>  include/linux/checkpoint_hdr.h |   16 +++
>  include/linux/hugetlb.h        |   11 ++
>  include/linux/shm.h            |    8 +-
>  ipc/checkpoint_shm.c           |  172 ++++++++++++++++++++++++---
>  ipc/shm.c                      |   17 ++--
>  mm/checkpoint.c                |   13 ++
>  mm/hugetlb.c                   |  257 ++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 491 insertions(+), 34 deletions(-)
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 8/8] checkpoint/restart of SysV SHM_HUGETLB regions
       [not found]     ` <1284494530-25946-9-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
@ 2010-09-17  0:40       ` Oren Laadan
       [not found]         ` <4C92B903.20304-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Oren Laadan @ 2010-09-17  0:40 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA



On 09/14/2010 04:02 PM, Nathan Lynch wrote:
> Large page-backed shm regions require special handling, especially
> during restart.  The association of a large page with a shm region's
> inode can occur only in the context of a process causing a fault with
> the region mapped into its mm.  In order to restore that association,
> temporarily shmat-attach the restored SHM_HUGETLB region to the
> restarting process's mm, using the just-restored ipc namespace
> instead of the current one (the nsproxy switch hasn't occured yet).
> 
> Since the temporary shmat of the region during restart causes some of
> the shm attributes to be updated, re-restore them from the ipc_shm
> checkpoint header after unmapping.

Would it work to just move the original call to load_ipc_shm_hdr()
further down in restore_ipc_shm(), especially since the mutex is
not needed anymore - that way you don't need to re-restore them ?

I'm not too familiar with HUGETLB code otherwise, so hoping that
others review those parts while I find time to study it ...

Thanks,

Oren.

> 
> Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
> ---
>  ipc/checkpoint_shm.c |  154 ++++++++++++++++++++++++++++++++++++++++++++++----
>  1 files changed, 142 insertions(+), 12 deletions(-)
> 
> diff --git a/ipc/checkpoint_shm.c b/ipc/checkpoint_shm.c
> index 69ba35a..7f9d701 100644
> --- a/ipc/checkpoint_shm.c
> +++ b/ipc/checkpoint_shm.c
> @@ -32,6 +32,69 @@
>   * ipc checkpoint
>   */
>  
> +#define CKPT_HDR_HPAGE_LAST ~(0UL)
> +static bool ckpt_hdr_hpage_last(const struct ckpt_hdr_hpage *hdr)
> +{
> +	return hdr->index == CKPT_HDR_HPAGE_LAST;
> +}
> +
> +static void ckpt_hdr_hpage_init(struct ckpt_hdr_hpage *hdr, unsigned long shift)
> +{
> +	hdr->h.type = CKPT_HDR_HPAGE;
> +	hdr->h.len = sizeof(struct ckpt_hdr_hpage);
> +	hdr->shift = shift;
> +	hdr->index = 0; /* to be filled in by user */
> +}
> +
> +static int shm_hugetlb_checkpoint_contents(struct ckpt_ctx *ctx, struct file *filp)
> +{
> +	struct hstate *h = hstate_file(filp);
> +	struct address_space *mapping = filp->f_mapping;
> +	struct inode *inode = mapping->host;
> +	struct ckpt_hdr_hpage hdr;
> +	unsigned long end_index;
> +	unsigned long index;
> +	ssize_t retval = 0;
> +	loff_t isize;
> +
> +	isize = i_size_read(inode);
> +	if (isize == 0)
> +		goto out;
> +
> +	end_index = (isize - 1) >> huge_page_shift(h);
> +
> +	ckpt_hdr_hpage_init(&hdr, huge_page_shift(h));
> +
> +	for (index = 0; index < end_index + 1; index++) {
> +		struct page *page;
> +
> +		page = find_get_page(mapping, index);
> +
> +		/* skip holes */
> +		if (!page)
> +			continue;
> +
> +		hdr.index = index;
> +
> +		retval = ckpt_write_obj(ctx, &hdr.h);
> +		if (retval < 0)
> +			goto release;
> +
> +		retval = hugetlb_checkpoint_page(ctx, page);
> +release:
> +		page_cache_release(page);
> +		if (retval < 0)
> +			break;
> +	}
> +
> +	if (retval < 0)
> +		goto out;
> +	hdr.index = CKPT_HDR_HPAGE_LAST;
> +	retval = ckpt_write_obj(ctx, &hdr.h);
> +out:
> +	return retval;
> +}
> +
>  /* called with the msgids->rw_mutex is read-held */
>  static int fill_ipc_shm_hdr(struct ckpt_ctx *ctx,
>  			    struct ckpt_hdr_ipc_shm *h,
> @@ -59,10 +122,8 @@ static int fill_ipc_shm_hdr(struct ckpt_ctx *ctx,
>  
>  	h->flags = 0;
>  
> -	/* check if shm was setup with SHM_HUGETLB (unsupported yet) */
>  	if (is_file_hugepages(shp->shm_file)) {
> -		pr_warning("c/r: unsupported SHM_HUGETLB\n");
> -		ret = -ENOSYS;
> +		h->flags |= SHM_HUGETLB;
>  	} else {
>  		struct shmem_inode_info *info;
>  
> @@ -117,7 +178,10 @@ int checkpoint_ipc_shm(int id, void *p, void *data)
>  	if (ret < 0)
>  		goto out;
>  
> -	ret = checkpoint_memory_contents(ctx, NULL, inode);
> +	if (is_file_hugepages(shp->shm_file))
> +		ret = shm_hugetlb_checkpoint_contents(ctx, shp->shm_file);
> +	else
> +		ret = checkpoint_memory_contents(ctx, NULL, inode);
>   out:
>  	ckpt_hdr_put(ctx, h);
>  	return ret;
> @@ -149,6 +213,75 @@ struct dq_ipcshm_del {
>  	int id;
>  };
>  
> +static void __load_ipc_shm_hdr(const struct ckpt_hdr_ipc_shm *h, struct shmid_kernel *shp)
> +{
> +	shp->shm_atim = h->shm_atim;
> +	shp->shm_dtim = h->shm_dtim;
> +	shp->shm_ctim = h->shm_ctim;
> +	shp->shm_cprid = h->shm_cprid;
> +	shp->shm_lprid = h->shm_lprid;
> +}
> +
> +static int shm_hugetlb_restore_contents(struct ckpt_ctx *ctx, struct ipc_namespace *ipcns, struct shmid_kernel *shp, const struct ckpt_hdr_ipc_shm *hdr)
> +{
> +	unsigned long start;
> +	int ret;
> +
> +	ret = do_shmat_ns_pgoff(ipcns, shp->shm_perm.id, (char __user *)0,
> +				0, &start, 0, 0);
> +	if (ret != 0)
> +		return ret;
> +
> +	ckpt_debug("temporarily using %#lx for huge shm restore\n", start);
> +
> +	while (1) {
> +		struct ckpt_hdr_hpage *hdr;
> +		unsigned long hpagesize;
> +		unsigned long index;
> +		unsigned long addr;
> +		struct page *page;
> +		bool last;
> +
> +		hdr = ckpt_read_obj_type(ctx, sizeof(*hdr), CKPT_HDR_HPAGE);
> +		if (IS_ERR(hdr)) {
> +			ret = PTR_ERR(hdr);
> +			break;
> +		}
> +
> +		last = ckpt_hdr_hpage_last(hdr);
> +		index = (unsigned long)hdr->index;
> +		hpagesize = 1UL << hdr->shift;
> +
> +		ckpt_hdr_put(ctx, hdr);
> +
> +		if (last)
> +			break;
> +
> +		addr = start + (hpagesize * index);
> +
> +		down_read(&current->mm->mmap_sem);
> +		ret = get_user_pages(current, current->mm, addr, 1, 1, 1,
> +				     &page, NULL);
> +		up_read(&current->mm->mmap_sem);
> +
> +		if (ret < 0)
> +			break;
> +
> +		ret = hugetlb_restore_page(ctx, page);
> +
> +		page_cache_release(page);
> +
> +		if (ret < 0)
> +			break;
> +	}
> +
> +	sys_shmdt((void __user *)start);
> +
> +	__load_ipc_shm_hdr(hdr, shp);
> +
> +	return ret;
> +}
> +
>  static int _ipc_shm_delete(struct ipc_namespace *ns, int id)
>  {
>  	mm_segment_t old_fs;
> @@ -190,11 +323,7 @@ static int load_ipc_shm_hdr(struct ckpt_ctx *ctx,
>  	if (h->shm_cprid < 0 || h->shm_lprid < 0)
>  		return -EINVAL;
>  
> -	shp->shm_atim = h->shm_atim;
> -	shp->shm_dtim = h->shm_dtim;
> -	shp->shm_ctim = h->shm_ctim;
> -	shp->shm_cprid = h->shm_cprid;
> -	shp->shm_lprid = h->shm_lprid;
> +	__load_ipc_shm_hdr(h, shp);
>  
>  	return 0;
>  }
> @@ -224,8 +353,6 @@ int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
>  	ret = -ENOSYS;
>  	if (h->mlock_uid != (unsigned int) -1)	/* FIXME: support SHM_LOCK */
>  		goto out;
> -	if (h->flags & SHM_HUGETLB)	/* FIXME: support SHM_HUGETLB */
> -		goto out;
>  
>  	shmflag = h->flags | h->perms.mode | IPC_CREAT | IPC_EXCL;
>  	ckpt_debug("shm: do_shmget size %lld flag %#x id %d\n",
> @@ -294,7 +421,10 @@ int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
>  	ret = ckpt_obj_insert(ctx, file, h->objref, CKPT_OBJ_FILE);
>  	if (ret < 0)
>  		goto fput;
> -	ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
> +	if (is_file_hugepages(file))
> +		ret = shm_hugetlb_restore_contents(ctx, ns, shp, h);
> +	else
> +		ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
>  fput:
>  	fput(file);
>  

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/8] checkpoint/restart of anonymous hugetlb mappings
       [not found]     ` <1284494530-25946-6-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
@ 2010-09-17  0:44       ` Oren Laadan
       [not found]         ` <4C92BA08.70106-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Oren Laadan @ 2010-09-17  0:44 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA



On 09/14/2010 04:02 PM, Nathan Lynch wrote:
> Support checkpoint and restore of both private and shared
> hugepage-backed mappings established via mmap(MAP_HUGETLB).  Introduce
> APIs for checkpoint and restart of individual huge pages which are to
> be used by the sysv SHM_HUGETLB c/r code.
> 
> Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>

The code looks clean, but I need to learn more about HUGETLB
before I can say much...

Do you also have test-suite for this ?

[...]

> +static int hugetlb_dump_contents(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
> +{
> +	struct ckpt_hdr_hpage hdr;
> +	unsigned long pageshift;
> +	unsigned long pagesize;
> +	unsigned long addr;
> +	int ret;
> +
> +	pageshift = huge_page_shift(hstate_vma(vma));
> +	pagesize = vma_kernel_pagesize(vma);
> +
> +	ckpt_hdr_hpage_init(&hdr, pageshift);
> +
> +	for (addr = vma->vm_start; addr < vma->vm_end; addr += pagesize) {
> +		struct page *page = NULL;
> +
> +		down_read(&vma->vm_mm->mmap_sem);
> +		ret = __get_user_pages(ctx->tsk, vma->vm_mm,
> +				       addr, 1, FOLL_DUMP | FOLL_GET,
> +				       &page, NULL);
> +		/* FOLL_DUMP gives -EFAULT for holes */
> +		if (ret == -EFAULT)
> +			ret = 0;

With regular pages, this didn't always work, especially after they
slightly changed the semantics of FOLL_DUMP. So I introduced the
FOLL_DIRTY flag to detect dirty (non-zero) pages.  I wonder if
something like that may be needed here too ?

[...]

Oren.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/8] remove VM_HUGETLB and VM_RESERVED from CKPT_VMA_NOT_SUPPORTED
       [not found]     ` <1284494530-25946-7-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
@ 2010-09-17  3:35       ` Serge E. Hallyn
  0 siblings, 0 replies; 21+ messages in thread
From: Serge E. Hallyn @ 2010-09-17  3:35 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
> Now that the hugetlb c/r support code is in place we can remove
> VM_HUGETLB from the bitmask of unsupported vma flags.
> 
> All huge pages are VM_RESERVED so a less coarse method is needed to
> prevent checkpoint of other reserved pages.
> 
> Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
> ---
>  include/linux/checkpoint.h |    3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> index d9a65a7..e224490 100644
> --- a/include/linux/checkpoint.h
> +++ b/include/linux/checkpoint.h
> @@ -308,8 +308,7 @@ extern int checkpoint_memory_contents(struct ckpt_ctx *ctx,
>  extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
>  
>  #define CKPT_VMA_NOT_SUPPORTED						\
> -	(VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP |		\
> -	 VM_RESERVED | VM_HUGETLB | VM_NONLINEAR |	\
> +	(VM_IO | VM_NONLINEAR | VM_PFNMAP | VM_NONLINEAR |		\

	VM_NONLINEAR | VM_NONLINEAR  :)

>  	 VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
>  
>  /* signals */
> -- 
> 1.7.2.2
> 
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 8/8] checkpoint/restart of SysV SHM_HUGETLB regions
       [not found]         ` <4C92B903.20304-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-09-17 19:03           ` Nathan Lynch
  0 siblings, 0 replies; 21+ messages in thread
From: Nathan Lynch @ 2010-09-17 19:03 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Thu, 2010-09-16 at 20:40 -0400, Oren Laadan wrote:
> 
> On 09/14/2010 04:02 PM, Nathan Lynch wrote:
> > Large page-backed shm regions require special handling, especially
> > during restart.  The association of a large page with a shm region's
> > inode can occur only in the context of a process causing a fault with
> > the region mapped into its mm.  In order to restore that association,
> > temporarily shmat-attach the restored SHM_HUGETLB region to the
> > restarting process's mm, using the just-restored ipc namespace
> > instead of the current one (the nsproxy switch hasn't occured yet).
> > 
> > Since the temporary shmat of the region during restart causes some of
> > the shm attributes to be updated, re-restore them from the ipc_shm
> > checkpoint header after unmapping.
> 
> Would it work to just move the original call to load_ipc_shm_hdr()
> further down in restore_ipc_shm(), especially since the mutex is
> not needed anymore - that way you don't need to re-restore them ?

load_ipc_shm_hdr indirectly calls the security context restore stuff, so
I'm not sure whether it's okay to change the ordering here.  But perhaps
it would make sense to defer restoring the shm_* fields until after
restoring contents.

Of course, it would be best if we could figure out a way to restore
SHM_HUGETLB contents without artificially attaching segments to the
restoring task's mm.  But so far I haven't been able to do that.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/8] checkpoint/restart of anonymous hugetlb mappings
       [not found]         ` <4C92BA08.70106-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-09-17 20:23           ` Nathan Lynch
       [not found]             ` <1284754993.4109.397.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Nathan Lynch @ 2010-09-17 20:23 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Thu, 2010-09-16 at 20:44 -0400, Oren Laadan wrote:
> 
> On 09/14/2010 04:02 PM, Nathan Lynch wrote:
> > Support checkpoint and restore of both private and shared
> > hugepage-backed mappings established via mmap(MAP_HUGETLB).  Introduce
> > APIs for checkpoint and restart of individual huge pages which are to
> > be used by the sysv SHM_HUGETLB c/r code.
> > 
> > Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
> 
> The code looks clean, but I need to learn more about HUGETLB
> before I can say much...
> 
> Do you also have test-suite for this ?

Included below is a throwaway patch to user-cr's shmem and ipcshm tests
which will cause them to use huge pages.  You'll need to configure huge
pages on your system; see Documentation/vm/hugetlbpage.txt in the kernel
source.


> 
> [...]
> 
> > +static int hugetlb_dump_contents(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
> > +{
> > +	struct ckpt_hdr_hpage hdr;
> > +	unsigned long pageshift;
> > +	unsigned long pagesize;
> > +	unsigned long addr;
> > +	int ret;
> > +
> > +	pageshift = huge_page_shift(hstate_vma(vma));
> > +	pagesize = vma_kernel_pagesize(vma);
> > +
> > +	ckpt_hdr_hpage_init(&hdr, pageshift);
> > +
> > +	for (addr = vma->vm_start; addr < vma->vm_end; addr += pagesize) {
> > +		struct page *page = NULL;
> > +
> > +		down_read(&vma->vm_mm->mmap_sem);
> > +		ret = __get_user_pages(ctx->tsk, vma->vm_mm,
> > +				       addr, 1, FOLL_DUMP | FOLL_GET,
> > +				       &page, NULL);
> > +		/* FOLL_DUMP gives -EFAULT for holes */
> > +		if (ret == -EFAULT)
> > +			ret = 0;
> 
> With regular pages, this didn't always work, especially after they
> slightly changed the semantics of FOLL_DUMP. So I introduced the
> FOLL_DIRTY flag to detect dirty (non-zero) pages.  I wonder if
> something like that may be needed here too ?

I don't think so - huge pages are never used to map regular files (they
are always on hugetlbfs), so they can't get out of sync with a backing
store.


 test/ipcshm.c |    7 ++++---
 test/shmem.c  |    8 ++++++--
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/test/ipcshm.c b/test/ipcshm.c
index cf932b4..f4b5e8a 100644
--- a/test/ipcshm.c
+++ b/test/ipcshm.c
@@ -7,6 +7,7 @@
 
 #define OUTFILE  "/tmp/cr-test.out"
 #define SEG_SIZE (20 * 4096)
+#define HTLB_SEG_SIZE (1024 * 1024 * 16)
 #define SEG_KEY1 11
 
 int main(int argc, char *argv[])
@@ -37,7 +38,7 @@ int main(int argc, char *argv[])
 		exit(1);
 	}
 
-	id2 = shmget(IPC_PRIVATE, SEG_SIZE, 0700|IPC_CREAT|IPC_EXCL);
+	id2 = shmget(IPC_PRIVATE, HTLB_SEG_SIZE, 0700|IPC_CREAT|IPC_EXCL|SHM_HUGETLB);
 	if (id2 < 0) {		
 		perror("shmget2");
 		exit(1);
@@ -63,9 +64,9 @@ int main(int argc, char *argv[])
 	if (shmdt(seg1) < 0)
 		perror("shmdt1");
 
-	fprintf(file, "detaches 2nd, sleeping 30\n");
+	fprintf(file, "detaches 2nd, sleeping 120\n");
 	fflush(file);
-	sleep(20);
+	sleep(120);
 	fprintf(file, "waking up\n");
 	fflush(file);
 
diff --git a/test/shmem.c b/test/shmem.c
index 6d7dd8a..cb9fd10 100644
--- a/test/shmem.c
+++ b/test/shmem.c
@@ -5,6 +5,10 @@
 #include <math.h>
 #include <sys/mman.h>
 
+#ifndef MAP_HUGETLB
+#define MAP_HUGETLB 0x40000
+#endif
+
 #define OUTFILE  "/tmp/cr-test.out"
 
 int main(int argc, char *argv[])
@@ -41,7 +45,7 @@ int main(int argc, char *argv[])
 	}
 
 	addr = mmap(NULL, 16384, PROT_READ | PROT_WRITE,
-		    MAP_ANONYMOUS | MAP_SHARED, 0, 0);
+		    MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB, 0, 0);
 	if (addr == MAP_FAILED) {
 		perror("mmap");
 		exit(1);
@@ -66,7 +70,7 @@ int main(int argc, char *argv[])
 		close(pipefd[1]);
 	}
 
-	for (i = 0; i < 10; i++) {
+	for (i = 0; i < 120; i++) {
 		sleep(1);
 		/* make the fpu work ->  a = a + i/10  */
 		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/8] checkpoint/restart: sysvshm fixes and hugetlb support
       [not found]     ` <4C92B831.40400-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-10-06 19:43       ` Nathan Lynch
  2010-11-01 17:45         ` Oren Laadan
  0 siblings, 1 reply; 21+ messages in thread
From: Nathan Lynch @ 2010-10-06 19:43 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Thu, 2010-09-16 at 20:37 -0400, Oren Laadan wrote:
> Hi Nathan,
> 
> Thanks for the patch series. For starters, I'll pull
> patches 1-3 for ckpt-v22-dev. I'll wait for reviews
> for the others.

Well, reviews haven't exactly been pouring in, but I'm reasonably
confident in these patches -- anything I can do to increase your comfort
level with applying them?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/8] checkpoint/restart of anonymous hugetlb mappings
       [not found]             ` <1284754993.4109.397.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2010-11-01 17:44               ` Oren Laadan
  0 siblings, 0 replies; 21+ messages in thread
From: Oren Laadan @ 2010-11-01 17:44 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA



On 09/17/2010 04:23 PM, Nathan Lynch wrote:
> On Thu, 2010-09-16 at 20:44 -0400, Oren Laadan wrote:
>>
>> On 09/14/2010 04:02 PM, Nathan Lynch wrote:
>>> Support checkpoint and restore of both private and shared
>>> hugepage-backed mappings established via mmap(MAP_HUGETLB).  Introduce
>>> APIs for checkpoint and restart of individual huge pages which are to
>>> be used by the sysv SHM_HUGETLB c/r code.
>>>
>>> Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
>>
>> The code looks clean, but I need to learn more about HUGETLB
>> before I can say much...
>>
>> Do you also have test-suite for this ?
> 
> Included below is a throwaway patch to user-cr's shmem and ipcshm tests
> which will cause them to use huge pages.  You'll need to configure huge
> pages on your system; see Documentation/vm/hugetlbpage.txt in the kernel
> source.

Thanks. I'll add it when I pull these.

>> [...]
>>
>>> +static int hugetlb_dump_contents(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
>>> +{
>>> +	struct ckpt_hdr_hpage hdr;
>>> +	unsigned long pageshift;
>>> +	unsigned long pagesize;
>>> +	unsigned long addr;
>>> +	int ret;
>>> +
>>> +	pageshift = huge_page_shift(hstate_vma(vma));
>>> +	pagesize = vma_kernel_pagesize(vma);
>>> +
>>> +	ckpt_hdr_hpage_init(&hdr, pageshift);
>>> +
>>> +	for (addr = vma->vm_start; addr < vma->vm_end; addr += pagesize) {
>>> +		struct page *page = NULL;
>>> +
>>> +		down_read(&vma->vm_mm->mmap_sem);
>>> +		ret = __get_user_pages(ctx->tsk, vma->vm_mm,
>>> +				       addr, 1, FOLL_DUMP | FOLL_GET,
>>> +				       &page, NULL);
>>> +		/* FOLL_DUMP gives -EFAULT for holes */
>>> +		if (ret == -EFAULT)
>>> +			ret = 0;
>>
>> With regular pages, this didn't always work, especially after they
>> slightly changed the semantics of FOLL_DUMP. So I introduced the
>> FOLL_DIRTY flag to detect dirty (non-zero) pages.  I wonder if
>> something like that may be needed here too ?
> 
> I don't think so - huge pages are never used to map regular files (they
> are always on hugetlbfs), so they can't get out of sync with a backing
> store.

Right.

Oren.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/8] checkpoint/restart: sysvshm fixes and hugetlb support
  2010-10-06 19:43       ` Nathan Lynch
@ 2010-11-01 17:45         ` Oren Laadan
  0 siblings, 0 replies; 21+ messages in thread
From: Oren Laadan @ 2010-11-01 17:45 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA


On 10/06/2010 03:43 PM, Nathan Lynch wrote:
> On Thu, 2010-09-16 at 20:37 -0400, Oren Laadan wrote:
>> Hi Nathan,
>>
>> Thanks for the patch series. For starters, I'll pull
>> patches 1-3 for ckpt-v22-dev. I'll wait for reviews
>> for the others.
> 
> Well, reviews haven't exactly been pouring in, but I'm reasonably
> confident in these patches -- anything I can do to increase your comfort
> level with applying them?
> 

Working on it - I'm playing catch-up...
(I'll look at your newer submission).

Oren.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2010-11-01 17:45 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-14 20:02 [PATCH 0/8] checkpoint/restart: sysvshm fixes and hugetlb support Nathan Lynch
     [not found] ` <1284494530-25946-1-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2010-09-14 20:02   ` [PATCH 1/8] sysvshm: check for hugetlb before assuming shmem Nathan Lynch
2010-09-14 20:02   ` [PATCH 2/8] sysvshm: report error on failure to reattach, avoid crash Nathan Lynch
     [not found]     ` <1284494530-25946-3-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2010-09-15  2:55       ` Matt Helsley
     [not found]         ` <20100915025502.GG8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-09-15  3:04           ` Matt Helsley
2010-09-17  0:17           ` Oren Laadan
2010-09-14 20:02   ` [PATCH 3/8] checkpoint/sysvshm: release rwsem earlier during restore Nathan Lynch
2010-09-14 20:02   ` [PATCH 4/8] checkpoint/ipc: allow shmat callers to specify ipc namespace Nathan Lynch
2010-09-14 20:02   ` [PATCH 5/8] checkpoint/restart of anonymous hugetlb mappings Nathan Lynch
     [not found]     ` <1284494530-25946-6-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2010-09-17  0:44       ` Oren Laadan
     [not found]         ` <4C92BA08.70106-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-09-17 20:23           ` Nathan Lynch
     [not found]             ` <1284754993.4109.397.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2010-11-01 17:44               ` Oren Laadan
2010-09-14 20:02   ` [PATCH 6/8] remove VM_HUGETLB and VM_RESERVED from CKPT_VMA_NOT_SUPPORTED Nathan Lynch
     [not found]     ` <1284494530-25946-7-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2010-09-17  3:35       ` Serge E. Hallyn
2010-09-14 20:02   ` [PATCH 7/8] hugetlbfs checkpoint/restart hooks Nathan Lynch
2010-09-14 20:02   ` [PATCH 8/8] checkpoint/restart of SysV SHM_HUGETLB regions Nathan Lynch
     [not found]     ` <1284494530-25946-9-git-send-email-ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2010-09-17  0:40       ` Oren Laadan
     [not found]         ` <4C92B903.20304-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-09-17 19:03           ` Nathan Lynch
2010-09-17  0:37   ` [PATCH 0/8] checkpoint/restart: sysvshm fixes and hugetlb support Oren Laadan
     [not found]     ` <4C92B831.40400-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-10-06 19:43       ` Nathan Lynch
2010-11-01 17:45         ` Oren Laadan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.