All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH 0/3][cr][v2]: Checkpoint/restart file leases
@ 2010-05-26  1:07 Sukadev Bhattiprolu
  2010-05-26  1:07 ` [RFC][PATCH 1/3][cr][v2]: Define do_setlease() Sukadev Bhattiprolu
                   ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-26  1:07 UTC (permalink / raw)
  To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers

Checkpoint/restart file leases. See comments in patches 2 and 3 for
details. These patches apply on top of the following patchset which
implements checkpoint/restart of posix locks and file-owners:

https://lists.linux-foundation.org/pipermail/containers/2010-May/024613.html

Changelog[v2]:

	- Added patch 3/3 to ensure that lease-holder gets only one SIGIO
	  and that the total time the lease-breaker waits is limited to
	  lease_break_time (these two were TODOs in [v1]).

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

Sukadev Bhattiprolu (3):
  Define do_setlease()
  Checkpoint/restart file leases
  fileleases: C/R of an in-progress lease.

 fs/checkpoint.c                |   30 ++++++++++-
 fs/locks.c                     |  111 +++++++++++++++++++++++++++++++++++-----
 include/linux/checkpoint_hdr.h |    3 +
 include/linux/fs.h             |    4 ++
 4 files changed, 132 insertions(+), 16 deletions(-)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC][PATCH 1/3][cr][v2]: Define do_setlease()
       [not found] ` <1274836063-13271-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-05-26  1:07   ` Sukadev Bhattiprolu
  2010-05-26  1:07   ` [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases Sukadev Bhattiprolu
  2010-05-26  1:07   ` [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease Sukadev Bhattiprolu
  2 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-26  1:07 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Containers, matthew-Ztpu424NOJ8

Move the core functionality of fcntl_setlease() into a new function,
do_setlease(). do_setlease() is same as fcntl_setlease() except that
it takes an extra 'rem_lease' parameter. do_setlease() will be used
in a follow-on patch to checkpoint/restart file-leases.

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 fs/locks.c         |   27 ++++++++++++++++-----------
 include/linux/fs.h |    1 +
 2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index c62ab7f..4107295 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1471,17 +1471,7 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
 }
 EXPORT_SYMBOL_GPL(vfs_setlease);
 
-/**
- *	fcntl_setlease	-	sets a lease on an open file
- *	@fd: open file descriptor
- *	@filp: file pointer
- *	@arg: type of lease to obtain
- *
- *	Call this fcntl to establish a lease on the file.
- *	Note that you also need to call %F_SETSIG to
- *	receive a signal when the lease is broken.
- */
-int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
+int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease)
 {
 	struct file_lock fl, *flp = &fl;
 	struct inode *inode = filp->f_path.dentry->d_inode;
@@ -1515,6 +1505,21 @@ out_unlock:
 }
 
 /**
+ *	fcntl_setlease	-	sets a lease on an open file
+ *	@fd: open file descriptor
+ *	@filp: file pointer
+ *	@arg: type of lease to obtain
+ *
+ *	Call this fcntl to establish a lease on the file.
+ *	Note that you also need to call %F_SETSIG to
+ *	receive a signal when the lease is broken.
+ */
+int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
+{
+	return do_setlease(fd, filp, arg, 0);
+}
+
+/**
  * flock_lock_file_wait - Apply a FLOCK-style lock to a file
  * @filp: The file to apply the lock to
  * @fl: The lock to be applied
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 49d4eeb..700317a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1122,6 +1122,7 @@ extern int flock64_set(unsigned int, struct file *, unsigned int,
 			struct flock64 *);
 #endif
 
+extern int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease);
 extern int fcntl_setlease(unsigned int fd, struct file *filp, long arg);
 extern int fcntl_getlease(struct file *filp);
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC][PATCH 1/3][cr][v2]: Define do_setlease()
  2010-05-26  1:07 [RFC][PATCH 0/3][cr][v2]: Checkpoint/restart file leases Sukadev Bhattiprolu
@ 2010-05-26  1:07 ` Sukadev Bhattiprolu
  2010-05-26 13:52   ` Serge E. Hallyn
       [not found]   ` <1274836063-13271-2-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2010-05-26  1:07 ` [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases Sukadev Bhattiprolu
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-26  1:07 UTC (permalink / raw)
  To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers

Move the core functionality of fcntl_setlease() into a new function,
do_setlease(). do_setlease() is same as fcntl_setlease() except that
it takes an extra 'rem_lease' parameter. do_setlease() will be used
in a follow-on patch to checkpoint/restart file-leases.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 fs/locks.c         |   27 ++++++++++++++++-----------
 include/linux/fs.h |    1 +
 2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index c62ab7f..4107295 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1471,17 +1471,7 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
 }
 EXPORT_SYMBOL_GPL(vfs_setlease);
 
-/**
- *	fcntl_setlease	-	sets a lease on an open file
- *	@fd: open file descriptor
- *	@filp: file pointer
- *	@arg: type of lease to obtain
- *
- *	Call this fcntl to establish a lease on the file.
- *	Note that you also need to call %F_SETSIG to
- *	receive a signal when the lease is broken.
- */
-int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
+int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease)
 {
 	struct file_lock fl, *flp = &fl;
 	struct inode *inode = filp->f_path.dentry->d_inode;
@@ -1515,6 +1505,21 @@ out_unlock:
 }
 
 /**
+ *	fcntl_setlease	-	sets a lease on an open file
+ *	@fd: open file descriptor
+ *	@filp: file pointer
+ *	@arg: type of lease to obtain
+ *
+ *	Call this fcntl to establish a lease on the file.
+ *	Note that you also need to call %F_SETSIG to
+ *	receive a signal when the lease is broken.
+ */
+int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
+{
+	return do_setlease(fd, filp, arg, 0);
+}
+
+/**
  * flock_lock_file_wait - Apply a FLOCK-style lock to a file
  * @filp: The file to apply the lock to
  * @fl: The lock to be applied
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 49d4eeb..700317a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1122,6 +1122,7 @@ extern int flock64_set(unsigned int, struct file *, unsigned int,
 			struct flock64 *);
 #endif
 
+extern int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease);
 extern int fcntl_setlease(unsigned int fd, struct file *filp, long arg);
 extern int fcntl_getlease(struct file *filp);
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
       [not found] ` <1274836063-13271-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2010-05-26  1:07   ` [RFC][PATCH 1/3][cr][v2]: Define do_setlease() Sukadev Bhattiprolu
@ 2010-05-26  1:07   ` Sukadev Bhattiprolu
  2010-05-26  1:07   ` [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease Sukadev Bhattiprolu
  2 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-26  1:07 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Containers, matthew-Ztpu424NOJ8

Build upon the C/R of file-locks to C/R file-leases. C/R of a lease that is
not being broken is almost identical to C/R of file-locks.  i.e save the type
of lease for the file in the checkpoint image and when restarting, restore
the lease by calling do_setlease().

C/R of file-lease gets complicated (I think), if a process is checkpointed
when its lease was being revoked. i.e if P1 has a F_WRLCK lease on file F1
and P2 opens F1 for write, P2's open is blocked for lease_break_time (45 secs).
P1's lease is revoked (i.e set to F_UNLCK) and P1 is notified via a SIGIO to
flush any dirty data.

This brings up two issues:

First, if P1 is checkpointed during this lease_break_time, we need to remember
to that P1 originally had a F_WRLCK lease which is now revoked to F_UNLCK.
Checkpointing the "current lease type" would wrongly save the lease-type as
F_UNLCK.

Secondly, if P1 was checkpointed 40 seconds into the lease_break_time,(i.e.
it had 5 seconds remaining in the lease), we want to ensure that after restart,
P1 gets at least 5 more seconds in the lease (no ?). (i.e P1 could be in the
its SIGIO handler when it was checkpointed and may be about to start a new
write(). If P1 does not gets its 5 seconds and P2's open and a read()
completes, we would have a data corruption).

This patch addresses the first issue above by adding file_lock->fl_type_prev
field. When a lease is downgraded/revoked, the original lease type is saved
in ->fl_type_prev and is also checkpointed. When the process P1 is restarted,
the kernel temporarily restores the original (F_WRLCK) lease.  When process
P2 is restarted, the open() would fail with -ERESTARTSYS and the open() would
be repeated. This open() would initiate the lease-break protocol again on P1.

To address the second issue above, this patch saves the remaining-lease in
the checkpoint image, but does not (yet) use this value. The plan is to use
this remaining-lease period when P1/P2 are restarted so that P2 is blocked
only for the remaining-lease rather than entire lease_break_time. I want to
check if there are better ways to address this.

Changelog[v2]:
	- Added [PATCH 3/3] to remove the TODOs that were listed in this patch

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 fs/checkpoint.c                |   25 ++++++++++++++++++++++---
 fs/locks.c                     |   40 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    2 ++
 include/linux/fs.h             |    1 +
 4 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/fs/checkpoint.c b/fs/checkpoint.c
index 64809db..2baeb32 100644
--- a/fs/checkpoint.c
+++ b/fs/checkpoint.c
@@ -280,8 +280,14 @@ static int checkpoint_one_file_lock(struct ckpt_ctx *ctx, struct file *file,
 	if (lock) {
 		h->fl_start = lock->fl_start;
 		h->fl_end = lock->fl_end;
+		/* checkpoint F_INPROGRESS if set for now */
 		h->fl_type = lock->fl_type;
+		h->fl_type_prev = lock->fl_type_prev;
 		h->fl_flags = lock->fl_flags;
+		if (h->fl_type & F_INPROGRESS && 
+					(lock->fl_break_time > jiffies))
+			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
+
 	} else {
 		/* Checkpoint a dummy lock as a marker */
 		h->fl_start = -1;
@@ -315,7 +321,7 @@ checkpoint_file_locks(struct ckpt_ctx *ctx, struct files_struct *files,
 			continue;
 
 		rc = -EBADF;
-		if (IS_POSIX(lockp))
+		if (IS_POSIX(lockp) || IS_LEASE(lockp))
 			rc = checkpoint_one_file_lock(ctx, file, lockp);
 
 		if (rc < 0) {
@@ -1055,8 +1061,10 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
 		if (IS_ERR(h))
 			return PTR_ERR(h);
 
-		ckpt_debug("Lock [%lld, %lld, %d, 0x%x]\n", h->fl_start,
-				h->fl_end, (int)h->fl_type, h->fl_flags);
+		ckpt_debug("Lock [%lld, %lld, %d, 0x%x], rem-lease %lus, "
+				"fl-type-prev %d\n", h->fl_start, h->fl_end,
+				(int)h->fl_type, h->fl_flags, h->fl_rem_lease,
+				h->fl_type_prev);
 
 		/*
 		 * If it is a dummy-lock, we are done with this fd.
@@ -1070,6 +1078,17 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
 		if (h->fl_flags & FL_POSIX)
 			ret = restore_one_file_lock(ctx, file, fd, h); 
 
+		else if (h->fl_flags & FL_LEASE) {
+			int type;
+
+			type = h->fl_type;
+			if (h->fl_type & F_INPROGRESS)
+				type = h->fl_type_prev;
+			ret = do_setlease(fd, file, type, h->fl_rem_lease);
+			if (ret)
+				ckpt_err(ctx, ret, "do_setlease(): %d\n", type);
+		}
+
 		if (ret < 0)
 			ckpt_err(ctx, ret, "%(T) fl_flags 0x%x\n", h->fl_flags);
 	}
diff --git a/fs/locks.c b/fs/locks.c
index 4107295..7a80278 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -184,6 +184,8 @@ void locks_init_lock(struct file_lock *fl)
 	fl->fl_file = NULL;
 	fl->fl_flags = 0;
 	fl->fl_type = 0;
+	fl->fl_type_prev = 0;
+	fl->fl_break_time = 0UL;
 	fl->fl_start = fl->fl_end = 0;
 	fl->fl_ops = NULL;
 	fl->fl_lmops = NULL;
@@ -291,6 +293,13 @@ static int assign_type(struct file_lock *fl, int type)
 	case F_WRLCK:
 	case F_UNLCK:
 		fl->fl_type = type;
+		/*
+		 * Clear fl_type_prev since we now have a new lease-type.
+		 * That way, break_lease() will know to save the new lease-type
+		 * in case of a checkpoint. (non-lease file-locks don't use
+		 * ->fl_type_prev).  
+		 */
+		fl->fl_type_prev = 0;
 		break;
 	default:
 		return -EINVAL;
@@ -1211,6 +1220,16 @@ int __break_lease(struct inode *inode, unsigned int mode)
 		goto out;
 	}
 
+	/*
+	 * TODO: Checkpoint/restart. Suppose lease_break_time was 45 seonds and
+	 * 	 we were checkpointed when we had 35 seconds remaining in our
+	 * 	 lease. When we are restarted, should we get only 35 seconds
+	 * 	 of the lease and not the full lease_break_time ?
+	 *
+	 * 	 We checkpoint ->fl_break_time in the hope that we can use it
+	 * 	 to calculate the remaining lease, but for now, give the
+	 * 	 restarted process the full 'lease_break_time'.
+	 */
 	break_time = 0;
 	if (lease_break_time > 0) {
 		break_time = jiffies + lease_break_time * HZ;
@@ -1220,8 +1239,29 @@ int __break_lease(struct inode *inode, unsigned int mode)
 
 	for (fl = flock; fl && IS_LEASE(fl); fl = fl->fl_next) {
 		if (fl->fl_type != future) {
+			/*
+			 * CHECK:
+			 *
+			 * If fl_type_prev is already set, we could be in a
+			 * recursive checkpoint-restart i.e we were checkpointed
+			 * once when our lease was being broken. We were then
+			 * restarted from the checkpoint and checkpointed
+			 * again before the restored lease expired. In this
+			 * case, we want to restore the lease to the original
+			 * type. So don't overwrite fl_type_prev if its already
+			 * set.
+			 */
+			if (!fl->fl_type_prev)
+				fl->fl_type_prev = fl->fl_type;
 			fl->fl_type = future;
 			fl->fl_break_time = break_time;
+
+			/*
+			 * TODO: ->fl_break() sends the SIGIO to lease-holder.
+			 *       If lease-holder was checkpointed/restarted and
+			 *       this is a restarted lease, we should not
+			 *       re-send the SIGIO ?
+			 */
 			/* lease must have lmops break callback */
 			fl->fl_lmops->fl_break(fl);
 		}
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 4509016..ddad73f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -588,7 +588,9 @@ struct ckpt_hdr_file_lock {
        __s64 fl_start;
        __s64 fl_end;
        __u8 fl_type;
+       __u8 fl_type_prev;
        __u8 fl_flags;
+       unsigned long fl_rem_lease;
 };
 
 struct ckpt_hdr_file_pipe {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 700317a..54b2a7b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1066,6 +1066,7 @@ struct file_lock {
 	fl_owner_t fl_owner;
 	unsigned char fl_flags;
 	unsigned char fl_type;
+	unsigned char fl_type_prev;
 	unsigned int fl_pid;
 	struct pid *fl_nspid;
 	wait_queue_head_t fl_wait;
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
  2010-05-26  1:07 [RFC][PATCH 0/3][cr][v2]: Checkpoint/restart file leases Sukadev Bhattiprolu
  2010-05-26  1:07 ` [RFC][PATCH 1/3][cr][v2]: Define do_setlease() Sukadev Bhattiprolu
@ 2010-05-26  1:07 ` Sukadev Bhattiprolu
  2010-06-15  4:40   ` Oren Laadan
       [not found]   ` <1274836063-13271-3-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2010-05-26  1:07 ` [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease Sukadev Bhattiprolu
       [not found] ` <1274836063-13271-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  3 siblings, 2 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-26  1:07 UTC (permalink / raw)
  To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers

Build upon the C/R of file-locks to C/R file-leases. C/R of a lease that is
not being broken is almost identical to C/R of file-locks.  i.e save the type
of lease for the file in the checkpoint image and when restarting, restore
the lease by calling do_setlease().

C/R of file-lease gets complicated (I think), if a process is checkpointed
when its lease was being revoked. i.e if P1 has a F_WRLCK lease on file F1
and P2 opens F1 for write, P2's open is blocked for lease_break_time (45 secs).
P1's lease is revoked (i.e set to F_UNLCK) and P1 is notified via a SIGIO to
flush any dirty data.

This brings up two issues:

First, if P1 is checkpointed during this lease_break_time, we need to remember
to that P1 originally had a F_WRLCK lease which is now revoked to F_UNLCK.
Checkpointing the "current lease type" would wrongly save the lease-type as
F_UNLCK.

Secondly, if P1 was checkpointed 40 seconds into the lease_break_time,(i.e.
it had 5 seconds remaining in the lease), we want to ensure that after restart,
P1 gets at least 5 more seconds in the lease (no ?). (i.e P1 could be in the
its SIGIO handler when it was checkpointed and may be about to start a new
write(). If P1 does not gets its 5 seconds and P2's open and a read()
completes, we would have a data corruption).

This patch addresses the first issue above by adding file_lock->fl_type_prev
field. When a lease is downgraded/revoked, the original lease type is saved
in ->fl_type_prev and is also checkpointed. When the process P1 is restarted,
the kernel temporarily restores the original (F_WRLCK) lease.  When process
P2 is restarted, the open() would fail with -ERESTARTSYS and the open() would
be repeated. This open() would initiate the lease-break protocol again on P1.

To address the second issue above, this patch saves the remaining-lease in
the checkpoint image, but does not (yet) use this value. The plan is to use
this remaining-lease period when P1/P2 are restarted so that P2 is blocked
only for the remaining-lease rather than entire lease_break_time. I want to
check if there are better ways to address this.

Changelog[v2]:
	- Added [PATCH 3/3] to remove the TODOs that were listed in this patch

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 fs/checkpoint.c                |   25 ++++++++++++++++++++++---
 fs/locks.c                     |   40 ++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint_hdr.h |    2 ++
 include/linux/fs.h             |    1 +
 4 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/fs/checkpoint.c b/fs/checkpoint.c
index 64809db..2baeb32 100644
--- a/fs/checkpoint.c
+++ b/fs/checkpoint.c
@@ -280,8 +280,14 @@ static int checkpoint_one_file_lock(struct ckpt_ctx *ctx, struct file *file,
 	if (lock) {
 		h->fl_start = lock->fl_start;
 		h->fl_end = lock->fl_end;
+		/* checkpoint F_INPROGRESS if set for now */
 		h->fl_type = lock->fl_type;
+		h->fl_type_prev = lock->fl_type_prev;
 		h->fl_flags = lock->fl_flags;
+		if (h->fl_type & F_INPROGRESS && 
+					(lock->fl_break_time > jiffies))
+			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
+
 	} else {
 		/* Checkpoint a dummy lock as a marker */
 		h->fl_start = -1;
@@ -315,7 +321,7 @@ checkpoint_file_locks(struct ckpt_ctx *ctx, struct files_struct *files,
 			continue;
 
 		rc = -EBADF;
-		if (IS_POSIX(lockp))
+		if (IS_POSIX(lockp) || IS_LEASE(lockp))
 			rc = checkpoint_one_file_lock(ctx, file, lockp);
 
 		if (rc < 0) {
@@ -1055,8 +1061,10 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
 		if (IS_ERR(h))
 			return PTR_ERR(h);
 
-		ckpt_debug("Lock [%lld, %lld, %d, 0x%x]\n", h->fl_start,
-				h->fl_end, (int)h->fl_type, h->fl_flags);
+		ckpt_debug("Lock [%lld, %lld, %d, 0x%x], rem-lease %lus, "
+				"fl-type-prev %d\n", h->fl_start, h->fl_end,
+				(int)h->fl_type, h->fl_flags, h->fl_rem_lease,
+				h->fl_type_prev);
 
 		/*
 		 * If it is a dummy-lock, we are done with this fd.
@@ -1070,6 +1078,17 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
 		if (h->fl_flags & FL_POSIX)
 			ret = restore_one_file_lock(ctx, file, fd, h); 
 
+		else if (h->fl_flags & FL_LEASE) {
+			int type;
+
+			type = h->fl_type;
+			if (h->fl_type & F_INPROGRESS)
+				type = h->fl_type_prev;
+			ret = do_setlease(fd, file, type, h->fl_rem_lease);
+			if (ret)
+				ckpt_err(ctx, ret, "do_setlease(): %d\n", type);
+		}
+
 		if (ret < 0)
 			ckpt_err(ctx, ret, "%(T) fl_flags 0x%x\n", h->fl_flags);
 	}
diff --git a/fs/locks.c b/fs/locks.c
index 4107295..7a80278 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -184,6 +184,8 @@ void locks_init_lock(struct file_lock *fl)
 	fl->fl_file = NULL;
 	fl->fl_flags = 0;
 	fl->fl_type = 0;
+	fl->fl_type_prev = 0;
+	fl->fl_break_time = 0UL;
 	fl->fl_start = fl->fl_end = 0;
 	fl->fl_ops = NULL;
 	fl->fl_lmops = NULL;
@@ -291,6 +293,13 @@ static int assign_type(struct file_lock *fl, int type)
 	case F_WRLCK:
 	case F_UNLCK:
 		fl->fl_type = type;
+		/*
+		 * Clear fl_type_prev since we now have a new lease-type.
+		 * That way, break_lease() will know to save the new lease-type
+		 * in case of a checkpoint. (non-lease file-locks don't use
+		 * ->fl_type_prev).  
+		 */
+		fl->fl_type_prev = 0;
 		break;
 	default:
 		return -EINVAL;
@@ -1211,6 +1220,16 @@ int __break_lease(struct inode *inode, unsigned int mode)
 		goto out;
 	}
 
+	/*
+	 * TODO: Checkpoint/restart. Suppose lease_break_time was 45 seonds and
+	 * 	 we were checkpointed when we had 35 seconds remaining in our
+	 * 	 lease. When we are restarted, should we get only 35 seconds
+	 * 	 of the lease and not the full lease_break_time ?
+	 *
+	 * 	 We checkpoint ->fl_break_time in the hope that we can use it
+	 * 	 to calculate the remaining lease, but for now, give the
+	 * 	 restarted process the full 'lease_break_time'.
+	 */
 	break_time = 0;
 	if (lease_break_time > 0) {
 		break_time = jiffies + lease_break_time * HZ;
@@ -1220,8 +1239,29 @@ int __break_lease(struct inode *inode, unsigned int mode)
 
 	for (fl = flock; fl && IS_LEASE(fl); fl = fl->fl_next) {
 		if (fl->fl_type != future) {
+			/*
+			 * CHECK:
+			 *
+			 * If fl_type_prev is already set, we could be in a
+			 * recursive checkpoint-restart i.e we were checkpointed
+			 * once when our lease was being broken. We were then
+			 * restarted from the checkpoint and checkpointed
+			 * again before the restored lease expired. In this
+			 * case, we want to restore the lease to the original
+			 * type. So don't overwrite fl_type_prev if its already
+			 * set.
+			 */
+			if (!fl->fl_type_prev)
+				fl->fl_type_prev = fl->fl_type;
 			fl->fl_type = future;
 			fl->fl_break_time = break_time;
+
+			/*
+			 * TODO: ->fl_break() sends the SIGIO to lease-holder.
+			 *       If lease-holder was checkpointed/restarted and
+			 *       this is a restarted lease, we should not
+			 *       re-send the SIGIO ?
+			 */
 			/* lease must have lmops break callback */
 			fl->fl_lmops->fl_break(fl);
 		}
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 4509016..ddad73f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -588,7 +588,9 @@ struct ckpt_hdr_file_lock {
        __s64 fl_start;
        __s64 fl_end;
        __u8 fl_type;
+       __u8 fl_type_prev;
        __u8 fl_flags;
+       unsigned long fl_rem_lease;
 };
 
 struct ckpt_hdr_file_pipe {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 700317a..54b2a7b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1066,6 +1066,7 @@ struct file_lock {
 	fl_owner_t fl_owner;
 	unsigned char fl_flags;
 	unsigned char fl_type;
+	unsigned char fl_type_prev;
 	unsigned int fl_pid;
 	struct pid *fl_nspid;
 	wait_queue_head_t fl_wait;
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
       [not found] ` <1274836063-13271-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2010-05-26  1:07   ` [RFC][PATCH 1/3][cr][v2]: Define do_setlease() Sukadev Bhattiprolu
  2010-05-26  1:07   ` [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases Sukadev Bhattiprolu
@ 2010-05-26  1:07   ` Sukadev Bhattiprolu
  2 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-26  1:07 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Containers, matthew-Ztpu424NOJ8

If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
a SIGIO to cleanup it lease in preparation for P2's open.  If the two
processes are checkpointed/restarted in this window, we should address
following two issues:

	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
	  SIGIO before checkpoint, it should not get the SIGIO after restart).

	- If R seconds remain in the lease, P2's open should be blocked for
	  at least the R seconds, so P1 has the time to clean up its lease.
	  The previous patch gives P1 the entire lease_break_time but that
	  can leave P2 stalled for 2*lease_break_time.

To address first, we add a field ->fl_break_notified to "remember" if we
notified the lease-holder already. We save this field in the checkpoint
image and when restarting, we notify the lease-holder only if this field
is not set.

To address the second issue, we also checkpoint the ->fl_break_time for
an in-progress lease. When restarting the process, we ensure that the
lease-holder sleeps only for the remaining-lease rather than the entire
lease.

These two fixes sound like an approximation (see comments in do_setlease()
and __break_lease() below) and are also a bit kludgy (hence a separate patch
for now).

Appreciate comments on how we can do this better. Specifically:

	- do we even need to try and address the second issue above or
	  just let P1 have the entire lease_break_time again ?

	- theoretically, the R seconds should start counting after *all*
	  processes in the application-process tree have been restarted,
	  since P1 waits inside the kernel for a portion of the remaining
	  lease - should we then add a delta to R ?

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 fs/checkpoint.c                |   13 ++++++---
 fs/locks.c                     |   56 ++++++++++++++++++++++++++++++++++-----
 include/linux/checkpoint_hdr.h |    1 +
 include/linux/fs.h             |    4 ++-
 4 files changed, 61 insertions(+), 13 deletions(-)

diff --git a/fs/checkpoint.c b/fs/checkpoint.c
index 2baeb32..f2adb38 100644
--- a/fs/checkpoint.c
+++ b/fs/checkpoint.c
@@ -284,9 +284,13 @@ static int checkpoint_one_file_lock(struct ckpt_ctx *ctx, struct file *file,
 		h->fl_type = lock->fl_type;
 		h->fl_type_prev = lock->fl_type_prev;
 		h->fl_flags = lock->fl_flags;
-		if (h->fl_type & F_INPROGRESS && 
-					(lock->fl_break_time > jiffies))
-			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
+		if (IS_LEASE(lock)) {
+			unsigned long bt = lock->fl_break_time;
+
+			h->fl_break_notified = lock->fl_break_notified;
+			if ((h->fl_type & F_INPROGRESS) && (bt > jiffies))
+				h->fl_rem_lease = (bt - jiffies) / HZ;
+		}
 
 	} else {
 		/* Checkpoint a dummy lock as a marker */
@@ -1084,7 +1088,8 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
 			type = h->fl_type;
 			if (h->fl_type & F_INPROGRESS)
 				type = h->fl_type_prev;
-			ret = do_setlease(fd, file, type, h->fl_rem_lease);
+			ret = do_setlease(fd, file, type, h->fl_rem_lease,
+					h->fl_break_notified);
 			if (ret)
 				ckpt_err(ctx, ret, "do_setlease(): %d\n", type);
 		}
diff --git a/fs/locks.c b/fs/locks.c
index 7a80278..c6ef829 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -185,6 +185,7 @@ void locks_init_lock(struct file_lock *fl)
 	fl->fl_flags = 0;
 	fl->fl_type = 0;
 	fl->fl_type_prev = 0;
+	fl->fl_break_notified = 0;
 	fl->fl_break_time = 0UL;
 	fl->fl_start = fl->fl_end = 0;
 	fl->fl_ops = NULL;
@@ -229,6 +230,8 @@ void __locks_copy_lock(struct file_lock *new, const struct file_lock *fl)
 	new->fl_flags = fl->fl_flags;
 	new->fl_type = fl->fl_type;
 	new->fl_start = fl->fl_start;
+	new->fl_break_time = fl->fl_break_time;
+	new->fl_break_notified = fl->fl_break_notified;
 	new->fl_end = fl->fl_end;
 	new->fl_ops = NULL;
 	new->fl_lmops = NULL;
@@ -460,6 +463,8 @@ static int lease_init(struct file *filp, int type, struct file_lock *fl)
 	fl->fl_end = OFFSET_MAX;
 	fl->fl_ops = NULL;
 	fl->fl_lmops = &lease_manager_ops;
+	fl->fl_break_time = 0UL;
+	fl->fl_break_notified = 0;
 	return 0;
 }
 
@@ -1139,6 +1144,7 @@ int lease_modify(struct file_lock **before, int arg)
 	struct file_lock *fl = *before;
 	int error = assign_type(fl, arg);
 
+	fl->fl_break_notified = 0;
 	if (error)
 		return error;
 	locks_wake_up_blocks(fl);
@@ -1254,16 +1260,25 @@ int __break_lease(struct inode *inode, unsigned int mode)
 			if (!fl->fl_type_prev)
 				fl->fl_type_prev = fl->fl_type;
 			fl->fl_type = future;
-			fl->fl_break_time = break_time;
 
 			/*
-			 * TODO: ->fl_break() sends the SIGIO to lease-holder.
-			 *       If lease-holder was checkpointed/restarted and
-			 *       this is a restarted lease, we should not
-			 *       re-send the SIGIO ?
+			 * CHECK:
+			 *
+			 * Similarly, if ->fl_break_time or ->fl_break_notified
+			 * are set, we were in the middle of a lease-break
+			 * when we were checkpointed. So we don't need to
+			 * notify of the lease holder again or wait for the
+			 * entire lease_break_time. Note that this time
+			 * could be more than lease_break_time since we use
+			 * the value that was checkpointed.
 			 */
+			if (!fl->fl_break_time)
+				fl->fl_break_time = break_time;
+
 			/* lease must have lmops break callback */
-			fl->fl_lmops->fl_break(fl);
+			if (!fl->fl_break_notified)
+				fl->fl_lmops->fl_break(fl);
+			fl->fl_break_notified = 1;
 		}
 	}
 
@@ -1511,7 +1526,8 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
 }
 EXPORT_SYMBOL_GPL(vfs_setlease);
 
-int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease)
+int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease,
+		int notified)
 {
 	struct file_lock fl, *flp = &fl;
 	struct inode *inode = filp->f_path.dentry->d_inode;
@@ -1521,6 +1537,30 @@ int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease)
 	error = lease_init(filp, arg, &fl);
 	if (error)
 		return error;
+	fl.fl_break_notified = notified;
+
+	/*
+	 * Checkpoint/restart:
+	 *
+	 * If this lease is from a checkpoint, use the 'remaining-lease' that
+	 * was checkpointed. 
+	 *
+	 * NOTE: There are couple of observations:
+	 *
+	 * 	- We look at jiffies now and decide the absolute end time for
+	 * 	  the lease, but the lease-holder is still frozen and will not
+	 * 	  actually have the entire time to clean up. When the lease-
+	 * 	  holder gets to run, depends on how many other processes are
+	 * 	  in the checkpointed application's process-tree.
+	 *
+	 * 	- We assume that the lease-breaker was also checkpointed and
+	 * 	  set up the lease/file_lock object in anticipation of the 
+	 * 	  lease-breaker retrying their open(). If the lease-breaker
+	 * 	  was not checkpointed, the lease-holder continues to have the
+	 * 	  lease until another lease-breaker comes along.
+	 */
+	if (rem_lease)
+		fl.fl_break_time = jiffies + rem_lease * HZ;
 
 	lock_kernel();
 
@@ -1556,7 +1596,7 @@ out_unlock:
  */
 int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
 {
-	return do_setlease(fd, filp, arg, 0);
+	return do_setlease(fd, filp, arg, 0, 0);
 }
 
 /**
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ddad73f..c741e6a 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -590,6 +590,7 @@ struct ckpt_hdr_file_lock {
        __u8 fl_type;
        __u8 fl_type_prev;
        __u8 fl_flags;
+       __u8 fl_break_notified;
        unsigned long fl_rem_lease;
 };
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 54b2a7b..74da7bb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1067,6 +1067,7 @@ struct file_lock {
 	unsigned char fl_flags;
 	unsigned char fl_type;
 	unsigned char fl_type_prev;
+	unsigned char fl_break_notified;
 	unsigned int fl_pid;
 	struct pid *fl_nspid;
 	wait_queue_head_t fl_wait;
@@ -1123,7 +1124,8 @@ extern int flock64_set(unsigned int, struct file *, unsigned int,
 			struct flock64 *);
 #endif
 
-extern int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease);
+extern int do_setlease(unsigned int fd, struct file *filp, long arg, 
+			int rem_lease, int notified);
 extern int fcntl_setlease(unsigned int fd, struct file *filp, long arg);
 extern int fcntl_getlease(struct file *filp);
 
-- 
1.6.0.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
  2010-05-26  1:07 [RFC][PATCH 0/3][cr][v2]: Checkpoint/restart file leases Sukadev Bhattiprolu
  2010-05-26  1:07 ` [RFC][PATCH 1/3][cr][v2]: Define do_setlease() Sukadev Bhattiprolu
  2010-05-26  1:07 ` [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases Sukadev Bhattiprolu
@ 2010-05-26  1:07 ` Sukadev Bhattiprolu
  2010-06-15  4:43   ` Oren Laadan
                     ` (3 more replies)
       [not found] ` <1274836063-13271-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  3 siblings, 4 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-26  1:07 UTC (permalink / raw)
  To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers

If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
a SIGIO to cleanup it lease in preparation for P2's open.  If the two
processes are checkpointed/restarted in this window, we should address
following two issues:

	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
	  SIGIO before checkpoint, it should not get the SIGIO after restart).

	- If R seconds remain in the lease, P2's open should be blocked for
	  at least the R seconds, so P1 has the time to clean up its lease.
	  The previous patch gives P1 the entire lease_break_time but that
	  can leave P2 stalled for 2*lease_break_time.

To address first, we add a field ->fl_break_notified to "remember" if we
notified the lease-holder already. We save this field in the checkpoint
image and when restarting, we notify the lease-holder only if this field
is not set.

To address the second issue, we also checkpoint the ->fl_break_time for
an in-progress lease. When restarting the process, we ensure that the
lease-holder sleeps only for the remaining-lease rather than the entire
lease.

These two fixes sound like an approximation (see comments in do_setlease()
and __break_lease() below) and are also a bit kludgy (hence a separate patch
for now).

Appreciate comments on how we can do this better. Specifically:

	- do we even need to try and address the second issue above or
	  just let P1 have the entire lease_break_time again ?

	- theoretically, the R seconds should start counting after *all*
	  processes in the application-process tree have been restarted,
	  since P1 waits inside the kernel for a portion of the remaining
	  lease - should we then add a delta to R ?

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 fs/checkpoint.c                |   13 ++++++---
 fs/locks.c                     |   56 ++++++++++++++++++++++++++++++++++-----
 include/linux/checkpoint_hdr.h |    1 +
 include/linux/fs.h             |    4 ++-
 4 files changed, 61 insertions(+), 13 deletions(-)

diff --git a/fs/checkpoint.c b/fs/checkpoint.c
index 2baeb32..f2adb38 100644
--- a/fs/checkpoint.c
+++ b/fs/checkpoint.c
@@ -284,9 +284,13 @@ static int checkpoint_one_file_lock(struct ckpt_ctx *ctx, struct file *file,
 		h->fl_type = lock->fl_type;
 		h->fl_type_prev = lock->fl_type_prev;
 		h->fl_flags = lock->fl_flags;
-		if (h->fl_type & F_INPROGRESS && 
-					(lock->fl_break_time > jiffies))
-			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
+		if (IS_LEASE(lock)) {
+			unsigned long bt = lock->fl_break_time;
+
+			h->fl_break_notified = lock->fl_break_notified;
+			if ((h->fl_type & F_INPROGRESS) && (bt > jiffies))
+				h->fl_rem_lease = (bt - jiffies) / HZ;
+		}
 
 	} else {
 		/* Checkpoint a dummy lock as a marker */
@@ -1084,7 +1088,8 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
 			type = h->fl_type;
 			if (h->fl_type & F_INPROGRESS)
 				type = h->fl_type_prev;
-			ret = do_setlease(fd, file, type, h->fl_rem_lease);
+			ret = do_setlease(fd, file, type, h->fl_rem_lease,
+					h->fl_break_notified);
 			if (ret)
 				ckpt_err(ctx, ret, "do_setlease(): %d\n", type);
 		}
diff --git a/fs/locks.c b/fs/locks.c
index 7a80278..c6ef829 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -185,6 +185,7 @@ void locks_init_lock(struct file_lock *fl)
 	fl->fl_flags = 0;
 	fl->fl_type = 0;
 	fl->fl_type_prev = 0;
+	fl->fl_break_notified = 0;
 	fl->fl_break_time = 0UL;
 	fl->fl_start = fl->fl_end = 0;
 	fl->fl_ops = NULL;
@@ -229,6 +230,8 @@ void __locks_copy_lock(struct file_lock *new, const struct file_lock *fl)
 	new->fl_flags = fl->fl_flags;
 	new->fl_type = fl->fl_type;
 	new->fl_start = fl->fl_start;
+	new->fl_break_time = fl->fl_break_time;
+	new->fl_break_notified = fl->fl_break_notified;
 	new->fl_end = fl->fl_end;
 	new->fl_ops = NULL;
 	new->fl_lmops = NULL;
@@ -460,6 +463,8 @@ static int lease_init(struct file *filp, int type, struct file_lock *fl)
 	fl->fl_end = OFFSET_MAX;
 	fl->fl_ops = NULL;
 	fl->fl_lmops = &lease_manager_ops;
+	fl->fl_break_time = 0UL;
+	fl->fl_break_notified = 0;
 	return 0;
 }
 
@@ -1139,6 +1144,7 @@ int lease_modify(struct file_lock **before, int arg)
 	struct file_lock *fl = *before;
 	int error = assign_type(fl, arg);
 
+	fl->fl_break_notified = 0;
 	if (error)
 		return error;
 	locks_wake_up_blocks(fl);
@@ -1254,16 +1260,25 @@ int __break_lease(struct inode *inode, unsigned int mode)
 			if (!fl->fl_type_prev)
 				fl->fl_type_prev = fl->fl_type;
 			fl->fl_type = future;
-			fl->fl_break_time = break_time;
 
 			/*
-			 * TODO: ->fl_break() sends the SIGIO to lease-holder.
-			 *       If lease-holder was checkpointed/restarted and
-			 *       this is a restarted lease, we should not
-			 *       re-send the SIGIO ?
+			 * CHECK:
+			 *
+			 * Similarly, if ->fl_break_time or ->fl_break_notified
+			 * are set, we were in the middle of a lease-break
+			 * when we were checkpointed. So we don't need to
+			 * notify of the lease holder again or wait for the
+			 * entire lease_break_time. Note that this time
+			 * could be more than lease_break_time since we use
+			 * the value that was checkpointed.
 			 */
+			if (!fl->fl_break_time)
+				fl->fl_break_time = break_time;
+
 			/* lease must have lmops break callback */
-			fl->fl_lmops->fl_break(fl);
+			if (!fl->fl_break_notified)
+				fl->fl_lmops->fl_break(fl);
+			fl->fl_break_notified = 1;
 		}
 	}
 
@@ -1511,7 +1526,8 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
 }
 EXPORT_SYMBOL_GPL(vfs_setlease);
 
-int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease)
+int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease,
+		int notified)
 {
 	struct file_lock fl, *flp = &fl;
 	struct inode *inode = filp->f_path.dentry->d_inode;
@@ -1521,6 +1537,30 @@ int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease)
 	error = lease_init(filp, arg, &fl);
 	if (error)
 		return error;
+	fl.fl_break_notified = notified;
+
+	/*
+	 * Checkpoint/restart:
+	 *
+	 * If this lease is from a checkpoint, use the 'remaining-lease' that
+	 * was checkpointed. 
+	 *
+	 * NOTE: There are couple of observations:
+	 *
+	 * 	- We look at jiffies now and decide the absolute end time for
+	 * 	  the lease, but the lease-holder is still frozen and will not
+	 * 	  actually have the entire time to clean up. When the lease-
+	 * 	  holder gets to run, depends on how many other processes are
+	 * 	  in the checkpointed application's process-tree.
+	 *
+	 * 	- We assume that the lease-breaker was also checkpointed and
+	 * 	  set up the lease/file_lock object in anticipation of the 
+	 * 	  lease-breaker retrying their open(). If the lease-breaker
+	 * 	  was not checkpointed, the lease-holder continues to have the
+	 * 	  lease until another lease-breaker comes along.
+	 */
+	if (rem_lease)
+		fl.fl_break_time = jiffies + rem_lease * HZ;
 
 	lock_kernel();
 
@@ -1556,7 +1596,7 @@ out_unlock:
  */
 int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
 {
-	return do_setlease(fd, filp, arg, 0);
+	return do_setlease(fd, filp, arg, 0, 0);
 }
 
 /**
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ddad73f..c741e6a 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -590,6 +590,7 @@ struct ckpt_hdr_file_lock {
        __u8 fl_type;
        __u8 fl_type_prev;
        __u8 fl_flags;
+       __u8 fl_break_notified;
        unsigned long fl_rem_lease;
 };
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 54b2a7b..74da7bb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1067,6 +1067,7 @@ struct file_lock {
 	unsigned char fl_flags;
 	unsigned char fl_type;
 	unsigned char fl_type_prev;
+	unsigned char fl_break_notified;
 	unsigned int fl_pid;
 	struct pid *fl_nspid;
 	wait_queue_head_t fl_wait;
@@ -1123,7 +1124,8 @@ extern int flock64_set(unsigned int, struct file *, unsigned int,
 			struct flock64 *);
 #endif
 
-extern int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease);
+extern int do_setlease(unsigned int fd, struct file *filp, long arg, 
+			int rem_lease, int notified);
 extern int fcntl_setlease(unsigned int fd, struct file *filp, long arg);
 extern int fcntl_getlease(struct file *filp);
 
-- 
1.6.0.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 1/3][cr][v2]: Define do_setlease()
       [not found]   ` <1274836063-13271-2-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-05-26 13:52     ` Serge E. Hallyn
  0 siblings, 0 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2010-05-26 13:52 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Containers, matthew-Ztpu424NOJ8

Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> Move the core functionality of fcntl_setlease() into a new function,
> do_setlease(). do_setlease() is same as fcntl_setlease() except that
> it takes an extra 'rem_lease' parameter. do_setlease() will be used
> in a follow-on patch to checkpoint/restart file-leases.
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> ---
>  fs/locks.c         |   27 ++++++++++++++++-----------
>  include/linux/fs.h |    1 +
>  2 files changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index c62ab7f..4107295 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1471,17 +1471,7 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
>  }
>  EXPORT_SYMBOL_GPL(vfs_setlease);
> 
> -/**
> - *	fcntl_setlease	-	sets a lease on an open file
> - *	@fd: open file descriptor
> - *	@filp: file pointer
> - *	@arg: type of lease to obtain
> - *
> - *	Call this fcntl to establish a lease on the file.
> - *	Note that you also need to call %F_SETSIG to
> - *	receive a signal when the lease is broken.
> - */
> -int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
> +int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease)
>  {

Note that rem_lease arg here is int, but the value you're checkpointing
and passing in is unsigned long.  So userspace on 64-bit could easily
overflow that, though not sure to what end.  Also putting 'unsigned long'
in the checkpoint_hdr.h is not in keeping with the rest of that file.

thanks,
-serge

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 1/3][cr][v2]: Define do_setlease()
  2010-05-26  1:07 ` [RFC][PATCH 1/3][cr][v2]: Define do_setlease() Sukadev Bhattiprolu
@ 2010-05-26 13:52   ` Serge E. Hallyn
  2010-05-26 17:14     ` Sukadev Bhattiprolu
       [not found]     ` <20100526135256.GA25799-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
       [not found]   ` <1274836063-13271-2-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2010-05-26 13:52 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Oren Laadan, Matt Helsley, matthew, linux-fsdevel, Containers

Quoting Sukadev Bhattiprolu (sukadev@linux.vnet.ibm.com):
> Move the core functionality of fcntl_setlease() into a new function,
> do_setlease(). do_setlease() is same as fcntl_setlease() except that
> it takes an extra 'rem_lease' parameter. do_setlease() will be used
> in a follow-on patch to checkpoint/restart file-leases.
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> ---
>  fs/locks.c         |   27 ++++++++++++++++-----------
>  include/linux/fs.h |    1 +
>  2 files changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index c62ab7f..4107295 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1471,17 +1471,7 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
>  }
>  EXPORT_SYMBOL_GPL(vfs_setlease);
> 
> -/**
> - *	fcntl_setlease	-	sets a lease on an open file
> - *	@fd: open file descriptor
> - *	@filp: file pointer
> - *	@arg: type of lease to obtain
> - *
> - *	Call this fcntl to establish a lease on the file.
> - *	Note that you also need to call %F_SETSIG to
> - *	receive a signal when the lease is broken.
> - */
> -int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
> +int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease)
>  {

Note that rem_lease arg here is int, but the value you're checkpointing
and passing in is unsigned long.  So userspace on 64-bit could easily
overflow that, though not sure to what end.  Also putting 'unsigned long'
in the checkpoint_hdr.h is not in keeping with the rest of that file.

thanks,
-serge

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 1/3][cr][v2]: Define do_setlease()
       [not found]     ` <20100526135256.GA25799-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-05-26 17:14       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-26 17:14 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Containers, matthew-Ztpu424NOJ8

Serge E. Hallyn [serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
| Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
| > Move the core functionality of fcntl_setlease() into a new function,
| > do_setlease(). do_setlease() is same as fcntl_setlease() except that
| > it takes an extra 'rem_lease' parameter. do_setlease() will be used
| > in a follow-on patch to checkpoint/restart file-leases.
| > 
| > Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
| > ---
| >  fs/locks.c         |   27 ++++++++++++++++-----------
| >  include/linux/fs.h |    1 +
| >  2 files changed, 17 insertions(+), 11 deletions(-)
| > 
| > diff --git a/fs/locks.c b/fs/locks.c
| > index c62ab7f..4107295 100644
| > --- a/fs/locks.c
| > +++ b/fs/locks.c
| > @@ -1471,17 +1471,7 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
| >  }
| >  EXPORT_SYMBOL_GPL(vfs_setlease);
| > 
| > -/**
| > - *	fcntl_setlease	-	sets a lease on an open file
| > - *	@fd: open file descriptor
| > - *	@filp: file pointer
| > - *	@arg: type of lease to obtain
| > - *
| > - *	Call this fcntl to establish a lease on the file.
| > - *	Note that you also need to call %F_SETSIG to
| > - *	receive a signal when the lease is broken.
| > - */
| > -int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
| > +int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease)
| >  {
| 
| Note that rem_lease arg here is int, but the value you're checkpointing
| and passing in is unsigned long.  So userspace on 64-bit could easily
| overflow that, though not sure to what end.  Also putting 'unsigned long'
| in the checkpoint_hdr.h is not in keeping with the rest of that file.

lease_break_time is an int, so I set rem_lease to int. But the unsigned long
in checkpoint_hdr.h is just wrong. I will change it __s32.

Sukadev

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 1/3][cr][v2]: Define do_setlease()
  2010-05-26 13:52   ` Serge E. Hallyn
@ 2010-05-26 17:14     ` Sukadev Bhattiprolu
       [not found]     ` <20100526135256.GA25799-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-26 17:14 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Oren Laadan, Matt Helsley, matthew, linux-fsdevel, Containers

Serge E. Hallyn [serue@us.ibm.com] wrote:
| Quoting Sukadev Bhattiprolu (sukadev@linux.vnet.ibm.com):
| > Move the core functionality of fcntl_setlease() into a new function,
| > do_setlease(). do_setlease() is same as fcntl_setlease() except that
| > it takes an extra 'rem_lease' parameter. do_setlease() will be used
| > in a follow-on patch to checkpoint/restart file-leases.
| > 
| > Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| > ---
| >  fs/locks.c         |   27 ++++++++++++++++-----------
| >  include/linux/fs.h |    1 +
| >  2 files changed, 17 insertions(+), 11 deletions(-)
| > 
| > diff --git a/fs/locks.c b/fs/locks.c
| > index c62ab7f..4107295 100644
| > --- a/fs/locks.c
| > +++ b/fs/locks.c
| > @@ -1471,17 +1471,7 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
| >  }
| >  EXPORT_SYMBOL_GPL(vfs_setlease);
| > 
| > -/**
| > - *	fcntl_setlease	-	sets a lease on an open file
| > - *	@fd: open file descriptor
| > - *	@filp: file pointer
| > - *	@arg: type of lease to obtain
| > - *
| > - *	Call this fcntl to establish a lease on the file.
| > - *	Note that you also need to call %F_SETSIG to
| > - *	receive a signal when the lease is broken.
| > - */
| > -int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
| > +int do_setlease(unsigned int fd, struct file *filp, long arg, int rem_lease)
| >  {
| 
| Note that rem_lease arg here is int, but the value you're checkpointing
| and passing in is unsigned long.  So userspace on 64-bit could easily
| overflow that, though not sure to what end.  Also putting 'unsigned long'
| in the checkpoint_hdr.h is not in keeping with the rest of that file.

lease_break_time is an int, so I set rem_lease to int. But the unsigned long
in checkpoint_hdr.h is just wrong. I will change it __s32.

Sukadev

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
       [not found]   ` <1274836063-13271-3-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-06-15  4:40     ` Oren Laadan
  0 siblings, 0 replies; 39+ messages in thread
From: Oren Laadan @ 2010-06-15  4:40 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Containers, matthew-Ztpu424NOJ8



On 05/25/2010 09:07 PM, Sukadev Bhattiprolu wrote:
> Build upon the C/R of file-locks to C/R file-leases. C/R of a lease that is
> not being broken is almost identical to C/R of file-locks.  i.e save the type
> of lease for the file in the checkpoint image and when restarting, restore
> the lease by calling do_setlease().
> 
> C/R of file-lease gets complicated (I think), if a process is checkpointed
> when its lease was being revoked. i.e if P1 has a F_WRLCK lease on file F1
> and P2 opens F1 for write, P2's open is blocked for lease_break_time (45 secs).
> P1's lease is revoked (i.e set to F_UNLCK) and P1 is notified via a SIGIO to
> flush any dirty data.
> 
> This brings up two issues:
> 
> First, if P1 is checkpointed during this lease_break_time, we need to remember
> to that P1 originally had a F_WRLCK lease which is now revoked to F_UNLCK.
> Checkpointing the "current lease type" would wrongly save the lease-type as
> F_UNLCK.
> 
> Secondly, if P1 was checkpointed 40 seconds into the lease_break_time,(i.e.
> it had 5 seconds remaining in the lease), we want to ensure that after restart,
> P1 gets at least 5 more seconds in the lease (no ?). (i.e P1 could be in the
> its SIGIO handler when it was checkpointed and may be about to start a new
> write(). If P1 does not gets its 5 seconds and P2's open and a read()
> completes, we would have a data corruption).
> 
> This patch addresses the first issue above by adding file_lock->fl_type_prev
> field. When a lease is downgraded/revoked, the original lease type is saved
> in ->fl_type_prev and is also checkpointed. When the process P1 is restarted,
> the kernel temporarily restores the original (F_WRLCK) lease.  When process
> P2 is restarted, the open() would fail with -ERESTARTSYS and the open() would
> be repeated. This open() would initiate the lease-break protocol again on P1.
> 
> To address the second issue above, this patch saves the remaining-lease in
> the checkpoint image, but does not (yet) use this value. The plan is to use
> this remaining-lease period when P1/P2 are restarted so that P2 is blocked
> only for the remaining-lease rather than entire lease_break_time. I want to
> check if there are better ways to address this.

[...]

> @@ -280,8 +280,14 @@ static int checkpoint_one_file_lock(struct ckpt_ctx *ctx, struct file *file,
>  	if (lock) {
>  		h->fl_start = lock->fl_start;
>  		h->fl_end = lock->fl_end;
> +		/* checkpoint F_INPROGRESS if set for now */

Did I miss anything -- what is F_INPROGRESS ?

>  		h->fl_type = lock->fl_type;
> +		h->fl_type_prev = lock->fl_type_prev;
>  		h->fl_flags = lock->fl_flags;
> +		if (h->fl_type & F_INPROGRESS && 
> +					(lock->fl_break_time > jiffies))
> +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;

Hmmm -- we have a ctx->ktime_begin marking the start of the checkpoint.
It is used for relative-time calculations, for example, the expiry of
restart-blocks and timeouts.  I suggest that we use it here too to be
consistent.

> +
>  	} else {
>  		/* Checkpoint a dummy lock as a marker */
>  		h->fl_start = -1;
> @@ -315,7 +321,7 @@ checkpoint_file_locks(struct ckpt_ctx *ctx, struct files_struct *files,
>  			continue;
>  
>  		rc = -EBADF;
> -		if (IS_POSIX(lockp))
> +		if (IS_POSIX(lockp) || IS_LEASE(lockp))
>  			rc = checkpoint_one_file_lock(ctx, file, lockp);
>  
>  		if (rc < 0) {
> @@ -1055,8 +1061,10 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
>  		if (IS_ERR(h))
>  			return PTR_ERR(h);
>  
> -		ckpt_debug("Lock [%lld, %lld, %d, 0x%x]\n", h->fl_start,
> -				h->fl_end, (int)h->fl_type, h->fl_flags);
> +		ckpt_debug("Lock [%lld, %lld, %d, 0x%x], rem-lease %lus, "
> +				"fl-type-prev %d\n", h->fl_start, h->fl_end,
> +				(int)h->fl_type, h->fl_flags, h->fl_rem_lease,
> +				h->fl_type_prev);
>  
>  		/*
>  		 * If it is a dummy-lock, we are done with this fd.
> @@ -1070,6 +1078,17 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
>  		if (h->fl_flags & FL_POSIX)
>  			ret = restore_one_file_lock(ctx, file, fd, h); 
>  
> +		else if (h->fl_flags & FL_LEASE) {
> +			int type;
> +
> +			type = h->fl_type;
> +			if (h->fl_type & F_INPROGRESS)
> +				type = h->fl_type_prev;
> +			ret = do_setlease(fd, file, type, h->fl_rem_lease);

Do we need to sanitize the "type" argument ?
or the h->fl_rem_lease ?
or h->fl_type_prev ?  (which is taken blindly here)

> +			if (ret)
> +				ckpt_err(ctx, ret, "do_setlease(): %d\n", type);
> +		}
> +
>  		if (ret < 0)
>  			ckpt_err(ctx, ret, "%(T) fl_flags 0x%x\n", h->fl_flags);
>  	}
> diff --git a/fs/locks.c b/fs/locks.c
> index 4107295..7a80278 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -184,6 +184,8 @@ void locks_init_lock(struct file_lock *fl)
>  	fl->fl_file = NULL;
>  	fl->fl_flags = 0;
>  	fl->fl_type = 0;
> +	fl->fl_type_prev = 0;
> +	fl->fl_break_time = 0UL;
>  	fl->fl_start = fl->fl_end = 0;
>  	fl->fl_ops = NULL;
>  	fl->fl_lmops = NULL;
> @@ -291,6 +293,13 @@ static int assign_type(struct file_lock *fl, int type)
>  	case F_WRLCK:
>  	case F_UNLCK:
>  		fl->fl_type = type;
> +		/*
> +		 * Clear fl_type_prev since we now have a new lease-type.
> +		 * That way, break_lease() will know to save the new lease-type
> +		 * in case of a checkpoint. (non-lease file-locks don't use
> +		 * ->fl_type_prev).  
> +		 */
> +		fl->fl_type_prev = 0;
>  		break;
>  	default:
>  		return -EINVAL;
> @@ -1211,6 +1220,16 @@ int __break_lease(struct inode *inode, unsigned int mode)
>  		goto out;
>  	}
>  
> +	/*
> +	 * TODO: Checkpoint/restart. Suppose lease_break_time was 45 seonds and
> +	 * 	 we were checkpointed when we had 35 seconds remaining in our
> +	 * 	 lease. When we are restarted, should we get only 35 seconds
> +	 * 	 of the lease and not the full lease_break_time ?
> +	 *
> +	 * 	 We checkpoint ->fl_break_time in the hope that we can use it
> +	 * 	 to calculate the remaining lease, but for now, give the
> +	 * 	 restarted process the full 'lease_break_time'.
> +	 */
>  	break_time = 0;
>  	if (lease_break_time > 0) {
>  		break_time = jiffies + lease_break_time * HZ;

Here, too, use ctx->ktime_begin

> @@ -1220,8 +1239,29 @@ int __break_lease(struct inode *inode, unsigned int mode)
>  
>  	for (fl = flock; fl && IS_LEASE(fl); fl = fl->fl_next) {
>  		if (fl->fl_type != future) {
> +			/*
> +			 * CHECK:
> +			 *
> +			 * If fl_type_prev is already set, we could be in a
> +			 * recursive checkpoint-restart i.e we were checkpointed
> +			 * once when our lease was being broken. We were then
> +			 * restarted from the checkpoint and checkpointed
> +			 * again before the restored lease expired. In this
> +			 * case, we want to restore the lease to the original
> +			 * type. So don't overwrite fl_type_prev if its already
> +			 * set.
> +			 */
> +			if (!fl->fl_type_prev)
> +				fl->fl_type_prev = fl->fl_type;
>  			fl->fl_type = future;
>  			fl->fl_break_time = break_time;
> +
> +			/*
> +			 * TODO: ->fl_break() sends the SIGIO to lease-holder.
> +			 *       If lease-holder was checkpointed/restarted and
> +			 *       this is a restarted lease, we should not
> +			 *       re-send the SIGIO ?
> +			 */
>  			/* lease must have lmops break callback */
>  			fl->fl_lmops->fl_break(fl);
>  		}
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index 4509016..ddad73f 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -588,7 +588,9 @@ struct ckpt_hdr_file_lock {
>         __s64 fl_start;
>         __s64 fl_end;
>         __u8 fl_type;
> +       __u8 fl_type_prev;
>         __u8 fl_flags;
> +       unsigned long fl_rem_lease;

Mis-alignment :(

>  };
>  
>  struct ckpt_hdr_file_pipe {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 700317a..54b2a7b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1066,6 +1066,7 @@ struct file_lock {
>  	fl_owner_t fl_owner;
>  	unsigned char fl_flags;
>  	unsigned char fl_type;
> +	unsigned char fl_type_prev;
>  	unsigned int fl_pid;
>  	struct pid *fl_nspid;
>  	wait_queue_head_t fl_wait;

Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
  2010-05-26  1:07 ` [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases Sukadev Bhattiprolu
@ 2010-06-15  4:40   ` Oren Laadan
       [not found]     ` <4C170430.2030708-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2010-07-30 19:16     ` Sukadev Bhattiprolu
       [not found]   ` <1274836063-13271-3-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Oren Laadan @ 2010-06-15  4:40 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers



On 05/25/2010 09:07 PM, Sukadev Bhattiprolu wrote:
> Build upon the C/R of file-locks to C/R file-leases. C/R of a lease that is
> not being broken is almost identical to C/R of file-locks.  i.e save the type
> of lease for the file in the checkpoint image and when restarting, restore
> the lease by calling do_setlease().
> 
> C/R of file-lease gets complicated (I think), if a process is checkpointed
> when its lease was being revoked. i.e if P1 has a F_WRLCK lease on file F1
> and P2 opens F1 for write, P2's open is blocked for lease_break_time (45 secs).
> P1's lease is revoked (i.e set to F_UNLCK) and P1 is notified via a SIGIO to
> flush any dirty data.
> 
> This brings up two issues:
> 
> First, if P1 is checkpointed during this lease_break_time, we need to remember
> to that P1 originally had a F_WRLCK lease which is now revoked to F_UNLCK.
> Checkpointing the "current lease type" would wrongly save the lease-type as
> F_UNLCK.
> 
> Secondly, if P1 was checkpointed 40 seconds into the lease_break_time,(i.e.
> it had 5 seconds remaining in the lease), we want to ensure that after restart,
> P1 gets at least 5 more seconds in the lease (no ?). (i.e P1 could be in the
> its SIGIO handler when it was checkpointed and may be about to start a new
> write(). If P1 does not gets its 5 seconds and P2's open and a read()
> completes, we would have a data corruption).
> 
> This patch addresses the first issue above by adding file_lock->fl_type_prev
> field. When a lease is downgraded/revoked, the original lease type is saved
> in ->fl_type_prev and is also checkpointed. When the process P1 is restarted,
> the kernel temporarily restores the original (F_WRLCK) lease.  When process
> P2 is restarted, the open() would fail with -ERESTARTSYS and the open() would
> be repeated. This open() would initiate the lease-break protocol again on P1.
> 
> To address the second issue above, this patch saves the remaining-lease in
> the checkpoint image, but does not (yet) use this value. The plan is to use
> this remaining-lease period when P1/P2 are restarted so that P2 is blocked
> only for the remaining-lease rather than entire lease_break_time. I want to
> check if there are better ways to address this.

[...]

> @@ -280,8 +280,14 @@ static int checkpoint_one_file_lock(struct ckpt_ctx *ctx, struct file *file,
>  	if (lock) {
>  		h->fl_start = lock->fl_start;
>  		h->fl_end = lock->fl_end;
> +		/* checkpoint F_INPROGRESS if set for now */

Did I miss anything -- what is F_INPROGRESS ?

>  		h->fl_type = lock->fl_type;
> +		h->fl_type_prev = lock->fl_type_prev;
>  		h->fl_flags = lock->fl_flags;
> +		if (h->fl_type & F_INPROGRESS && 
> +					(lock->fl_break_time > jiffies))
> +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;

Hmmm -- we have a ctx->ktime_begin marking the start of the checkpoint.
It is used for relative-time calculations, for example, the expiry of
restart-blocks and timeouts.  I suggest that we use it here too to be
consistent.

> +
>  	} else {
>  		/* Checkpoint a dummy lock as a marker */
>  		h->fl_start = -1;
> @@ -315,7 +321,7 @@ checkpoint_file_locks(struct ckpt_ctx *ctx, struct files_struct *files,
>  			continue;
>  
>  		rc = -EBADF;
> -		if (IS_POSIX(lockp))
> +		if (IS_POSIX(lockp) || IS_LEASE(lockp))
>  			rc = checkpoint_one_file_lock(ctx, file, lockp);
>  
>  		if (rc < 0) {
> @@ -1055,8 +1061,10 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
>  		if (IS_ERR(h))
>  			return PTR_ERR(h);
>  
> -		ckpt_debug("Lock [%lld, %lld, %d, 0x%x]\n", h->fl_start,
> -				h->fl_end, (int)h->fl_type, h->fl_flags);
> +		ckpt_debug("Lock [%lld, %lld, %d, 0x%x], rem-lease %lus, "
> +				"fl-type-prev %d\n", h->fl_start, h->fl_end,
> +				(int)h->fl_type, h->fl_flags, h->fl_rem_lease,
> +				h->fl_type_prev);
>  
>  		/*
>  		 * If it is a dummy-lock, we are done with this fd.
> @@ -1070,6 +1078,17 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
>  		if (h->fl_flags & FL_POSIX)
>  			ret = restore_one_file_lock(ctx, file, fd, h); 
>  
> +		else if (h->fl_flags & FL_LEASE) {
> +			int type;
> +
> +			type = h->fl_type;
> +			if (h->fl_type & F_INPROGRESS)
> +				type = h->fl_type_prev;
> +			ret = do_setlease(fd, file, type, h->fl_rem_lease);

Do we need to sanitize the "type" argument ?
or the h->fl_rem_lease ?
or h->fl_type_prev ?  (which is taken blindly here)

> +			if (ret)
> +				ckpt_err(ctx, ret, "do_setlease(): %d\n", type);
> +		}
> +
>  		if (ret < 0)
>  			ckpt_err(ctx, ret, "%(T) fl_flags 0x%x\n", h->fl_flags);
>  	}
> diff --git a/fs/locks.c b/fs/locks.c
> index 4107295..7a80278 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -184,6 +184,8 @@ void locks_init_lock(struct file_lock *fl)
>  	fl->fl_file = NULL;
>  	fl->fl_flags = 0;
>  	fl->fl_type = 0;
> +	fl->fl_type_prev = 0;
> +	fl->fl_break_time = 0UL;
>  	fl->fl_start = fl->fl_end = 0;
>  	fl->fl_ops = NULL;
>  	fl->fl_lmops = NULL;
> @@ -291,6 +293,13 @@ static int assign_type(struct file_lock *fl, int type)
>  	case F_WRLCK:
>  	case F_UNLCK:
>  		fl->fl_type = type;
> +		/*
> +		 * Clear fl_type_prev since we now have a new lease-type.
> +		 * That way, break_lease() will know to save the new lease-type
> +		 * in case of a checkpoint. (non-lease file-locks don't use
> +		 * ->fl_type_prev).  
> +		 */
> +		fl->fl_type_prev = 0;
>  		break;
>  	default:
>  		return -EINVAL;
> @@ -1211,6 +1220,16 @@ int __break_lease(struct inode *inode, unsigned int mode)
>  		goto out;
>  	}
>  
> +	/*
> +	 * TODO: Checkpoint/restart. Suppose lease_break_time was 45 seonds and
> +	 * 	 we were checkpointed when we had 35 seconds remaining in our
> +	 * 	 lease. When we are restarted, should we get only 35 seconds
> +	 * 	 of the lease and not the full lease_break_time ?
> +	 *
> +	 * 	 We checkpoint ->fl_break_time in the hope that we can use it
> +	 * 	 to calculate the remaining lease, but for now, give the
> +	 * 	 restarted process the full 'lease_break_time'.
> +	 */
>  	break_time = 0;
>  	if (lease_break_time > 0) {
>  		break_time = jiffies + lease_break_time * HZ;

Here, too, use ctx->ktime_begin

> @@ -1220,8 +1239,29 @@ int __break_lease(struct inode *inode, unsigned int mode)
>  
>  	for (fl = flock; fl && IS_LEASE(fl); fl = fl->fl_next) {
>  		if (fl->fl_type != future) {
> +			/*
> +			 * CHECK:
> +			 *
> +			 * If fl_type_prev is already set, we could be in a
> +			 * recursive checkpoint-restart i.e we were checkpointed
> +			 * once when our lease was being broken. We were then
> +			 * restarted from the checkpoint and checkpointed
> +			 * again before the restored lease expired. In this
> +			 * case, we want to restore the lease to the original
> +			 * type. So don't overwrite fl_type_prev if its already
> +			 * set.
> +			 */
> +			if (!fl->fl_type_prev)
> +				fl->fl_type_prev = fl->fl_type;
>  			fl->fl_type = future;
>  			fl->fl_break_time = break_time;
> +
> +			/*
> +			 * TODO: ->fl_break() sends the SIGIO to lease-holder.
> +			 *       If lease-holder was checkpointed/restarted and
> +			 *       this is a restarted lease, we should not
> +			 *       re-send the SIGIO ?
> +			 */
>  			/* lease must have lmops break callback */
>  			fl->fl_lmops->fl_break(fl);
>  		}
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index 4509016..ddad73f 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -588,7 +588,9 @@ struct ckpt_hdr_file_lock {
>         __s64 fl_start;
>         __s64 fl_end;
>         __u8 fl_type;
> +       __u8 fl_type_prev;
>         __u8 fl_flags;
> +       unsigned long fl_rem_lease;

Mis-alignment :(

>  };
>  
>  struct ckpt_hdr_file_pipe {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 700317a..54b2a7b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1066,6 +1066,7 @@ struct file_lock {
>  	fl_owner_t fl_owner;
>  	unsigned char fl_flags;
>  	unsigned char fl_type;
> +	unsigned char fl_type_prev;
>  	unsigned int fl_pid;
>  	struct pid *fl_nspid;
>  	wait_queue_head_t fl_wait;

Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
       [not found]   ` <1274836063-13271-4-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-06-15  4:43     ` Oren Laadan
  2010-06-16 11:18     ` Jamie Lokier
  2010-06-16 14:52     ` Oren Laadan
  2 siblings, 0 replies; 39+ messages in thread
From: Oren Laadan @ 2010-06-15  4:43 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Containers, matthew-Ztpu424NOJ8



On 05/25/2010 09:07 PM, Sukadev Bhattiprolu wrote:
> If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
> file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
> a SIGIO to cleanup it lease in preparation for P2's open.  If the two
> processes are checkpointed/restarted in this window, we should address
> following two issues:
> 
> 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
> 	  SIGIO before checkpoint, it should not get the SIGIO after restart).
> 
> 	- If R seconds remain in the lease, P2's open should be blocked for
> 	  at least the R seconds, so P1 has the time to clean up its lease.
> 	  The previous patch gives P1 the entire lease_break_time but that
> 	  can leave P2 stalled for 2*lease_break_time.
> 
> To address first, we add a field ->fl_break_notified to "remember" if we
> notified the lease-holder already. We save this field in the checkpoint
> image and when restarting, we notify the lease-holder only if this field
> is not set.
> 
> To address the second issue, we also checkpoint the ->fl_break_time for
> an in-progress lease. When restarting the process, we ensure that the
> lease-holder sleeps only for the remaining-lease rather than the entire
> lease.
> 
> These two fixes sound like an approximation (see comments in do_setlease()
> and __break_lease() below) and are also a bit kludgy (hence a separate patch
> for now).
> 
> Appreciate comments on how we can do this better. Specifically:
> 
> 	- do we even need to try and address the second issue above or
> 	  just let P1 have the entire lease_break_time again ?
> 
> 	- theoretically, the R seconds should start counting after *all*
> 	  processes in the application-process tree have been restarted,
> 	  since P1 waits inside the kernel for a portion of the remaining
> 	  lease - should we then add a delta to R ?

[...]

> @@ -1084,7 +1088,8 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
>  			type = h->fl_type;
>  			if (h->fl_type & F_INPROGRESS)
>  				type = h->fl_type_prev;
> -			ret = do_setlease(fd, file, type, h->fl_rem_lease);
> +			ret = do_setlease(fd, file, type, h->fl_rem_lease,
> +					h->fl_break_notified);

Is h->fl_break_notified sanitized ?

Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
  2010-05-26  1:07 ` [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease Sukadev Bhattiprolu
@ 2010-06-15  4:43   ` Oren Laadan
  2010-06-16 11:18   ` Jamie Lokier
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 39+ messages in thread
From: Oren Laadan @ 2010-06-15  4:43 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers



On 05/25/2010 09:07 PM, Sukadev Bhattiprolu wrote:
> If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
> file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
> a SIGIO to cleanup it lease in preparation for P2's open.  If the two
> processes are checkpointed/restarted in this window, we should address
> following two issues:
> 
> 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
> 	  SIGIO before checkpoint, it should not get the SIGIO after restart).
> 
> 	- If R seconds remain in the lease, P2's open should be blocked for
> 	  at least the R seconds, so P1 has the time to clean up its lease.
> 	  The previous patch gives P1 the entire lease_break_time but that
> 	  can leave P2 stalled for 2*lease_break_time.
> 
> To address first, we add a field ->fl_break_notified to "remember" if we
> notified the lease-holder already. We save this field in the checkpoint
> image and when restarting, we notify the lease-holder only if this field
> is not set.
> 
> To address the second issue, we also checkpoint the ->fl_break_time for
> an in-progress lease. When restarting the process, we ensure that the
> lease-holder sleeps only for the remaining-lease rather than the entire
> lease.
> 
> These two fixes sound like an approximation (see comments in do_setlease()
> and __break_lease() below) and are also a bit kludgy (hence a separate patch
> for now).
> 
> Appreciate comments on how we can do this better. Specifically:
> 
> 	- do we even need to try and address the second issue above or
> 	  just let P1 have the entire lease_break_time again ?
> 
> 	- theoretically, the R seconds should start counting after *all*
> 	  processes in the application-process tree have been restarted,
> 	  since P1 waits inside the kernel for a portion of the remaining
> 	  lease - should we then add a delta to R ?

[...]

> @@ -1084,7 +1088,8 @@ static int restore_file_locks(struct ckpt_ctx *ctx, struct file *file, int fd)
>  			type = h->fl_type;
>  			if (h->fl_type & F_INPROGRESS)
>  				type = h->fl_type_prev;
> -			ret = do_setlease(fd, file, type, h->fl_rem_lease);
> +			ret = do_setlease(fd, file, type, h->fl_rem_lease,
> +					h->fl_break_notified);

Is h->fl_break_notified sanitized ?

Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
       [not found]   ` <1274836063-13271-4-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2010-06-15  4:43     ` Oren Laadan
@ 2010-06-16 11:18     ` Jamie Lokier
  2010-06-16 14:52     ` Oren Laadan
  2 siblings, 0 replies; 39+ messages in thread
From: Jamie Lokier @ 2010-06-16 11:18 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: matthew-Ztpu424NOJ8, Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA

Sukadev Bhattiprolu wrote:
> If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
> file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
> a SIGIO to cleanup it lease in preparation for P2's open.  If the two
> processes are checkpointed/restarted in this window, we should address
> following two issues:
> 
> 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
> 	  SIGIO before checkpoint, it should not get the SIGIO after restart).
> 
> 	- If R seconds remain in the lease, P2's open should be blocked for
> 	  at least the R seconds, so P1 has the time to clean up its lease.
> 	  The previous patch gives P1 the entire lease_break_time but that
> 	  can leave P2 stalled for 2*lease_break_time.
> 
> To address first, we add a field ->fl_break_notified to "remember" if we
> notified the lease-holder already. We save this field in the checkpoint
> image and when restarting, we notify the lease-holder only if this field
> is not set.
> 
> To address the second issue, we also checkpoint the ->fl_break_time for
> an in-progress lease. When restarting the process, we ensure that the
> lease-holder sleeps only for the remaining-lease rather than the entire
> lease.
> 
> These two fixes sound like an approximation (see comments in do_setlease()
> and __break_lease() below) and are also a bit kludgy (hence a separate patch
> for now).
> 
> Appreciate comments on how we can do this better. Specifically:
> 
> 	- do we even need to try and address the second issue above or
> 	  just let P1 have the entire lease_break_time again ?
> 
> 	- theoretically, the R seconds should start counting after *all*
> 	  processes in the application-process tree have been restarted,
> 	  since P1 waits inside the kernel for a portion of the remaining
> 	  lease - should we then add a delta to R ?

For P1 running out of time would be an abnormal error condition which
it might not be prepared to handle gracefully, and it's best if c/r
doesn't add new ones of those, perhaps due to the process tree restart
taking up too much of the time, perhaps exarcerbated by the processes
themselves if they take a burst of CPU reacting to the restart.

Also, P1 userspace may use an algorithm which is dependent on
lease_break_time, so it can do "check for lease break events -> no
events", and subsequently "check the time" is sufficient to confirm
that a particular file has not been opened and its contents can be
assumed unchanged.  Checking the time is faster.

P2 is unlikely to care about the timeout, as long as it doesn't get
permanently stuck.

So I would let P1 have the lease_break_time again, and/or set the
times ticking after the whole process tree is restarted and make sure
to round up any errors in favour of P1.

-- Jamie

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
  2010-05-26  1:07 ` [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease Sukadev Bhattiprolu
  2010-06-15  4:43   ` Oren Laadan
@ 2010-06-16 11:18   ` Jamie Lokier
  2010-06-16 15:08     ` Oren Laadan
                       ` (2 more replies)
       [not found]   ` <1274836063-13271-4-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2010-06-16 14:52   ` Oren Laadan
  3 siblings, 3 replies; 39+ messages in thread
From: Jamie Lokier @ 2010-06-16 11:18 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Oren Laadan, serue, Matt Helsley, matthew, linux-fsdevel, Containers

Sukadev Bhattiprolu wrote:
> If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
> file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
> a SIGIO to cleanup it lease in preparation for P2's open.  If the two
> processes are checkpointed/restarted in this window, we should address
> following two issues:
> 
> 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
> 	  SIGIO before checkpoint, it should not get the SIGIO after restart).
> 
> 	- If R seconds remain in the lease, P2's open should be blocked for
> 	  at least the R seconds, so P1 has the time to clean up its lease.
> 	  The previous patch gives P1 the entire lease_break_time but that
> 	  can leave P2 stalled for 2*lease_break_time.
> 
> To address first, we add a field ->fl_break_notified to "remember" if we
> notified the lease-holder already. We save this field in the checkpoint
> image and when restarting, we notify the lease-holder only if this field
> is not set.
> 
> To address the second issue, we also checkpoint the ->fl_break_time for
> an in-progress lease. When restarting the process, we ensure that the
> lease-holder sleeps only for the remaining-lease rather than the entire
> lease.
> 
> These two fixes sound like an approximation (see comments in do_setlease()
> and __break_lease() below) and are also a bit kludgy (hence a separate patch
> for now).
> 
> Appreciate comments on how we can do this better. Specifically:
> 
> 	- do we even need to try and address the second issue above or
> 	  just let P1 have the entire lease_break_time again ?
> 
> 	- theoretically, the R seconds should start counting after *all*
> 	  processes in the application-process tree have been restarted,
> 	  since P1 waits inside the kernel for a portion of the remaining
> 	  lease - should we then add a delta to R ?

For P1 running out of time would be an abnormal error condition which
it might not be prepared to handle gracefully, and it's best if c/r
doesn't add new ones of those, perhaps due to the process tree restart
taking up too much of the time, perhaps exarcerbated by the processes
themselves if they take a burst of CPU reacting to the restart.

Also, P1 userspace may use an algorithm which is dependent on
lease_break_time, so it can do "check for lease break events -> no
events", and subsequently "check the time" is sufficient to confirm
that a particular file has not been opened and its contents can be
assumed unchanged.  Checking the time is faster.

P2 is unlikely to care about the timeout, as long as it doesn't get
permanently stuck.

So I would let P1 have the lease_break_time again, and/or set the
times ticking after the whole process tree is restarted and make sure
to round up any errors in favour of P1.

-- Jamie

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
       [not found]   ` <1274836063-13271-4-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2010-06-15  4:43     ` Oren Laadan
  2010-06-16 11:18     ` Jamie Lokier
@ 2010-06-16 14:52     ` Oren Laadan
  2 siblings, 0 replies; 39+ messages in thread
From: Oren Laadan @ 2010-06-16 14:52 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Containers, matthew-Ztpu424NOJ8



On 05/25/2010 09:07 PM, Sukadev Bhattiprolu wrote:
> If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
> file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
> a SIGIO to cleanup it lease in preparation for P2's open.  If the two
> processes are checkpointed/restarted in this window, we should address
> following two issues:
> 
> 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
> 	  SIGIO before checkpoint, it should not get the SIGIO after restart).

The qualification "before" is vague in our case - a checkpoint is
potentially a length operation, so before *which part* of the
checkpoint you mean here ?

> 
> 	- If R seconds remain in the lease, P2's open should be blocked for
> 	  at least the R seconds, so P1 has the time to clean up its lease.
> 	  The previous patch gives P1 the entire lease_break_time but that
> 	  can leave P2 stalled for 2*lease_break_time.
> 
> To address first, we add a field ->fl_break_notified to "remember" if we
> notified the lease-holder already. We save this field in the checkpoint
> image and when restarting, we notify the lease-holder only if this field
> is not set.

I'm not sure I understand.

Signals are saved last, in particular they are saved after files, and
file leases. What happens if we at checkpoint, we look at a file lease -
we save the least_break_time, now we proceed with the checkpoint, now
the lease expires before we are done, so we get a signal, and finally
we save the signals. In this case, we get both an expiry time and the
signal recorded.

(Am I mis-reading the code ?)

It seems to me that we need to mark the file lease at checkpoint to
prevent the signal from being sent until _after_ the checkpoint ends
(as opposed to remembering that the signal was sent). And then at the
end of the checkpoint, iterate through the leases for each marked
lease - remove the mark and fire the signal.

Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
  2010-05-26  1:07 ` [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease Sukadev Bhattiprolu
                     ` (2 preceding siblings ...)
       [not found]   ` <1274836063-13271-4-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2010-06-16 14:52   ` Oren Laadan
       [not found]     ` <4C18E534.80700-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2010-06-16 19:57     ` Sukadev Bhattiprolu
  3 siblings, 2 replies; 39+ messages in thread
From: Oren Laadan @ 2010-06-16 14:52 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers



On 05/25/2010 09:07 PM, Sukadev Bhattiprolu wrote:
> If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
> file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
> a SIGIO to cleanup it lease in preparation for P2's open.  If the two
> processes are checkpointed/restarted in this window, we should address
> following two issues:
> 
> 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
> 	  SIGIO before checkpoint, it should not get the SIGIO after restart).

The qualification "before" is vague in our case - a checkpoint is
potentially a length operation, so before *which part* of the
checkpoint you mean here ?

> 
> 	- If R seconds remain in the lease, P2's open should be blocked for
> 	  at least the R seconds, so P1 has the time to clean up its lease.
> 	  The previous patch gives P1 the entire lease_break_time but that
> 	  can leave P2 stalled for 2*lease_break_time.
> 
> To address first, we add a field ->fl_break_notified to "remember" if we
> notified the lease-holder already. We save this field in the checkpoint
> image and when restarting, we notify the lease-holder only if this field
> is not set.

I'm not sure I understand.

Signals are saved last, in particular they are saved after files, and
file leases. What happens if we at checkpoint, we look at a file lease -
we save the least_break_time, now we proceed with the checkpoint, now
the lease expires before we are done, so we get a signal, and finally
we save the signals. In this case, we get both an expiry time and the
signal recorded.

(Am I mis-reading the code ?)

It seems to me that we need to mark the file lease at checkpoint to
prevent the signal from being sent until _after_ the checkpoint ends
(as opposed to remembering that the signal was sent). And then at the
end of the checkpoint, iterate through the leases for each marked
lease - remove the mark and fire the signal.

Oren.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
       [not found]     ` <20100616111843.GA15054-yetKDKU6eevNLxjTenLetw@public.gmane.org>
@ 2010-06-16 15:08       ` Oren Laadan
  2010-06-16 17:46       ` Matt Helsley
  1 sibling, 0 replies; 39+ messages in thread
From: Oren Laadan @ 2010-06-16 15:08 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: matthew-Ztpu424NOJ8, Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Sukadev Bhattiprolu



On 06/16/2010 07:18 AM, Jamie Lokier wrote:
> Sukadev Bhattiprolu wrote:
>> If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
>> file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
>> a SIGIO to cleanup it lease in preparation for P2's open.  If the two
>> processes are checkpointed/restarted in this window, we should address
>> following two issues:
>>
>> 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
>> 	  SIGIO before checkpoint, it should not get the SIGIO after restart).
>>
>> 	- If R seconds remain in the lease, P2's open should be blocked for
>> 	  at least the R seconds, so P1 has the time to clean up its lease.
>> 	  The previous patch gives P1 the entire lease_break_time but that
>> 	  can leave P2 stalled for 2*lease_break_time.
>>
>> To address first, we add a field ->fl_break_notified to "remember" if we
>> notified the lease-holder already. We save this field in the checkpoint
>> image and when restarting, we notify the lease-holder only if this field
>> is not set.
>>
>> To address the second issue, we also checkpoint the ->fl_break_time for
>> an in-progress lease. When restarting the process, we ensure that the
>> lease-holder sleeps only for the remaining-lease rather than the entire
>> lease.
>>
>> These two fixes sound like an approximation (see comments in do_setlease()
>> and __break_lease() below) and are also a bit kludgy (hence a separate patch
>> for now).
>>
>> Appreciate comments on how we can do this better. Specifically:
>>
>> 	- do we even need to try and address the second issue above or
>> 	  just let P1 have the entire lease_break_time again ?
>>
>> 	- theoretically, the R seconds should start counting after *all*
>> 	  processes in the application-process tree have been restarted,
>> 	  since P1 waits inside the kernel for a portion of the remaining
>> 	  lease - should we then add a delta to R ?
> 
> For P1 running out of time would be an abnormal error condition which
> it might not be prepared to handle gracefully, and it's best if c/r
> doesn't add new ones of those, perhaps due to the process tree restart
> taking up too much of the time, perhaps exarcerbated by the processes
> themselves if they take a burst of CPU reacting to the restart.
> 
> Also, P1 userspace may use an algorithm which is dependent on
> lease_break_time, so it can do "check for lease break events -> no
> events", and subsequently "check the time" is sufficient to confirm
> that a particular file has not been opened and its contents can be
> assumed unchanged.  Checking the time is faster.
> 
> P2 is unlikely to care about the timeout, as long as it doesn't get
> permanently stuck.
> 
> So I would let P1 have the lease_break_time again, and/or set the
> times ticking after the whole process tree is restarted and make sure
> to round up any errors in favour of P1.
> 
> -- Jamie
> 

For what it's worth, the restored break-time will be relative to after
all the processes are created (because that's where the kernel part of
the restart begins) - but yes, before they are fully restored.

So the problem that you describe exists, however I'm uncomfortable with
forcing this behavior in the kernel - because they may be other cases
in which userspace expect the lease to end within a certain time, and
such behavior may break it.

An alternative approach would be to modify the value of the lease break
time in _userspace_ - eg. by "massasging" the checkpoint image as we
feed it to the kenrel. This provides the flexibility to choose which
policy to use, by the application user/developer.

For example, we could have 'restart --lease-reset ...." parameter to
induce this behavior, or have it be default, and use 'restart --lease-keep'
to avoid it.

Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
  2010-06-16 11:18   ` Jamie Lokier
@ 2010-06-16 15:08     ` Oren Laadan
       [not found]     ` <20100616111843.GA15054-yetKDKU6eevNLxjTenLetw@public.gmane.org>
  2010-06-16 17:46     ` Matt Helsley
  2 siblings, 0 replies; 39+ messages in thread
From: Oren Laadan @ 2010-06-16 15:08 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Sukadev Bhattiprolu, serue, Matt Helsley, matthew, linux-fsdevel,
	Containers



On 06/16/2010 07:18 AM, Jamie Lokier wrote:
> Sukadev Bhattiprolu wrote:
>> If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
>> file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
>> a SIGIO to cleanup it lease in preparation for P2's open.  If the two
>> processes are checkpointed/restarted in this window, we should address
>> following two issues:
>>
>> 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
>> 	  SIGIO before checkpoint, it should not get the SIGIO after restart).
>>
>> 	- If R seconds remain in the lease, P2's open should be blocked for
>> 	  at least the R seconds, so P1 has the time to clean up its lease.
>> 	  The previous patch gives P1 the entire lease_break_time but that
>> 	  can leave P2 stalled for 2*lease_break_time.
>>
>> To address first, we add a field ->fl_break_notified to "remember" if we
>> notified the lease-holder already. We save this field in the checkpoint
>> image and when restarting, we notify the lease-holder only if this field
>> is not set.
>>
>> To address the second issue, we also checkpoint the ->fl_break_time for
>> an in-progress lease. When restarting the process, we ensure that the
>> lease-holder sleeps only for the remaining-lease rather than the entire
>> lease.
>>
>> These two fixes sound like an approximation (see comments in do_setlease()
>> and __break_lease() below) and are also a bit kludgy (hence a separate patch
>> for now).
>>
>> Appreciate comments on how we can do this better. Specifically:
>>
>> 	- do we even need to try and address the second issue above or
>> 	  just let P1 have the entire lease_break_time again ?
>>
>> 	- theoretically, the R seconds should start counting after *all*
>> 	  processes in the application-process tree have been restarted,
>> 	  since P1 waits inside the kernel for a portion of the remaining
>> 	  lease - should we then add a delta to R ?
> 
> For P1 running out of time would be an abnormal error condition which
> it might not be prepared to handle gracefully, and it's best if c/r
> doesn't add new ones of those, perhaps due to the process tree restart
> taking up too much of the time, perhaps exarcerbated by the processes
> themselves if they take a burst of CPU reacting to the restart.
> 
> Also, P1 userspace may use an algorithm which is dependent on
> lease_break_time, so it can do "check for lease break events -> no
> events", and subsequently "check the time" is sufficient to confirm
> that a particular file has not been opened and its contents can be
> assumed unchanged.  Checking the time is faster.
> 
> P2 is unlikely to care about the timeout, as long as it doesn't get
> permanently stuck.
> 
> So I would let P1 have the lease_break_time again, and/or set the
> times ticking after the whole process tree is restarted and make sure
> to round up any errors in favour of P1.
> 
> -- Jamie
> 

For what it's worth, the restored break-time will be relative to after
all the processes are created (because that's where the kernel part of
the restart begins) - but yes, before they are fully restored.

So the problem that you describe exists, however I'm uncomfortable with
forcing this behavior in the kernel - because they may be other cases
in which userspace expect the lease to end within a certain time, and
such behavior may break it.

An alternative approach would be to modify the value of the lease break
time in _userspace_ - eg. by "massasging" the checkpoint image as we
feed it to the kenrel. This provides the flexibility to choose which
policy to use, by the application user/developer.

For example, we could have 'restart --lease-reset ...." parameter to
induce this behavior, or have it be default, and use 'restart --lease-keep'
to avoid it.

Oren.




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
       [not found]     ` <20100616111843.GA15054-yetKDKU6eevNLxjTenLetw@public.gmane.org>
  2010-06-16 15:08       ` Oren Laadan
@ 2010-06-16 17:46       ` Matt Helsley
  1 sibling, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2010-06-16 17:46 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: matthew-Ztpu424NOJ8, Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Sukadev Bhattiprolu

On Wed, Jun 16, 2010 at 12:18:43PM +0100, Jamie Lokier wrote:
> Sukadev Bhattiprolu wrote:
> > If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
> > file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
> > a SIGIO to cleanup it lease in preparation for P2's open.  If the two
> > processes are checkpointed/restarted in this window, we should address
> > following two issues:
> > 
> > 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
> > 	  SIGIO before checkpoint, it should not get the SIGIO after restart).
> > 
> > 	- If R seconds remain in the lease, P2's open should be blocked for
> > 	  at least the R seconds, so P1 has the time to clean up its lease.
> > 	  The previous patch gives P1 the entire lease_break_time but that
> > 	  can leave P2 stalled for 2*lease_break_time.
> > 
> > To address first, we add a field ->fl_break_notified to "remember" if we
> > notified the lease-holder already. We save this field in the checkpoint
> > image and when restarting, we notify the lease-holder only if this field
> > is not set.
> > 
> > To address the second issue, we also checkpoint the ->fl_break_time for
> > an in-progress lease. When restarting the process, we ensure that the
> > lease-holder sleeps only for the remaining-lease rather than the entire
> > lease.
> > 
> > These two fixes sound like an approximation (see comments in do_setlease()
> > and __break_lease() below) and are also a bit kludgy (hence a separate patch
> > for now).
> > 
> > Appreciate comments on how we can do this better. Specifically:
> > 
> > 	- do we even need to try and address the second issue above or
> > 	  just let P1 have the entire lease_break_time again ?
> > 
> > 	- theoretically, the R seconds should start counting after *all*
> > 	  processes in the application-process tree have been restarted,
> > 	  since P1 waits inside the kernel for a portion of the remaining
> > 	  lease - should we then add a delta to R ?
> 
> For P1 running out of time would be an abnormal error condition which
> it might not be prepared to handle gracefully, and it's best if c/r
> doesn't add new ones of those, perhaps due to the process tree restart
> taking up too much of the time, perhaps exarcerbated by the processes
> themselves if they take a burst of CPU reacting to the restart.

Seems to me like the scheduler could choose poorly under load and the 
task wouldn't get the time it needs to complete with the lease held. So 
isn't it, strictly-speaking, unsafe to assume that the task will get the 
CPU time needed before the lease expires?

Yes, I concede it's *unlikely* this will occur but my thought was tasks 
must still be prepared for that scenario. Otherwise weird bugs could 
appear even without c/r code in the kernel (much less "interrupting" the 
lease).

Perhaps this is one of those scenarios where sticking to the letter of
the "spec" rather than the intent is bad? (quotes since I have only read
the man page -- my brief searches didn't find any official-looking 
specs..)

> Also, P1 userspace may use an algorithm which is dependent on
> lease_break_time, so it can do "check for lease break events -> no
> events", and subsequently "check the time" is sufficient to confirm
> that a particular file has not been opened and its contents can be
> assumed unchanged.  Checking the time is faster.

Do you happen to know if a time namespace for clock_monotonic would fix 
that? Seems like it might since the "real time" clock can go backwards 
-- in other words it'd be unsafe to use for this purpose.

> P2 is unlikely to care about the timeout, as long as it doesn't get
> permanently stuck.
> 
> So I would let P1 have the lease_break_time again, and/or set the
> times ticking after the whole process tree is restarted and make sure
> to round up any errors in favour of P1.

Seems reasonable.

I suspect one reason Suka chose to do it this way is it reuses the lease 
code as close to the syscall layer as possible. That rather than messing 
with re-establishing timers and lease data structures directly. This 
should make the code more maintainable because folks don't need to worry 
much, if at all, about preserving restart of leases when modifying the 
file lease code. I think it's also less code.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
  2010-06-16 11:18   ` Jamie Lokier
  2010-06-16 15:08     ` Oren Laadan
       [not found]     ` <20100616111843.GA15054-yetKDKU6eevNLxjTenLetw@public.gmane.org>
@ 2010-06-16 17:46     ` Matt Helsley
  2 siblings, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2010-06-16 17:46 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Sukadev Bhattiprolu, Oren Laadan, serue, Matt Helsley, matthew,
	linux-fsdevel, Containers

On Wed, Jun 16, 2010 at 12:18:43PM +0100, Jamie Lokier wrote:
> Sukadev Bhattiprolu wrote:
> > If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
> > file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
> > a SIGIO to cleanup it lease in preparation for P2's open.  If the two
> > processes are checkpointed/restarted in this window, we should address
> > following two issues:
> > 
> > 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
> > 	  SIGIO before checkpoint, it should not get the SIGIO after restart).
> > 
> > 	- If R seconds remain in the lease, P2's open should be blocked for
> > 	  at least the R seconds, so P1 has the time to clean up its lease.
> > 	  The previous patch gives P1 the entire lease_break_time but that
> > 	  can leave P2 stalled for 2*lease_break_time.
> > 
> > To address first, we add a field ->fl_break_notified to "remember" if we
> > notified the lease-holder already. We save this field in the checkpoint
> > image and when restarting, we notify the lease-holder only if this field
> > is not set.
> > 
> > To address the second issue, we also checkpoint the ->fl_break_time for
> > an in-progress lease. When restarting the process, we ensure that the
> > lease-holder sleeps only for the remaining-lease rather than the entire
> > lease.
> > 
> > These two fixes sound like an approximation (see comments in do_setlease()
> > and __break_lease() below) and are also a bit kludgy (hence a separate patch
> > for now).
> > 
> > Appreciate comments on how we can do this better. Specifically:
> > 
> > 	- do we even need to try and address the second issue above or
> > 	  just let P1 have the entire lease_break_time again ?
> > 
> > 	- theoretically, the R seconds should start counting after *all*
> > 	  processes in the application-process tree have been restarted,
> > 	  since P1 waits inside the kernel for a portion of the remaining
> > 	  lease - should we then add a delta to R ?
> 
> For P1 running out of time would be an abnormal error condition which
> it might not be prepared to handle gracefully, and it's best if c/r
> doesn't add new ones of those, perhaps due to the process tree restart
> taking up too much of the time, perhaps exarcerbated by the processes
> themselves if they take a burst of CPU reacting to the restart.

Seems to me like the scheduler could choose poorly under load and the 
task wouldn't get the time it needs to complete with the lease held. So 
isn't it, strictly-speaking, unsafe to assume that the task will get the 
CPU time needed before the lease expires?

Yes, I concede it's *unlikely* this will occur but my thought was tasks 
must still be prepared for that scenario. Otherwise weird bugs could 
appear even without c/r code in the kernel (much less "interrupting" the 
lease).

Perhaps this is one of those scenarios where sticking to the letter of
the "spec" rather than the intent is bad? (quotes since I have only read
the man page -- my brief searches didn't find any official-looking 
specs..)

> Also, P1 userspace may use an algorithm which is dependent on
> lease_break_time, so it can do "check for lease break events -> no
> events", and subsequently "check the time" is sufficient to confirm
> that a particular file has not been opened and its contents can be
> assumed unchanged.  Checking the time is faster.

Do you happen to know if a time namespace for clock_monotonic would fix 
that? Seems like it might since the "real time" clock can go backwards 
-- in other words it'd be unsafe to use for this purpose.

> P2 is unlikely to care about the timeout, as long as it doesn't get
> permanently stuck.
> 
> So I would let P1 have the lease_break_time again, and/or set the
> times ticking after the whole process tree is restarted and make sure
> to round up any errors in favour of P1.

Seems reasonable.

I suspect one reason Suka chose to do it this way is it reuses the lease 
code as close to the syscall layer as possible. That rather than messing 
with re-establishing timers and lease data structures directly. This 
should make the code more maintainable because folks don't need to worry 
much, if at all, about preserving restart of leases when modifying the 
file lease code. I think it's also less code.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
       [not found]     ` <4C18E534.80700-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-06-16 19:57       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-06-16 19:57 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Containers, matthew-Ztpu424NOJ8

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| 
| 
| On 05/25/2010 09:07 PM, Sukadev Bhattiprolu wrote:
| > If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
| > file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
| > a SIGIO to cleanup it lease in preparation for P2's open.  If the two
| > processes are checkpointed/restarted in this window, we should address
| > following two issues:
| > 
| > 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
| > 	  SIGIO before checkpoint, it should not get the SIGIO after restart).
| 
| The qualification "before" is vague in our case - a checkpoint is
| potentially a length operation, so before *which part* of the
| checkpoint you mean here ?

I should have been more specific. I meant - if P1 got the SIGIO before
freezing the container for checkpoint, it should not get the SIGIO after
restart/unfreeze. So checkpoint can take minutes, but if P2 is in the
same container and P2 is frozen too.

| > 
| > 	- If R seconds remain in the lease, P2's open should be blocked for
| > 	  at least the R seconds, so P1 has the time to clean up its lease.
| > 	  The previous patch gives P1 the entire lease_break_time but that
| > 	  can leave P2 stalled for 2*lease_break_time.
| > 
| > To address first, we add a field ->fl_break_notified to "remember" if we
| > notified the lease-holder already. We save this field in the checkpoint
| > image and when restarting, we notify the lease-holder only if this field
| > is not set.
| 
| I'm not sure I understand.
| 
| Signals are saved last, in particular they are saved after files, and
| file leases. What happens if we at checkpoint, we look at a file lease -
| we save the least_break_time, now we proceed with the checkpoint, now
| the lease expires before we are done, so we get a signal, and finally
| we save the signals. In this case, we get both an expiry time and the
| signal recorded.
| 
| (Am I mis-reading the code ?)

The signal is sent when P2 opens the file and lease-break is initiated.
No signal is sent when the lease actually expires.

So when P1 and P2 are *both frozen*, then only one of these are true right ?

	- P2 initiated the lease-break and sent the SIGIO or
	- the lease-break was not initiated at all

As youmentioned in your other email, I will look into the ctx->ktime_begin.
| 
| It seems to me that we need to mark the file lease at checkpoint to
| prevent the signal from being sent until _after_ the checkpoint ends
| (as opposed to remembering that the signal was sent). And then at the
| end of the checkpoint, iterate through the leases for each marked
| lease - remove the mark and fire the signal.
| 
| Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.
  2010-06-16 14:52   ` Oren Laadan
       [not found]     ` <4C18E534.80700-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-06-16 19:57     ` Sukadev Bhattiprolu
  1 sibling, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-06-16 19:57 UTC (permalink / raw)
  To: Oren Laadan; +Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers

Oren Laadan [orenl@cs.columbia.edu] wrote:
| 
| 
| On 05/25/2010 09:07 PM, Sukadev Bhattiprolu wrote:
| > If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
| > file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
| > a SIGIO to cleanup it lease in preparation for P2's open.  If the two
| > processes are checkpointed/restarted in this window, we should address
| > following two issues:
| > 
| > 	- P1 should get a SIGIO only once for the lease (i.e if P1 got the
| > 	  SIGIO before checkpoint, it should not get the SIGIO after restart).
| 
| The qualification "before" is vague in our case - a checkpoint is
| potentially a length operation, so before *which part* of the
| checkpoint you mean here ?

I should have been more specific. I meant - if P1 got the SIGIO before
freezing the container for checkpoint, it should not get the SIGIO after
restart/unfreeze. So checkpoint can take minutes, but if P2 is in the
same container and P2 is frozen too.

| > 
| > 	- If R seconds remain in the lease, P2's open should be blocked for
| > 	  at least the R seconds, so P1 has the time to clean up its lease.
| > 	  The previous patch gives P1 the entire lease_break_time but that
| > 	  can leave P2 stalled for 2*lease_break_time.
| > 
| > To address first, we add a field ->fl_break_notified to "remember" if we
| > notified the lease-holder already. We save this field in the checkpoint
| > image and when restarting, we notify the lease-holder only if this field
| > is not set.
| 
| I'm not sure I understand.
| 
| Signals are saved last, in particular they are saved after files, and
| file leases. What happens if we at checkpoint, we look at a file lease -
| we save the least_break_time, now we proceed with the checkpoint, now
| the lease expires before we are done, so we get a signal, and finally
| we save the signals. In this case, we get both an expiry time and the
| signal recorded.
| 
| (Am I mis-reading the code ?)

The signal is sent when P2 opens the file and lease-break is initiated.
No signal is sent when the lease actually expires.

So when P1 and P2 are *both frozen*, then only one of these are true right ?

	- P2 initiated the lease-break and sent the SIGIO or
	- the lease-break was not initiated at all

As youmentioned in your other email, I will look into the ctx->ktime_begin.
| 
| It seems to me that we need to mark the file lease at checkpoint to
| prevent the signal from being sent until _after_ the checkpoint ends
| (as opposed to remembering that the signal was sent). And then at the
| end of the checkpoint, iterate through the leases for each marked
| lease - remove the mark and fire the signal.
| 
| Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
       [not found]     ` <4C170430.2030708-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-07-30 19:16       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-07-30 19:16 UTC (permalink / raw)
  To: Oren Laadan
  Cc: matthew-Ztpu424NOJ8, johnstul-r/Jw6+rmf7HQT0dZR+AlfA, Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| 
| 
| >  		h->fl_type = lock->fl_type;
| > +		h->fl_type_prev = lock->fl_type_prev;
| >  		h->fl_flags = lock->fl_flags;
| > +		if (h->fl_type & F_INPROGRESS && 
| > +					(lock->fl_break_time > jiffies))
| > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
| 
| Hmmm -- we have a ctx->ktime_begin marking the start of the checkpoint.
| It is used for relative-time calculations, for example, the expiry of
| restart-blocks and timeouts.  I suggest that we use it here too to be
| consistent.

Well, I started off using ktime_begin but discussed this with John Stultz
(CC'd here) who pointed out that mixing different domains of time is likely
to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
a relative time. 

I think use of ktime_begin for restart_blocks is fine (since they use
ktime_t) but using ktime_t for file leases and converting between jiffies
and nanoseconds could be a problem, unless we convert fl_break_time to
seconds.

IOW, can we leave the above computation of ->fl_rem_lease for now ?

Sukadev

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
  2010-06-15  4:40   ` Oren Laadan
       [not found]     ` <4C170430.2030708-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-07-30 19:16     ` Sukadev Bhattiprolu
       [not found]       ` <20100730191607.GA16238-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2010-07-30 19:45       ` Oren Laadan
  1 sibling, 2 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-07-30 19:16 UTC (permalink / raw)
  To: Oren Laadan
  Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers, johnstul

Oren Laadan [orenl@cs.columbia.edu] wrote:
| 
| 
| >  		h->fl_type = lock->fl_type;
| > +		h->fl_type_prev = lock->fl_type_prev;
| >  		h->fl_flags = lock->fl_flags;
| > +		if (h->fl_type & F_INPROGRESS && 
| > +					(lock->fl_break_time > jiffies))
| > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
| 
| Hmmm -- we have a ctx->ktime_begin marking the start of the checkpoint.
| It is used for relative-time calculations, for example, the expiry of
| restart-blocks and timeouts.  I suggest that we use it here too to be
| consistent.

Well, I started off using ktime_begin but discussed this with John Stultz
(CC'd here) who pointed out that mixing different domains of time is likely
to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
a relative time. 

I think use of ktime_begin for restart_blocks is fine (since they use
ktime_t) but using ktime_t for file leases and converting between jiffies
and nanoseconds could be a problem, unless we convert fl_break_time to
seconds.

IOW, can we leave the above computation of ->fl_rem_lease for now ?

Sukadev

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
       [not found]       ` <20100730191607.GA16238-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-07-30 19:45         ` Oren Laadan
  0 siblings, 0 replies; 39+ messages in thread
From: Oren Laadan @ 2010-07-30 19:45 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: matthew-Ztpu424NOJ8, johnstul-r/Jw6+rmf7HQT0dZR+AlfA, Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA



Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> | 
> | 
> | >  		h->fl_type = lock->fl_type;
> | > +		h->fl_type_prev = lock->fl_type_prev;
> | >  		h->fl_flags = lock->fl_flags;
> | > +		if (h->fl_type & F_INPROGRESS && 
> | > +					(lock->fl_break_time > jiffies))
> | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
> | 
> | Hmmm -- we have a ctx->ktime_begin marking the start of the checkpoint.
> | It is used for relative-time calculations, for example, the expiry of
> | restart-blocks and timeouts.  I suggest that we use it here too to be
> | consistent.
> 
> Well, I started off using ktime_begin but discussed this with John Stultz
> (CC'd here) who pointed out that mixing different domains of time is likely
> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
> a relative time. 
> 
> I think use of ktime_begin for restart_blocks is fine (since they use
> ktime_t) but using ktime_t for file leases and converting between jiffies
> and nanoseconds could be a problem, unless we convert fl_break_time to
> seconds.
> 
> IOW, can we leave the above computation of ->fl_rem_lease for now ?

The data on restart_blocks keep relative time - it's the the time
to expiry relative to ktime_begin (which is absolute).

ktime_begin merely gives a reference point in time against which
all other time-related values should be saved. The advantage is
that all time computation are relative to the same point in time
at checkpoint/restart - the time when the ktime_begin is set. It's
more consistent this way.

I don't see the problem with jiffies vs ktime - there are functions
to convert between different units (see jiffies.h). Even if you are
concerned about reducing the resolution because of a conversion -
well, it isn't realistic to expect nano-sec resolution after restart...

Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
  2010-07-30 19:16     ` Sukadev Bhattiprolu
       [not found]       ` <20100730191607.GA16238-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-07-30 19:45       ` Oren Laadan
  2010-07-30 21:37         ` john stultz
       [not found]         ` <4C532BC9.6050109-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Oren Laadan @ 2010-07-30 19:45 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: serue, Matt Helsley, matthew, linux-fsdevel, Containers, johnstul



Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl@cs.columbia.edu] wrote:
> | 
> | 
> | >  		h->fl_type = lock->fl_type;
> | > +		h->fl_type_prev = lock->fl_type_prev;
> | >  		h->fl_flags = lock->fl_flags;
> | > +		if (h->fl_type & F_INPROGRESS && 
> | > +					(lock->fl_break_time > jiffies))
> | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
> | 
> | Hmmm -- we have a ctx->ktime_begin marking the start of the checkpoint.
> | It is used for relative-time calculations, for example, the expiry of
> | restart-blocks and timeouts.  I suggest that we use it here too to be
> | consistent.
> 
> Well, I started off using ktime_begin but discussed this with John Stultz
> (CC'd here) who pointed out that mixing different domains of time is likely
> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
> a relative time. 
> 
> I think use of ktime_begin for restart_blocks is fine (since they use
> ktime_t) but using ktime_t for file leases and converting between jiffies
> and nanoseconds could be a problem, unless we convert fl_break_time to
> seconds.
> 
> IOW, can we leave the above computation of ->fl_rem_lease for now ?

The data on restart_blocks keep relative time - it's the the time
to expiry relative to ktime_begin (which is absolute).

ktime_begin merely gives a reference point in time against which
all other time-related values should be saved. The advantage is
that all time computation are relative to the same point in time
at checkpoint/restart - the time when the ktime_begin is set. It's
more consistent this way.

I don't see the problem with jiffies vs ktime - there are functions
to convert between different units (see jiffies.h). Even if you are
concerned about reducing the resolution because of a conversion -
well, it isn't realistic to expect nano-sec resolution after restart...

Oren.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
       [not found]         ` <4C532BC9.6050109-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-07-30 21:37           ` john stultz
  0 siblings, 0 replies; 39+ messages in thread
From: john stultz @ 2010-07-30 21:37 UTC (permalink / raw)
  To: Oren Laadan
  Cc: matthew-Ztpu424NOJ8, Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, Sukadev Bhattiprolu

On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
> 
> Sukadev Bhattiprolu wrote:
> > Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> > | 
> > | 
> > | >  		h->fl_type = lock->fl_type;
> > | > +		h->fl_type_prev = lock->fl_type_prev;
> > | >  		h->fl_flags = lock->fl_flags;
> > | > +		if (h->fl_type & F_INPROGRESS && 
> > | > +					(lock->fl_break_time > jiffies))
> > | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
> > | 
> > | Hmmm -- we have a ctx->ktime_begin marking the start of the checkpoint.
> > | It is used for relative-time calculations, for example, the expiry of
> > | restart-blocks and timeouts.  I suggest that we use it here too to be
> > | consistent.
> > 
> > Well, I started off using ktime_begin but discussed this with John Stultz
> > (CC'd here) who pointed out that mixing different domains of time is likely
> > to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
> > a relative time. 
> > 
> > I think use of ktime_begin for restart_blocks is fine (since they use
> > ktime_t) but using ktime_t for file leases and converting between jiffies
> > and nanoseconds could be a problem, unless we convert fl_break_time to
> > seconds.
> > 
> > IOW, can we leave the above computation of ->fl_rem_lease for now ?
> 
> The data on restart_blocks keep relative time - it's the the time
> to expiry relative to ktime_begin (which is absolute).
> 
> ktime_begin merely gives a reference point in time against which
> all other time-related values should be saved. The advantage is
> that all time computation are relative to the same point in time
> at checkpoint/restart - the time when the ktime_begin is set. It's
> more consistent this way.

First, forgive me for not being very aware of the checkpoint/restart
code. 

So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
time the system booted (more or less). And it represents the checkpoint
time, correct?

Is there a similar checkpointed jiffies value?


> I don't see the problem with jiffies vs ktime - there are functions
> to convert between different units (see jiffies.h). Even if you are
> concerned about reducing the resolution because of a conversion -
> well, it isn't realistic to expect nano-sec resolution after restart...

You're right, there are functions to convert from relative jiffies to
relative nanoseconds, but there really aren't good functions to convert
from absolute jiffies to absolute CLOCK_MONOTONIC time.

One can make a rough approximation, but its not a very precise method
with a number of complications (ie: 32bit jiffies wraps every ~50 days,
the INITIAL_JIFFIES offset has to be remembered, and the larger issue of
the fact that CLOCK_MONOTONIC is NTP corrected, while jiffies may or may
not be). So I'd strongly advise against trying to directly convert abs
jiffies values to abs CLOCK_MONOTONIC time.

thanks
-john

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
  2010-07-30 19:45       ` Oren Laadan
@ 2010-07-30 21:37         ` john stultz
  2010-07-30 22:32           ` Oren Laadan
       [not found]           ` <1280525871.2451.23.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
       [not found]         ` <4C532BC9.6050109-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: john stultz @ 2010-07-30 21:37 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Sukadev Bhattiprolu, serue, Matt Helsley, matthew, linux-fsdevel,
	Containers

On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
> 
> Sukadev Bhattiprolu wrote:
> > Oren Laadan [orenl@cs.columbia.edu] wrote:
> > | 
> > | 
> > | >  		h->fl_type = lock->fl_type;
> > | > +		h->fl_type_prev = lock->fl_type_prev;
> > | >  		h->fl_flags = lock->fl_flags;
> > | > +		if (h->fl_type & F_INPROGRESS && 
> > | > +					(lock->fl_break_time > jiffies))
> > | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
> > | 
> > | Hmmm -- we have a ctx->ktime_begin marking the start of the checkpoint.
> > | It is used for relative-time calculations, for example, the expiry of
> > | restart-blocks and timeouts.  I suggest that we use it here too to be
> > | consistent.
> > 
> > Well, I started off using ktime_begin but discussed this with John Stultz
> > (CC'd here) who pointed out that mixing different domains of time is likely
> > to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
> > a relative time. 
> > 
> > I think use of ktime_begin for restart_blocks is fine (since they use
> > ktime_t) but using ktime_t for file leases and converting between jiffies
> > and nanoseconds could be a problem, unless we convert fl_break_time to
> > seconds.
> > 
> > IOW, can we leave the above computation of ->fl_rem_lease for now ?
> 
> The data on restart_blocks keep relative time - it's the the time
> to expiry relative to ktime_begin (which is absolute).
> 
> ktime_begin merely gives a reference point in time against which
> all other time-related values should be saved. The advantage is
> that all time computation are relative to the same point in time
> at checkpoint/restart - the time when the ktime_begin is set. It's
> more consistent this way.

First, forgive me for not being very aware of the checkpoint/restart
code. 

So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
time the system booted (more or less). And it represents the checkpoint
time, correct?

Is there a similar checkpointed jiffies value?


> I don't see the problem with jiffies vs ktime - there are functions
> to convert between different units (see jiffies.h). Even if you are
> concerned about reducing the resolution because of a conversion -
> well, it isn't realistic to expect nano-sec resolution after restart...

You're right, there are functions to convert from relative jiffies to
relative nanoseconds, but there really aren't good functions to convert
from absolute jiffies to absolute CLOCK_MONOTONIC time.

One can make a rough approximation, but its not a very precise method
with a number of complications (ie: 32bit jiffies wraps every ~50 days,
the INITIAL_JIFFIES offset has to be remembered, and the larger issue of
the fact that CLOCK_MONOTONIC is NTP corrected, while jiffies may or may
not be). So I'd strongly advise against trying to directly convert abs
jiffies values to abs CLOCK_MONOTONIC time.

thanks
-john



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
       [not found]           ` <1280525871.2451.23.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2010-07-30 22:32             ` Oren Laadan
  0 siblings, 0 replies; 39+ messages in thread
From: Oren Laadan @ 2010-07-30 22:32 UTC (permalink / raw)
  To: john stultz
  Cc: matthew-Ztpu424NOJ8, Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Sukadev Bhattiprolu



john stultz wrote:
> On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
>> Sukadev Bhattiprolu wrote:
>>> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
>>> | 
>>> | 
>>> | >  		h->fl_type = lock->fl_type;
>>> | > +		h->fl_type_prev = lock->fl_type_prev;
>>> | >  		h->fl_flags = lock->fl_flags;
>>> | > +		if (h->fl_type & F_INPROGRESS && 
>>> | > +					(lock->fl_break_time > jiffies))
>>> | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
>>> | 
>>> | Hmmm -- we have a ctx->ktime_begin marking the start of the checkpoint.
>>> | It is used for relative-time calculations, for example, the expiry of
>>> | restart-blocks and timeouts.  I suggest that we use it here too to be
>>> | consistent.
>>>
>>> Well, I started off using ktime_begin but discussed this with John Stultz
>>> (CC'd here) who pointed out that mixing different domains of time is likely
>>> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
>>> a relative time. 
>>>
>>> I think use of ktime_begin for restart_blocks is fine (since they use
>>> ktime_t) but using ktime_t for file leases and converting between jiffies
>>> and nanoseconds could be a problem, unless we convert fl_break_time to
>>> seconds.
>>>
>>> IOW, can we leave the above computation of ->fl_rem_lease for now ?
>> The data on restart_blocks keep relative time - it's the the time
>> to expiry relative to ktime_begin (which is absolute).
>>
>> ktime_begin merely gives a reference point in time against which
>> all other time-related values should be saved. The advantage is
>> that all time computation are relative to the same point in time
>> at checkpoint/restart - the time when the ktime_begin is set. It's
>> more consistent this way.
> 
> First, forgive me for not being very aware of the checkpoint/restart
> code. 

On the contrary, forgive me if I'm stating the obvious below ...

> 
> So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
> time the system booted (more or less). And it represents the checkpoint
> time, correct?

As a rule, all time measurements in the checkpoint image are
saved as relative values, using the start-of-checkpoint as the
reference point in time (*).

So at checkpoint, every absolute time value should be converted
to a value relative to the start-of-checkpoint. At restart, every
relative time value from the image is converted back (if needed)
to an absolute time value using a respective start-of-restart.

This makes sense for the most common case, where if a process
had 5 more seconds to sleep at the time of checkpoint, we would
like it to have 5 more seconds to sleep after it restarts.

(For the rare case where you care about the absolute timeout,
userspace can modify the checkpoint image on-the-fly during
restart to adjust the timeout value to be 0, or close enough
to zero that the timeout will happen as soon as the restart
completes).

That said, ktime_begin has two purposes:

1) It stores the start-of-{checkpoint,restart} during checkpoint
and restart respectively, and used for all conversions from
absolute to relative time values.  This is purely internal use
during the checkpoint/restart operations.
It make sense to use a single value, as opposed to compute the
deltas "on the go", because checkpoint/restart can be lengthy
operations (depending on the image size and IO speed). Doing
so against current time, which is a moving target, can yield
inconsistent results.

(*) Looking at the code, I see that the hrtimer expiry times
are actually calculated against current time - I'll fix that...

2) It provides information for the admin/user about what was
the absolute time when the checkpoint operation began (maybe
we should also end ktime_end...). Otherwise, we would have to
wrap each checkpoint image with additional meta-data from
userspace.
(In particular, having that absolute time in the checkpoint
image will allow a userspace filter to modify timeout values
as needed in the rare case mentioned above).

> Is there a similar checkpointed jiffies value?

No. but ...

>> I don't see the problem with jiffies vs ktime - there are functions
>> to convert between different units (see jiffies.h). Even if you are
>> concerned about reducing the resolution because of a conversion -
>> well, it isn't realistic to expect nano-sec resolution after restart...
> 
> You're right, there are functions to convert from relative jiffies to
> relative nanoseconds, but there really aren't good functions to convert
> from absolute jiffies to absolute CLOCK_MONOTONIC time.
> 
> One can make a rough approximation, but its not a very precise method
> with a number of complications (ie: 32bit jiffies wraps every ~50 days,
> the INITIAL_JIFFIES offset has to be remembered, and the larger issue of
> the fact that CLOCK_MONOTONIC is NTP corrected, while jiffies may or may
> not be). So I'd strongly advise against trying to directly convert abs
> jiffies values to abs CLOCK_MONOTONIC time.

... given this, to be able to compute the relative time where
jiffies are involved (e.g. the remaining lease time relative to
ktime_begin) we could also keep a ctx->jiffies_begin, calculate
the delta, and then convert to ktime_t delta.

Suka: if that works for you, can you please add this piece to
your patch ?

Thanks for the feedback !

Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
  2010-07-30 21:37         ` john stultz
@ 2010-07-30 22:32           ` Oren Laadan
  2010-07-31  0:35             ` Sukadev Bhattiprolu
       [not found]             ` <4C5352EF.9080601-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
       [not found]           ` <1280525871.2451.23.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Oren Laadan @ 2010-07-30 22:32 UTC (permalink / raw)
  To: john stultz
  Cc: Sukadev Bhattiprolu, Serge E. Hallyn, Matt Helsley, matthew,
	linux-fsdevel, Containers



john stultz wrote:
> On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
>> Sukadev Bhattiprolu wrote:
>>> Oren Laadan [orenl@cs.columbia.edu] wrote:
>>> | 
>>> | 
>>> | >  		h->fl_type = lock->fl_type;
>>> | > +		h->fl_type_prev = lock->fl_type_prev;
>>> | >  		h->fl_flags = lock->fl_flags;
>>> | > +		if (h->fl_type & F_INPROGRESS && 
>>> | > +					(lock->fl_break_time > jiffies))
>>> | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
>>> | 
>>> | Hmmm -- we have a ctx->ktime_begin marking the start of the checkpoint.
>>> | It is used for relative-time calculations, for example, the expiry of
>>> | restart-blocks and timeouts.  I suggest that we use it here too to be
>>> | consistent.
>>>
>>> Well, I started off using ktime_begin but discussed this with John Stultz
>>> (CC'd here) who pointed out that mixing different domains of time is likely
>>> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
>>> a relative time. 
>>>
>>> I think use of ktime_begin for restart_blocks is fine (since they use
>>> ktime_t) but using ktime_t for file leases and converting between jiffies
>>> and nanoseconds could be a problem, unless we convert fl_break_time to
>>> seconds.
>>>
>>> IOW, can we leave the above computation of ->fl_rem_lease for now ?
>> The data on restart_blocks keep relative time - it's the the time
>> to expiry relative to ktime_begin (which is absolute).
>>
>> ktime_begin merely gives a reference point in time against which
>> all other time-related values should be saved. The advantage is
>> that all time computation are relative to the same point in time
>> at checkpoint/restart - the time when the ktime_begin is set. It's
>> more consistent this way.
> 
> First, forgive me for not being very aware of the checkpoint/restart
> code. 

On the contrary, forgive me if I'm stating the obvious below ...

> 
> So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
> time the system booted (more or less). And it represents the checkpoint
> time, correct?

As a rule, all time measurements in the checkpoint image are
saved as relative values, using the start-of-checkpoint as the
reference point in time (*).

So at checkpoint, every absolute time value should be converted
to a value relative to the start-of-checkpoint. At restart, every
relative time value from the image is converted back (if needed)
to an absolute time value using a respective start-of-restart.

This makes sense for the most common case, where if a process
had 5 more seconds to sleep at the time of checkpoint, we would
like it to have 5 more seconds to sleep after it restarts.

(For the rare case where you care about the absolute timeout,
userspace can modify the checkpoint image on-the-fly during
restart to adjust the timeout value to be 0, or close enough
to zero that the timeout will happen as soon as the restart
completes).

That said, ktime_begin has two purposes:

1) It stores the start-of-{checkpoint,restart} during checkpoint
and restart respectively, and used for all conversions from
absolute to relative time values.  This is purely internal use
during the checkpoint/restart operations.
It make sense to use a single value, as opposed to compute the
deltas "on the go", because checkpoint/restart can be lengthy
operations (depending on the image size and IO speed). Doing
so against current time, which is a moving target, can yield
inconsistent results.

(*) Looking at the code, I see that the hrtimer expiry times
are actually calculated against current time - I'll fix that...

2) It provides information for the admin/user about what was
the absolute time when the checkpoint operation began (maybe
we should also end ktime_end...). Otherwise, we would have to
wrap each checkpoint image with additional meta-data from
userspace.
(In particular, having that absolute time in the checkpoint
image will allow a userspace filter to modify timeout values
as needed in the rare case mentioned above).

> Is there a similar checkpointed jiffies value?

No. but ...

>> I don't see the problem with jiffies vs ktime - there are functions
>> to convert between different units (see jiffies.h). Even if you are
>> concerned about reducing the resolution because of a conversion -
>> well, it isn't realistic to expect nano-sec resolution after restart...
> 
> You're right, there are functions to convert from relative jiffies to
> relative nanoseconds, but there really aren't good functions to convert
> from absolute jiffies to absolute CLOCK_MONOTONIC time.
> 
> One can make a rough approximation, but its not a very precise method
> with a number of complications (ie: 32bit jiffies wraps every ~50 days,
> the INITIAL_JIFFIES offset has to be remembered, and the larger issue of
> the fact that CLOCK_MONOTONIC is NTP corrected, while jiffies may or may
> not be). So I'd strongly advise against trying to directly convert abs
> jiffies values to abs CLOCK_MONOTONIC time.

... given this, to be able to compute the relative time where
jiffies are involved (e.g. the remaining lease time relative to
ktime_begin) we could also keep a ctx->jiffies_begin, calculate
the delta, and then convert to ktime_t delta.

Suka: if that works for you, can you please add this piece to
your patch ?

Thanks for the feedback !

Oren.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
       [not found]             ` <4C5352EF.9080601-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-07-31  0:35               ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-07-31  0:35 UTC (permalink / raw)
  To: Oren Laadan
  Cc: matthew-Ztpu424NOJ8, john stultz, Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
>
>
> john stultz wrote:
>> On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
>>> Sukadev Bhattiprolu wrote:
>>>> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
>>>> | | | >  		h->fl_type = lock->fl_type;
>>>> | > +		h->fl_type_prev = lock->fl_type_prev;
>>>> | >  		h->fl_flags = lock->fl_flags;
>>>> | > +		if (h->fl_type & F_INPROGRESS && | > 
>>>> +					(lock->fl_break_time > jiffies))
>>>> | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
>>>> | | Hmmm -- we have a ctx->ktime_begin marking the start of the 
>>>> checkpoint.
>>>> | It is used for relative-time calculations, for example, the expiry of
>>>> | restart-blocks and timeouts.  I suggest that we use it here too to be
>>>> | consistent.
>>>>
>>>> Well, I started off using ktime_begin but discussed this with John Stultz
>>>> (CC'd here) who pointed out that mixing different domains of time is likely
>>>> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
>>>> a relative time. 
>>>>
>>>> I think use of ktime_begin for restart_blocks is fine (since they use
>>>> ktime_t) but using ktime_t for file leases and converting between jiffies
>>>> and nanoseconds could be a problem, unless we convert fl_break_time to
>>>> seconds.
>>>>
>>>> IOW, can we leave the above computation of ->fl_rem_lease for now ?
>>> The data on restart_blocks keep relative time - it's the the time
>>> to expiry relative to ktime_begin (which is absolute).
>>>
>>> ktime_begin merely gives a reference point in time against which
>>> all other time-related values should be saved. The advantage is
>>> that all time computation are relative to the same point in time
>>> at checkpoint/restart - the time when the ktime_begin is set. It's
>>> more consistent this way.
>>
>> First, forgive me for not being very aware of the checkpoint/restart
>> code. 
>
> On the contrary, forgive me if I'm stating the obvious below ...
>
>>
>> So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
>> time the system booted (more or less). And it represents the checkpoint
>> time, correct?
>
> As a rule, all time measurements in the checkpoint image are
> saved as relative values, using the start-of-checkpoint as the
> reference point in time (*).
>
> So at checkpoint, every absolute time value should be converted
> to a value relative to the start-of-checkpoint. At restart, every
> relative time value from the image is converted back (if needed)
> to an absolute time value using a respective start-of-restart.

One general observation, slightly off-topic. You mention 
"start-of-restart" here and ...
>
> This makes sense for the most common case, where if a process
> had 5 more seconds to sleep at the time of checkpoint, we would
> like it to have 5 more seconds to sleep after it restarts.

... "after it restarts" here. These two can be quite different if,
as you mention below, the C/R is a lengthy operation.

If application had 5 seconds remaining on the lease, and C/R takes
more than 5 seconds, the application would have spent the remaining
lease frozen.  Of course trying to compute the values relative to the
"end-of-restart" is theoretically impossible :-).

Maybe the user can adjust the lease-expiry in the checkpoint-image 
to allow for this as well.

>
> (For the rare case where you care about the absolute timeout,
> userspace can modify the checkpoint image on-the-fly during
> restart to adjust the timeout value to be 0, or close enough
> to zero that the timeout will happen as soon as the restart
> completes).
>
> That said, ktime_begin has two purposes:
>
> 1) It stores the start-of-{checkpoint,restart} during checkpoint
> and restart respectively, and used for all conversions from
> absolute to relative time values.  This is purely internal use
> during the checkpoint/restart operations.
> It make sense to use a single value, as opposed to compute the
> deltas "on the go", because checkpoint/restart can be lengthy
> operations (depending on the image size and IO speed). Doing
> so against current time, which is a moving target, can yield
> inconsistent results.
>
> (*) Looking at the code, I see that the hrtimer expiry times
> are actually calculated against current time - I'll fix that...
>
> 2) It provides information for the admin/user about what was
> the absolute time when the checkpoint operation began (maybe
> we should also end ktime_end...). Otherwise, we would have to
> wrap each checkpoint image with additional meta-data from
> userspace.
> (In particular, having that absolute time in the checkpoint
> image will allow a userspace filter to modify timeout values
> as needed in the rare case mentioned above).
>
>> Is there a similar checkpointed jiffies value?
>
> No. but ...
>
>>> I don't see the problem with jiffies vs ktime - there are functions
>>> to convert between different units (see jiffies.h). Even if you are
>>> concerned about reducing the resolution because of a conversion -
>>> well, it isn't realistic to expect nano-sec resolution after restart...
>>
>> You're right, there are functions to convert from relative jiffies to
>> relative nanoseconds, but there really aren't good functions to convert
>> from absolute jiffies to absolute CLOCK_MONOTONIC time.
>>
>> One can make a rough approximation, but its not a very precise method
>> with a number of complications (ie: 32bit jiffies wraps every ~50 days,
>> the INITIAL_JIFFIES offset has to be remembered, and the larger issue of
>> the fact that CLOCK_MONOTONIC is NTP corrected, while jiffies may or may
>> not be). So I'd strongly advise against trying to directly convert abs
>> jiffies values to abs CLOCK_MONOTONIC time.
>
> ... given this, to be able to compute the relative time where
> jiffies are involved (e.g. the remaining lease time relative to
> ktime_begin) we could also keep a ctx->jiffies_begin, calculate
> the delta, and then convert to ktime_t delta.
>
> Suka: if that works for you, can you please add this piece to
> your patch ?

Yes, sure I can do that.

Thanks John, Oren.

Sukadev

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
  2010-07-30 22:32           ` Oren Laadan
@ 2010-07-31  0:35             ` Sukadev Bhattiprolu
  2010-07-31  1:36               ` Matt Helsley
                                 ` (2 more replies)
       [not found]             ` <4C5352EF.9080601-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  1 sibling, 3 replies; 39+ messages in thread
From: Sukadev Bhattiprolu @ 2010-07-31  0:35 UTC (permalink / raw)
  To: Oren Laadan
  Cc: john stultz, Serge E. Hallyn, Matt Helsley, matthew,
	linux-fsdevel, Containers

Oren Laadan [orenl@cs.columbia.edu] wrote:
>
>
> john stultz wrote:
>> On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
>>> Sukadev Bhattiprolu wrote:
>>>> Oren Laadan [orenl@cs.columbia.edu] wrote:
>>>> | | | >  		h->fl_type = lock->fl_type;
>>>> | > +		h->fl_type_prev = lock->fl_type_prev;
>>>> | >  		h->fl_flags = lock->fl_flags;
>>>> | > +		if (h->fl_type & F_INPROGRESS && | > 
>>>> +					(lock->fl_break_time > jiffies))
>>>> | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
>>>> | | Hmmm -- we have a ctx->ktime_begin marking the start of the 
>>>> checkpoint.
>>>> | It is used for relative-time calculations, for example, the expiry of
>>>> | restart-blocks and timeouts.  I suggest that we use it here too to be
>>>> | consistent.
>>>>
>>>> Well, I started off using ktime_begin but discussed this with John Stultz
>>>> (CC'd here) who pointed out that mixing different domains of time is likely
>>>> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
>>>> a relative time. 
>>>>
>>>> I think use of ktime_begin for restart_blocks is fine (since they use
>>>> ktime_t) but using ktime_t for file leases and converting between jiffies
>>>> and nanoseconds could be a problem, unless we convert fl_break_time to
>>>> seconds.
>>>>
>>>> IOW, can we leave the above computation of ->fl_rem_lease for now ?
>>> The data on restart_blocks keep relative time - it's the the time
>>> to expiry relative to ktime_begin (which is absolute).
>>>
>>> ktime_begin merely gives a reference point in time against which
>>> all other time-related values should be saved. The advantage is
>>> that all time computation are relative to the same point in time
>>> at checkpoint/restart - the time when the ktime_begin is set. It's
>>> more consistent this way.
>>
>> First, forgive me for not being very aware of the checkpoint/restart
>> code. 
>
> On the contrary, forgive me if I'm stating the obvious below ...
>
>>
>> So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
>> time the system booted (more or less). And it represents the checkpoint
>> time, correct?
>
> As a rule, all time measurements in the checkpoint image are
> saved as relative values, using the start-of-checkpoint as the
> reference point in time (*).
>
> So at checkpoint, every absolute time value should be converted
> to a value relative to the start-of-checkpoint. At restart, every
> relative time value from the image is converted back (if needed)
> to an absolute time value using a respective start-of-restart.

One general observation, slightly off-topic. You mention 
"start-of-restart" here and ...
>
> This makes sense for the most common case, where if a process
> had 5 more seconds to sleep at the time of checkpoint, we would
> like it to have 5 more seconds to sleep after it restarts.

... "after it restarts" here. These two can be quite different if,
as you mention below, the C/R is a lengthy operation.

If application had 5 seconds remaining on the lease, and C/R takes
more than 5 seconds, the application would have spent the remaining
lease frozen.  Of course trying to compute the values relative to the
"end-of-restart" is theoretically impossible :-).

Maybe the user can adjust the lease-expiry in the checkpoint-image 
to allow for this as well.

>
> (For the rare case where you care about the absolute timeout,
> userspace can modify the checkpoint image on-the-fly during
> restart to adjust the timeout value to be 0, or close enough
> to zero that the timeout will happen as soon as the restart
> completes).
>
> That said, ktime_begin has two purposes:
>
> 1) It stores the start-of-{checkpoint,restart} during checkpoint
> and restart respectively, and used for all conversions from
> absolute to relative time values.  This is purely internal use
> during the checkpoint/restart operations.
> It make sense to use a single value, as opposed to compute the
> deltas "on the go", because checkpoint/restart can be lengthy
> operations (depending on the image size and IO speed). Doing
> so against current time, which is a moving target, can yield
> inconsistent results.
>
> (*) Looking at the code, I see that the hrtimer expiry times
> are actually calculated against current time - I'll fix that...
>
> 2) It provides information for the admin/user about what was
> the absolute time when the checkpoint operation began (maybe
> we should also end ktime_end...). Otherwise, we would have to
> wrap each checkpoint image with additional meta-data from
> userspace.
> (In particular, having that absolute time in the checkpoint
> image will allow a userspace filter to modify timeout values
> as needed in the rare case mentioned above).
>
>> Is there a similar checkpointed jiffies value?
>
> No. but ...
>
>>> I don't see the problem with jiffies vs ktime - there are functions
>>> to convert between different units (see jiffies.h). Even if you are
>>> concerned about reducing the resolution because of a conversion -
>>> well, it isn't realistic to expect nano-sec resolution after restart...
>>
>> You're right, there are functions to convert from relative jiffies to
>> relative nanoseconds, but there really aren't good functions to convert
>> from absolute jiffies to absolute CLOCK_MONOTONIC time.
>>
>> One can make a rough approximation, but its not a very precise method
>> with a number of complications (ie: 32bit jiffies wraps every ~50 days,
>> the INITIAL_JIFFIES offset has to be remembered, and the larger issue of
>> the fact that CLOCK_MONOTONIC is NTP corrected, while jiffies may or may
>> not be). So I'd strongly advise against trying to directly convert abs
>> jiffies values to abs CLOCK_MONOTONIC time.
>
> ... given this, to be able to compute the relative time where
> jiffies are involved (e.g. the remaining lease time relative to
> ktime_begin) we could also keep a ctx->jiffies_begin, calculate
> the delta, and then convert to ktime_t delta.
>
> Suka: if that works for you, can you please add this piece to
> your patch ?

Yes, sure I can do that.

Thanks John, Oren.

Sukadev

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
       [not found]               ` <20100731003504.GA3544-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-07-31  1:36                 ` Matt Helsley
  2010-07-31  4:52                 ` Oren Laadan
  1 sibling, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2010-07-31  1:36 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: matthew-Ztpu424NOJ8, Containers, john stultz,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Fri, Jul 30, 2010 at 05:35:04PM -0700, Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> >
> >
> > john stultz wrote:
> >> On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
> >>> Sukadev Bhattiprolu wrote:
> >>>> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> >>>> | | | >  		h->fl_type = lock->fl_type;
> >>>> | > +		h->fl_type_prev = lock->fl_type_prev;
> >>>> | >  		h->fl_flags = lock->fl_flags;
> >>>> | > +		if (h->fl_type & F_INPROGRESS && | > 
> >>>> +					(lock->fl_break_time > jiffies))
> >>>> | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
> >>>> | | Hmmm -- we have a ctx->ktime_begin marking the start of the 
> >>>> checkpoint.
> >>>> | It is used for relative-time calculations, for example, the expiry of
> >>>> | restart-blocks and timeouts.  I suggest that we use it here too to be
> >>>> | consistent.
> >>>>
> >>>> Well, I started off using ktime_begin but discussed this with John Stultz
> >>>> (CC'd here) who pointed out that mixing different domains of time is likely
> >>>> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
> >>>> a relative time. 
> >>>>
> >>>> I think use of ktime_begin for restart_blocks is fine (since they use
> >>>> ktime_t) but using ktime_t for file leases and converting between jiffies
> >>>> and nanoseconds could be a problem, unless we convert fl_break_time to
> >>>> seconds.
> >>>>
> >>>> IOW, can we leave the above computation of ->fl_rem_lease for now ?
> >>> The data on restart_blocks keep relative time - it's the the time
> >>> to expiry relative to ktime_begin (which is absolute).
> >>>
> >>> ktime_begin merely gives a reference point in time against which
> >>> all other time-related values should be saved. The advantage is
> >>> that all time computation are relative to the same point in time
> >>> at checkpoint/restart - the time when the ktime_begin is set. It's
> >>> more consistent this way.
> >>
> >> First, forgive me for not being very aware of the checkpoint/restart
> >> code. 
> >
> > On the contrary, forgive me if I'm stating the obvious below ...
> >
> >>
> >> So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
> >> time the system booted (more or less). And it represents the checkpoint
> >> time, correct?
> >
> > As a rule, all time measurements in the checkpoint image are
> > saved as relative values, using the start-of-checkpoint as the
> > reference point in time (*).
> >
> > So at checkpoint, every absolute time value should be converted
> > to a value relative to the start-of-checkpoint. At restart, every
> > relative time value from the image is converted back (if needed)
> > to an absolute time value using a respective start-of-restart.
> 
> One general observation, slightly off-topic. You mention 
> "start-of-restart" here and ...
> >
> > This makes sense for the most common case, where if a process
> > had 5 more seconds to sleep at the time of checkpoint, we would
> > like it to have 5 more seconds to sleep after it restarts.
> 
> ... "after it restarts" here. These two can be quite different if,
> as you mention below, the C/R is a lengthy operation.

Worse! I think the big concern is not the duration of checkpoint but the
amount of time userspace expects to store a checkpoint before
restarting.

It's an arbitrary amount of time. A user could restart 50 days later.
Or 100. etc. Sure, it's _unlikely_, but we (checkpoint/restart implementers)
shouldn't count on seeing only "reasonable" times between completion of
checkpoint and initiation of restart. We need to be robust for all
the times we see.

I think that's why Oren made the choices he did. Making times relative
to ktime_begin certainly helps keep the time values semi-sane for those
crazy arbitrary cases in addition to being nice for the "reasonable"
cases.

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
  2010-07-31  0:35             ` Sukadev Bhattiprolu
@ 2010-07-31  1:36               ` Matt Helsley
       [not found]               ` <20100731003504.GA3544-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2010-07-31  4:52               ` Oren Laadan
  2 siblings, 0 replies; 39+ messages in thread
From: Matt Helsley @ 2010-07-31  1:36 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Oren Laadan, john stultz, Serge E. Hallyn, Matt Helsley, matthew,
	linux-fsdevel, Containers

On Fri, Jul 30, 2010 at 05:35:04PM -0700, Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl@cs.columbia.edu] wrote:
> >
> >
> > john stultz wrote:
> >> On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
> >>> Sukadev Bhattiprolu wrote:
> >>>> Oren Laadan [orenl@cs.columbia.edu] wrote:
> >>>> | | | >  		h->fl_type = lock->fl_type;
> >>>> | > +		h->fl_type_prev = lock->fl_type_prev;
> >>>> | >  		h->fl_flags = lock->fl_flags;
> >>>> | > +		if (h->fl_type & F_INPROGRESS && | > 
> >>>> +					(lock->fl_break_time > jiffies))
> >>>> | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
> >>>> | | Hmmm -- we have a ctx->ktime_begin marking the start of the 
> >>>> checkpoint.
> >>>> | It is used for relative-time calculations, for example, the expiry of
> >>>> | restart-blocks and timeouts.  I suggest that we use it here too to be
> >>>> | consistent.
> >>>>
> >>>> Well, I started off using ktime_begin but discussed this with John Stultz
> >>>> (CC'd here) who pointed out that mixing different domains of time is likely
> >>>> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
> >>>> a relative time. 
> >>>>
> >>>> I think use of ktime_begin for restart_blocks is fine (since they use
> >>>> ktime_t) but using ktime_t for file leases and converting between jiffies
> >>>> and nanoseconds could be a problem, unless we convert fl_break_time to
> >>>> seconds.
> >>>>
> >>>> IOW, can we leave the above computation of ->fl_rem_lease for now ?
> >>> The data on restart_blocks keep relative time - it's the the time
> >>> to expiry relative to ktime_begin (which is absolute).
> >>>
> >>> ktime_begin merely gives a reference point in time against which
> >>> all other time-related values should be saved. The advantage is
> >>> that all time computation are relative to the same point in time
> >>> at checkpoint/restart - the time when the ktime_begin is set. It's
> >>> more consistent this way.
> >>
> >> First, forgive me for not being very aware of the checkpoint/restart
> >> code. 
> >
> > On the contrary, forgive me if I'm stating the obvious below ...
> >
> >>
> >> So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
> >> time the system booted (more or less). And it represents the checkpoint
> >> time, correct?
> >
> > As a rule, all time measurements in the checkpoint image are
> > saved as relative values, using the start-of-checkpoint as the
> > reference point in time (*).
> >
> > So at checkpoint, every absolute time value should be converted
> > to a value relative to the start-of-checkpoint. At restart, every
> > relative time value from the image is converted back (if needed)
> > to an absolute time value using a respective start-of-restart.
> 
> One general observation, slightly off-topic. You mention 
> "start-of-restart" here and ...
> >
> > This makes sense for the most common case, where if a process
> > had 5 more seconds to sleep at the time of checkpoint, we would
> > like it to have 5 more seconds to sleep after it restarts.
> 
> ... "after it restarts" here. These two can be quite different if,
> as you mention below, the C/R is a lengthy operation.

Worse! I think the big concern is not the duration of checkpoint but the
amount of time userspace expects to store a checkpoint before
restarting.

It's an arbitrary amount of time. A user could restart 50 days later.
Or 100. etc. Sure, it's _unlikely_, but we (checkpoint/restart implementers)
shouldn't count on seeing only "reasonable" times between completion of
checkpoint and initiation of restart. We need to be robust for all
the times we see.

I think that's why Oren made the choices he did. Making times relative
to ktime_begin certainly helps keep the time values semi-sane for those
crazy arbitrary cases in addition to being nice for the "reasonable"
cases.

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
       [not found]               ` <20100731003504.GA3544-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2010-07-31  1:36                 ` Matt Helsley
@ 2010-07-31  4:52                 ` Oren Laadan
  1 sibling, 0 replies; 39+ messages in thread
From: Oren Laadan @ 2010-07-31  4:52 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: matthew-Ztpu424NOJ8, john stultz, Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA



On 07/30/2010 08:35 PM, Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
>>
>>
>> john stultz wrote:
>>> On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
>>>> Sukadev Bhattiprolu wrote:
>>>>> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
>>>>> | | | >  		h->fl_type = lock->fl_type;
>>>>> | > +		h->fl_type_prev = lock->fl_type_prev;
>>>>> | >  		h->fl_flags = lock->fl_flags;
>>>>> | > +		if (h->fl_type & F_INPROGRESS && | > 
>>>>> +					(lock->fl_break_time > jiffies))
>>>>> | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
>>>>> | | Hmmm -- we have a ctx->ktime_begin marking the start of the 
>>>>> checkpoint.
>>>>> | It is used for relative-time calculations, for example, the expiry of
>>>>> | restart-blocks and timeouts.  I suggest that we use it here too to be
>>>>> | consistent.
>>>>>
>>>>> Well, I started off using ktime_begin but discussed this with John Stultz
>>>>> (CC'd here) who pointed out that mixing different domains of time is likely
>>>>> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
>>>>> a relative time. 
>>>>>
>>>>> I think use of ktime_begin for restart_blocks is fine (since they use
>>>>> ktime_t) but using ktime_t for file leases and converting between jiffies
>>>>> and nanoseconds could be a problem, unless we convert fl_break_time to
>>>>> seconds.
>>>>>
>>>>> IOW, can we leave the above computation of ->fl_rem_lease for now ?
>>>> The data on restart_blocks keep relative time - it's the the time
>>>> to expiry relative to ktime_begin (which is absolute).
>>>>
>>>> ktime_begin merely gives a reference point in time against which
>>>> all other time-related values should be saved. The advantage is
>>>> that all time computation are relative to the same point in time
>>>> at checkpoint/restart - the time when the ktime_begin is set. It's
>>>> more consistent this way.
>>>
>>> First, forgive me for not being very aware of the checkpoint/restart
>>> code. 
>>
>> On the contrary, forgive me if I'm stating the obvious below ...
>>
>>>
>>> So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
>>> time the system booted (more or less). And it represents the checkpoint
>>> time, correct?
>>
>> As a rule, all time measurements in the checkpoint image are
>> saved as relative values, using the start-of-checkpoint as the
>> reference point in time (*).
>>
>> So at checkpoint, every absolute time value should be converted
>> to a value relative to the start-of-checkpoint. At restart, every
>> relative time value from the image is converted back (if needed)
>> to an absolute time value using a respective start-of-restart.
> 
> One general observation, slightly off-topic. You mention 
> "start-of-restart" here and ...
>>
>> This makes sense for the most common case, where if a process
>> had 5 more seconds to sleep at the time of checkpoint, we would
>> like it to have 5 more seconds to sleep after it restarts.
> 
> ... "after it restarts" here. These two can be quite different if,
> as you mention below, the C/R is a lengthy operation.
> 
> If application had 5 seconds remaining on the lease, and C/R takes
> more than 5 seconds, the application would have spent the remaining
> lease frozen.  Of course trying to compute the values relative to the
> "end-of-restart" is theoretically impossible :-).
> 
> Maybe the user can adjust the lease-expiry in the checkpoint-image 
> to allow for this as well.

Heh ... you cannot pre-adjust it since you don't know how long
restart would take :)

It would certainly be better if we could restore all the expiry
times to be relative to the time when the restart _succeeds_ (and
just before all tasks resume execution), as opposed to when the
restart _begins_. Restart could take long time, and by the time it
concludes leases (and timers) could expire.

However, to implement such behavior, we'll need to keep a list of
expiry-values-to-adjust throughout the restart, then when restart
is about to complete, iterate through the list and adjust all of
them at once relative to the (near) end-time.

While this is possible, I don't think it is worth the additional
complexity in the code. Most restarts will not take that long.
Moreover, it can be viewed as a restart that succeeds quickly
but then the processes are frozen/stopped/descheduled for some
time so leases/timers expire.

The requirement that all expiries will be measured (on checkpoint
and restart) relative to the same point in time is very important
(while the above is nice-to-have).

Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
  2010-07-31  0:35             ` Sukadev Bhattiprolu
  2010-07-31  1:36               ` Matt Helsley
       [not found]               ` <20100731003504.GA3544-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-07-31  4:52               ` Oren Laadan
  2 siblings, 0 replies; 39+ messages in thread
From: Oren Laadan @ 2010-07-31  4:52 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: john stultz, Serge E. Hallyn, Matt Helsley, matthew,
	linux-fsdevel, Containers



On 07/30/2010 08:35 PM, Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl@cs.columbia.edu] wrote:
>>
>>
>> john stultz wrote:
>>> On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
>>>> Sukadev Bhattiprolu wrote:
>>>>> Oren Laadan [orenl@cs.columbia.edu] wrote:
>>>>> | | | >  		h->fl_type = lock->fl_type;
>>>>> | > +		h->fl_type_prev = lock->fl_type_prev;
>>>>> | >  		h->fl_flags = lock->fl_flags;
>>>>> | > +		if (h->fl_type & F_INPROGRESS && | > 
>>>>> +					(lock->fl_break_time > jiffies))
>>>>> | > +			h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
>>>>> | | Hmmm -- we have a ctx->ktime_begin marking the start of the 
>>>>> checkpoint.
>>>>> | It is used for relative-time calculations, for example, the expiry of
>>>>> | restart-blocks and timeouts.  I suggest that we use it here too to be
>>>>> | consistent.
>>>>>
>>>>> Well, I started off using ktime_begin but discussed this with John Stultz
>>>>> (CC'd here) who pointed out that mixing different domains of time is likely
>>>>> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
>>>>> a relative time. 
>>>>>
>>>>> I think use of ktime_begin for restart_blocks is fine (since they use
>>>>> ktime_t) but using ktime_t for file leases and converting between jiffies
>>>>> and nanoseconds could be a problem, unless we convert fl_break_time to
>>>>> seconds.
>>>>>
>>>>> IOW, can we leave the above computation of ->fl_rem_lease for now ?
>>>> The data on restart_blocks keep relative time - it's the the time
>>>> to expiry relative to ktime_begin (which is absolute).
>>>>
>>>> ktime_begin merely gives a reference point in time against which
>>>> all other time-related values should be saved. The advantage is
>>>> that all time computation are relative to the same point in time
>>>> at checkpoint/restart - the time when the ktime_begin is set. It's
>>>> more consistent this way.
>>>
>>> First, forgive me for not being very aware of the checkpoint/restart
>>> code. 
>>
>> On the contrary, forgive me if I'm stating the obvious below ...
>>
>>>
>>> So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
>>> time the system booted (more or less). And it represents the checkpoint
>>> time, correct?
>>
>> As a rule, all time measurements in the checkpoint image are
>> saved as relative values, using the start-of-checkpoint as the
>> reference point in time (*).
>>
>> So at checkpoint, every absolute time value should be converted
>> to a value relative to the start-of-checkpoint. At restart, every
>> relative time value from the image is converted back (if needed)
>> to an absolute time value using a respective start-of-restart.
> 
> One general observation, slightly off-topic. You mention 
> "start-of-restart" here and ...
>>
>> This makes sense for the most common case, where if a process
>> had 5 more seconds to sleep at the time of checkpoint, we would
>> like it to have 5 more seconds to sleep after it restarts.
> 
> ... "after it restarts" here. These two can be quite different if,
> as you mention below, the C/R is a lengthy operation.
> 
> If application had 5 seconds remaining on the lease, and C/R takes
> more than 5 seconds, the application would have spent the remaining
> lease frozen.  Of course trying to compute the values relative to the
> "end-of-restart" is theoretically impossible :-).
> 
> Maybe the user can adjust the lease-expiry in the checkpoint-image 
> to allow for this as well.

Heh ... you cannot pre-adjust it since you don't know how long
restart would take :)

It would certainly be better if we could restore all the expiry
times to be relative to the time when the restart _succeeds_ (and
just before all tasks resume execution), as opposed to when the
restart _begins_. Restart could take long time, and by the time it
concludes leases (and timers) could expire.

However, to implement such behavior, we'll need to keep a list of
expiry-values-to-adjust throughout the restart, then when restart
is about to complete, iterate through the list and adjust all of
them at once relative to the (near) end-time.

While this is possible, I don't think it is worth the additional
complexity in the code. Most restarts will not take that long.
Moreover, it can be viewed as a restart that succeeds quickly
but then the processes are frozen/stopped/descheduled for some
time so leases/timers expire.

The requirement that all expiries will be measured (on checkpoint
and restart) relative to the same point in time is very important
(while the above is nice-to-have).

Oren.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2010-07-31  4:53 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-26  1:07 [RFC][PATCH 0/3][cr][v2]: Checkpoint/restart file leases Sukadev Bhattiprolu
2010-05-26  1:07 ` [RFC][PATCH 1/3][cr][v2]: Define do_setlease() Sukadev Bhattiprolu
2010-05-26 13:52   ` Serge E. Hallyn
2010-05-26 17:14     ` Sukadev Bhattiprolu
     [not found]     ` <20100526135256.GA25799-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-05-26 17:14       ` Sukadev Bhattiprolu
     [not found]   ` <1274836063-13271-2-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-05-26 13:52     ` Serge E. Hallyn
2010-05-26  1:07 ` [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases Sukadev Bhattiprolu
2010-06-15  4:40   ` Oren Laadan
     [not found]     ` <4C170430.2030708-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-07-30 19:16       ` Sukadev Bhattiprolu
2010-07-30 19:16     ` Sukadev Bhattiprolu
     [not found]       ` <20100730191607.GA16238-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-07-30 19:45         ` Oren Laadan
2010-07-30 19:45       ` Oren Laadan
2010-07-30 21:37         ` john stultz
2010-07-30 22:32           ` Oren Laadan
2010-07-31  0:35             ` Sukadev Bhattiprolu
2010-07-31  1:36               ` Matt Helsley
     [not found]               ` <20100731003504.GA3544-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-07-31  1:36                 ` Matt Helsley
2010-07-31  4:52                 ` Oren Laadan
2010-07-31  4:52               ` Oren Laadan
     [not found]             ` <4C5352EF.9080601-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-07-31  0:35               ` Sukadev Bhattiprolu
     [not found]           ` <1280525871.2451.23.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2010-07-30 22:32             ` Oren Laadan
     [not found]         ` <4C532BC9.6050109-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-07-30 21:37           ` john stultz
     [not found]   ` <1274836063-13271-3-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-06-15  4:40     ` Oren Laadan
2010-05-26  1:07 ` [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease Sukadev Bhattiprolu
2010-06-15  4:43   ` Oren Laadan
2010-06-16 11:18   ` Jamie Lokier
2010-06-16 15:08     ` Oren Laadan
     [not found]     ` <20100616111843.GA15054-yetKDKU6eevNLxjTenLetw@public.gmane.org>
2010-06-16 15:08       ` Oren Laadan
2010-06-16 17:46       ` Matt Helsley
2010-06-16 17:46     ` Matt Helsley
     [not found]   ` <1274836063-13271-4-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-06-15  4:43     ` Oren Laadan
2010-06-16 11:18     ` Jamie Lokier
2010-06-16 14:52     ` Oren Laadan
2010-06-16 14:52   ` Oren Laadan
     [not found]     ` <4C18E534.80700-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-06-16 19:57       ` Sukadev Bhattiprolu
2010-06-16 19:57     ` Sukadev Bhattiprolu
     [not found] ` <1274836063-13271-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-05-26  1:07   ` [RFC][PATCH 1/3][cr][v2]: Define do_setlease() Sukadev Bhattiprolu
2010-05-26  1:07   ` [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases Sukadev Bhattiprolu
2010-05-26  1:07   ` [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease Sukadev Bhattiprolu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.