All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH][RFC] freezer: Add CHECKPOINTING state to safeguard container checkpoint
@ 2009-04-28  2:17 Matt Helsley
       [not found] ` <20090505002620.2173735E178-g5auMkH+3blSq9BJjBFyUp/QNRX+jHPU@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Matt Helsley @ 2009-04-28  2:17 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Paul Menage

The CHECKPOINTING state prevents userspace from unfreezing tasks until
sys_checkpoint() is finished. When doing container checkpoint userspace
will do:

echo FROZEN > /cgroups/my_container/freezer.state
...
rc = sys_checkpoint( <pid of container root> );

To ensure a consistent checkpoint image userspace should not be allowed
to thaw the cgroup (echo THAWED > /cgroups/my_container/freezer.state)
during checkpoint.

"CHECKPOINTING" can only be set on a "FROZEN" cgroup using the checkpoint
system call. Once in the "CHECKPOINTING" state, the cgroup may not leave until
the checkpoint system call is finished and ready to return. Then the
freezer state returns to "FROZEN". Writing any new state to freezer.state while
checkpointing will return EBUSY. These semantics ensure that userspace cannot
unfreeze the cgroup midway through the checkpoint system call.

Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Cc: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Cc: Cedric Le Goater <legoater-GANU6spQydw@public.gmane.org>

Notes:
        This is an untested RFC patch for container checkpoint. It's meant to
                work with Oren's patch series.
        As a side-effect this prevents the multiple tasks from entering the
                CHECKPOINTING state simultaneously. All but one will get -EBUSY.
---
 Documentation/cgroups/freezer-subsystem.txt |   10 ++++++
 checkpoint/checkpoint.c                     |    7 ++--
 include/linux/freezer.h                     |    8 +++++
 kernel/cgroup_freezer.c                     |   46 ++++++++++++++++++++++++++-
 4 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt
index 41f37fe..92b68e6 100644
--- a/Documentation/cgroups/freezer-subsystem.txt
+++ b/Documentation/cgroups/freezer-subsystem.txt
@@ -100,3 +100,13 @@ things happens:
 		and returns EINVAL)
 	3) The tasks that blocked the cgroup from entering the "FROZEN"
 		state disappear from the cgroup's set of tasks.
+
+When the cgroup freezer is used to guard container checkpoint operations the
+freezer.state may be "CHECKPOINTING". "CHECKPOINTING" can only be set on a
+"FROZEN" cgroup using the checkpoint system call. Once in the "CHECKPOINTING"
+state, the cgroup may not leave until the checkpoint system call returns the
+freezer state to "FROZEN". Writing any new state to freezer.state while
+checkpointing will return EBUSY. These semantics ensure that userspace cannot
+unfreeze the cgroup midway through the checkpoint system call. Note that,
+unlike "FROZEN" and "FREEZING", there is no corresponding "CHECKPOINTED"
+state.
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 7f5c18c..010743f 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -489,7 +489,7 @@ static int get_container(struct ckpt_ctx *ctx, pid_t pid)
 		return -EBUSY;
 	}
 
-	return 0;
+	return cgroup_freezer_begin_checkpoint(ctx->root_task);
 
  out:
 	if (task)
@@ -531,7 +531,7 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 
 	ret = ctx_checkpoint(ctx, pid);
 	if (ret < 0)
-		goto out;
+		return ret;
 	ret = build_tree(ctx);
 	if (ret < 0)
 		goto out;
@@ -560,6 +560,7 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	/* on success, return (unique) checkpoint identifier */
 	ctx->crid = atomic_inc_return(&ctx_count);
 	ret = ctx->crid;
- out:
+out:
+	cgroup_freezer_end_checkpoint(ctx->root_task);
 	return ret;
 }
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 5a361f8..f1d770b 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -65,8 +65,16 @@ extern void cancel_freezing(struct task_struct *p);
 
 #ifdef CONFIG_CGROUP_FREEZER
 extern int cgroup_frozen(struct task_struct *task);
+extern int cgroup_freezer_begin_checkpoint(struct task_struct *task);
+extern void cgroup_freezer_end_checkpoint(struct task_struct *task);
 #else /* !CONFIG_CGROUP_FREEZER */
 static inline int cgroup_frozen(struct task_struct *task) { return 0; }
+static inline int cgroup_freezer_begin_checkpoint(struct task_struct *task)
+{
+	return -ENOTSUP;
+}
+static inline void cgroup_freezer_end_checkpoint(struct task_struct *task)
+{}
 #endif /* !CONFIG_CGROUP_FREEZER */
 
 /*
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index fb249e2..0443226 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -25,6 +25,7 @@ enum freezer_state {
 	CGROUP_THAWED = 0,
 	CGROUP_FREEZING,
 	CGROUP_FROZEN,
+	CGROUP_CHECKPOINTING,
 };
 
 struct freezer {
@@ -61,6 +62,45 @@ int cgroup_frozen(struct task_struct *task)
 }
 
 /*
+ * cgroup freezer state changes made without the aid of the cgroup filesystem
+ * must go through this function to ensure proper locking is observed.
+ */
+static int freezer_checkpointing(struct task_struct *task,
+				 enum freezer_state next_state)
+{
+	struct freezer *freezer;
+	enum freezer_state state;
+
+	task_lock(task);
+	freezer = task_freezer(task);
+	spin_lock_irq(&freezer->lock);
+	state = freezer->state;
+	if ((state == CGROUP_FROZEN && next_state == CGROUP_CHECKPOINTING) ||
+	    (state == CGROUP_CHECKPOINTING && next_state == CGROUP_FROZEN))
+		freezer->state = next_state;
+	spin_unlock_irq(&freezer->lock);
+	task_unlock(task);
+
+	return state;
+}
+
+int cgroup_freezer_begin_checkpoint(struct task_struct *task)
+{
+	if (freezer_checkpointing(task, CGROUP_CHECKPOINTING) != CGROUP_FROZEN)
+		return -EBUSY;
+	return 0;
+}
+
+void cgroup_freezer_end_checkpoint(struct task_struct *task)
+{
+	/*
+	 * If we weren't in CHECKPOINTING state then userspace could have
+	 * unfrozen a task and given us an inconsistent checkpoint image
+	 */
+	WARN_ON(freezer_checkpointing(task, CGROUP_FROZEN) != CGROUP_CHECKPOINTING);
+}
+
+/*
  * cgroups_write_string() limits the size of freezer state strings to
  * CGROUP_LOCAL_BUFFER_SIZE
  */
@@ -68,6 +108,7 @@ static const char *freezer_state_strs[] = {
 	"THAWED",
 	"FREEZING",
 	"FROZEN",
+	"CHECKPOINTING",
 };
 
 /*
@@ -309,7 +350,10 @@ static int freezer_change_state(struct cgroup *cgroup,
 	freezer = cgroup_freezer(cgroup);
 
 	spin_lock_irq(&freezer->lock);
-
+	if (freezer->state == CGROUP_CHECKPOINTING) {
+		retval = -EBUSY;
+		goto out;
+	}
 	update_freezer_state(cgroup, freezer);
 	if (goal_state == freezer->state)
 		goto out;
-- 
1.5.6.3

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH][RFC] freezer: Add CHECKPOINTING state to safeguard container checkpoint
       [not found] ` <20090505002620.2173735E178-g5auMkH+3blSq9BJjBFyUp/QNRX+jHPU@public.gmane.org>
@ 2009-05-06 21:05   ` Oren Laadan
  2009-05-29 17:34   ` Oren Laadan
  2009-05-29 18:35   ` Oren Laadan
  2 siblings, 0 replies; 6+ messages in thread
From: Oren Laadan @ 2009-05-06 21:05 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Paul Menage

Hi Matt,

Looks good. One small comment inline below.
Thanks,

Oren.


Matt Helsley wrote:
> The CHECKPOINTING state prevents userspace from unfreezing tasks until
> sys_checkpoint() is finished. When doing container checkpoint userspace
> will do:
> 
> echo FROZEN > /cgroups/my_container/freezer.state
> ...
> rc = sys_checkpoint( <pid of container root> );
> 
> To ensure a consistent checkpoint image userspace should not be allowed
> to thaw the cgroup (echo THAWED > /cgroups/my_container/freezer.state)
> during checkpoint.
> 
> "CHECKPOINTING" can only be set on a "FROZEN" cgroup using the checkpoint
> system call. Once in the "CHECKPOINTING" state, the cgroup may not leave until
> the checkpoint system call is finished and ready to return. Then the
> freezer state returns to "FROZEN". Writing any new state to freezer.state while
> checkpointing will return EBUSY. These semantics ensure that userspace cannot
> unfreeze the cgroup midway through the checkpoint system call.
> 
> Signed-off-by: Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> Cc: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Cc: Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Cc: Cedric Le Goater <legoater-GANU6spQydw@public.gmane.org>
> 
> Notes:
>         This is an untested RFC patch for container checkpoint. It's meant to
>                 work with Oren's patch series.
>         As a side-effect this prevents the multiple tasks from entering the
>                 CHECKPOINTING state simultaneously. All but one will get -EBUSY.
> ---
>  Documentation/cgroups/freezer-subsystem.txt |   10 ++++++
>  checkpoint/checkpoint.c                     |    7 ++--
>  include/linux/freezer.h                     |    8 +++++
>  kernel/cgroup_freezer.c                     |   46 ++++++++++++++++++++++++++-
>  4 files changed, 67 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt
> index 41f37fe..92b68e6 100644
> --- a/Documentation/cgroups/freezer-subsystem.txt
> +++ b/Documentation/cgroups/freezer-subsystem.txt
> @@ -100,3 +100,13 @@ things happens:
>  		and returns EINVAL)
>  	3) The tasks that blocked the cgroup from entering the "FROZEN"
>  		state disappear from the cgroup's set of tasks.
> +
> +When the cgroup freezer is used to guard container checkpoint operations the
> +freezer.state may be "CHECKPOINTING". "CHECKPOINTING" can only be set on a
> +"FROZEN" cgroup using the checkpoint system call. Once in the "CHECKPOINTING"
> +state, the cgroup may not leave until the checkpoint system call returns the
> +freezer state to "FROZEN". Writing any new state to freezer.state while
> +checkpointing will return EBUSY. These semantics ensure that userspace cannot
> +unfreeze the cgroup midway through the checkpoint system call. Note that,
> +unlike "FROZEN" and "FREEZING", there is no corresponding "CHECKPOINTED"
> +state.
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> index 7f5c18c..010743f 100644
> --- a/checkpoint/checkpoint.c
> +++ b/checkpoint/checkpoint.c
> @@ -489,7 +489,7 @@ static int get_container(struct ckpt_ctx *ctx, pid_t pid)
>  		return -EBUSY;
>  	}
>  
> -	return 0;
> +	return cgroup_freezer_begin_checkpoint(ctx->root_task);
>  
>   out:
>  	if (task)
> @@ -531,7 +531,7 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
>  
>  	ret = ctx_checkpoint(ctx, pid);
>  	if (ret < 0)
> -		goto out;
> +		return ret;
>  	ret = build_tree(ctx);
>  	if (ret < 0)
>  		goto out;
> @@ -560,6 +560,7 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
>  	/* on success, return (unique) checkpoint identifier */
>  	ctx->crid = atomic_inc_return(&ctx_count);
>  	ret = ctx->crid;
> - out:
> +out:
> +	cgroup_freezer_end_checkpoint(ctx->root_task);
>  	return ret;
>  }

Perhaps place the call to cgroup_freezer_begin_checkpoint() to
inside the do_checkpoint() function ?  It will be easier to
read and understand the code.

> diff --git a/include/linux/freezer.h b/include/linux/freezer.h
> index 5a361f8..f1d770b 100644
> --- a/include/linux/freezer.h
> +++ b/include/linux/freezer.h
> @@ -65,8 +65,16 @@ extern void cancel_freezing(struct task_struct *p);
>  
>  #ifdef CONFIG_CGROUP_FREEZER
>  extern int cgroup_frozen(struct task_struct *task);
> +extern int cgroup_freezer_begin_checkpoint(struct task_struct *task);
> +extern void cgroup_freezer_end_checkpoint(struct task_struct *task);
>  #else /* !CONFIG_CGROUP_FREEZER */
>  static inline int cgroup_frozen(struct task_struct *task) { return 0; }
> +static inline int cgroup_freezer_begin_checkpoint(struct task_struct *task)
> +{
> +	return -ENOTSUP;
> +}
> +static inline void cgroup_freezer_end_checkpoint(struct task_struct *task)
> +{}
>  #endif /* !CONFIG_CGROUP_FREEZER */
>  
>  /*
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index fb249e2..0443226 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -25,6 +25,7 @@ enum freezer_state {
>  	CGROUP_THAWED = 0,
>  	CGROUP_FREEZING,
>  	CGROUP_FROZEN,
> +	CGROUP_CHECKPOINTING,
>  };
>  
>  struct freezer {
> @@ -61,6 +62,45 @@ int cgroup_frozen(struct task_struct *task)
>  }
>  
>  /*
> + * cgroup freezer state changes made without the aid of the cgroup filesystem
> + * must go through this function to ensure proper locking is observed.
> + */
> +static int freezer_checkpointing(struct task_struct *task,
> +				 enum freezer_state next_state)
> +{
> +	struct freezer *freezer;
> +	enum freezer_state state;
> +
> +	task_lock(task);
> +	freezer = task_freezer(task);
> +	spin_lock_irq(&freezer->lock);
> +	state = freezer->state;
> +	if ((state == CGROUP_FROZEN && next_state == CGROUP_CHECKPOINTING) ||
> +	    (state == CGROUP_CHECKPOINTING && next_state == CGROUP_FROZEN))
> +		freezer->state = next_state;
> +	spin_unlock_irq(&freezer->lock);
> +	task_unlock(task);
> +
> +	return state;
> +}
> +
> +int cgroup_freezer_begin_checkpoint(struct task_struct *task)
> +{
> +	if (freezer_checkpointing(task, CGROUP_CHECKPOINTING) != CGROUP_FROZEN)
> +		return -EBUSY;
> +	return 0;
> +}
> +
> +void cgroup_freezer_end_checkpoint(struct task_struct *task)
> +{
> +	/*
> +	 * If we weren't in CHECKPOINTING state then userspace could have
> +	 * unfrozen a task and given us an inconsistent checkpoint image
> +	 */
> +	WARN_ON(freezer_checkpointing(task, CGROUP_FROZEN) != CGROUP_CHECKPOINTING);
> +}
> +
> +/*
>   * cgroups_write_string() limits the size of freezer state strings to
>   * CGROUP_LOCAL_BUFFER_SIZE
>   */
> @@ -68,6 +108,7 @@ static const char *freezer_state_strs[] = {
>  	"THAWED",
>  	"FREEZING",
>  	"FROZEN",
> +	"CHECKPOINTING",
>  };
>  
>  /*
> @@ -309,7 +350,10 @@ static int freezer_change_state(struct cgroup *cgroup,
>  	freezer = cgroup_freezer(cgroup);
>  
>  	spin_lock_irq(&freezer->lock);
> -
> +	if (freezer->state == CGROUP_CHECKPOINTING) {
> +		retval = -EBUSY;
> +		goto out;
> +	}
>  	update_freezer_state(cgroup, freezer);
>  	if (goal_state == freezer->state)
>  		goto out;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH][RFC] freezer: Add CHECKPOINTING state to safeguard container checkpoint
       [not found] ` <20090505002620.2173735E178-g5auMkH+3blSq9BJjBFyUp/QNRX+jHPU@public.gmane.org>
  2009-05-06 21:05   ` Oren Laadan
@ 2009-05-29 17:34   ` Oren Laadan
       [not found]     ` <4A201C91.8060706-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-05-29 18:35   ` Oren Laadan
  2 siblings, 1 reply; 6+ messages in thread
From: Oren Laadan @ 2009-05-29 17:34 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Paul Menage

Hi,

While trying Matt's patch I hit a problem reported lockdep (report
further below).

There is a possible deadlock in the cgroup_freezer. The problem is
a locking order, and actually exists in the code already, and only
exposed by this patch.

From the lock-ordering comment in cgroup_freezer.c:

 * freezer_fork() (preserving fork() performance ...)
 * task->alloc_lock (to get task's cgroup)
 * freezer->lock
 *  sighand->siglock (if the cgroup is freezing)

...

 * freezer_write() (unfreeze):
 * cgroup_mutex
 *  freezer->lock
 *   read_lock css_set_lock (cgroup iterator start)
 *    task->alloc_lock (to prevent races with freeze_task())
 *     sighand->siglock

'task->alloc_lock' and 'freezer->lock' are taken in different order.

Oren.

-------------
kernel:
kernel: =======================================================
kernel: [ INFO: possible circular locking dependency detected ]
kernel: 2.6.30-rc7-orenl #366
kernel: -------------------------------------------------------
kernel: ckpt/2787 is trying to acquire lock:
kernel:  (&freezer->lock){......}, at: [<c0157b35>]
freezer_checkpointing+0x35/0x80
kernel:
kernel: but task is already holding lock:
kernel:  (&p->alloc_lock){+.+...}, at: [<c0157b21>]
freezer_checkpointing+0x21/0x80
kernel:
kernel: which lock already depends on the new lock.
kernel:
kernel:
kernel: the existing dependency chain (in reverse order) is:
kernel:
kernel: -> #2 (&p->alloc_lock){+.+...}:
kernel:        [<c0148f32>] validate_chain+0xa82/0xfc0
kernel:        [<c0149708>] __lock_acquire+0x298/0x9a0
kernel:        [<c0149e6e>] lock_acquire+0x5e/0x80
kernel:        [<c0336633>] _spin_lock+0x33/0x40
kernel:        [<c0155285>] cgroup_iter_start+0xa5/0xe0
kernel:        [<c015781a>] update_freezer_state+0x1a/0x70
kernel:        [<c01578e7>] freezer_write+0x77/0x160
kernel:        [<c0156576>] cgroup_file_write+0x156/0x210
kernel:        [<c0186c56>] vfs_write+0x96/0x130
kernel:        [<c01871bd>] sys_write+0x3d/0x70
kernel:        [<c0102c38>] sysenter_do_call+0x12/0x36
kernel:        [<ffffffff>] 0xffffffff
kernel:
kernel: -> #1 (css_set_lock){++++..}:
kernel:        [<c0148f32>] validate_chain+0xa82/0xfc0
kernel:        [<c0149708>] __lock_acquire+0x298/0x9a0
kernel:        [<c0149e6e>] lock_acquire+0x5e/0x80
kernel:        [<c03366c3>] _write_lock+0x33/0x40
kernel:        [<c015522b>] cgroup_iter_start+0x4b/0xe0
kernel:        [<c015781a>] update_freezer_state+0x1a/0x70
kernel:        [<c01578e7>] freezer_write+0x77/0x160
kernel:        [<c0156576>] cgroup_file_write+0x156/0x210
kernel:        [<c0186c56>] vfs_write+0x96/0x130
kernel:        [<c01871bd>] sys_write+0x3d/0x70
kernel:        [<c0102c38>] sysenter_do_call+0x12/0x36
kernel:        [<ffffffff>] 0xffffffff
kernel:
kernel: -> #0 (&freezer->lock){......}:
kernel:        [<c0148a21>] validate_chain+0x571/0xfc0
kernel:        [<c0149708>] __lock_acquire+0x298/0x9a0
kernel:        [<c0149e6e>] lock_acquire+0x5e/0x80
kernel:        [<c0336939>] _spin_lock_irq+0x39/0x50
kernel:        [<c0157b35>] freezer_checkpointing+0x35/0x80
kernel:        [<c0157bbd>] cgroup_freezer_begin_checkpoint+0xd/0x30
kernel:        [<c02185c6>] do_checkpoint+0xf6/0x6a0
kernel:        [<c02172a6>] sys_checkpoint+0x46/0x90
kernel:        [<c0102c38>] sysenter_do_call+0x12/0x36
kernel:        [<ffffffff>] 0xffffffff
kernel:

Matt Helsley wrote:
> The CHECKPOINTING state prevents userspace from unfreezing tasks until
> sys_checkpoint() is finished. When doing container checkpoint userspace
> will do:
> 
> echo FROZEN > /cgroups/my_container/freezer.state
> ...
> rc = sys_checkpoint( <pid of container root> );
> 
> To ensure a consistent checkpoint image userspace should not be allowed
> to thaw the cgroup (echo THAWED > /cgroups/my_container/freezer.state)
> during checkpoint.
> 

[...]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH][RFC] freezer: Add CHECKPOINTING state to safeguard container checkpoint
       [not found] ` <20090505002620.2173735E178-g5auMkH+3blSq9BJjBFyUp/QNRX+jHPU@public.gmane.org>
  2009-05-06 21:05   ` Oren Laadan
  2009-05-29 17:34   ` Oren Laadan
@ 2009-05-29 18:35   ` Oren Laadan
       [not found]     ` <4A202AE9.1090801-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2 siblings, 1 reply; 6+ messages in thread
From: Oren Laadan @ 2009-05-29 18:35 UTC (permalink / raw)
  To: Matt Helsley, Cedric Le Goater
  Cc: Rafael J. Wysocki, Paul Menage,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Pavel Machek


Matt Helsley wrote:
> The CHECKPOINTING state prevents userspace from unfreezing tasks until
> sys_checkpoint() is finished. When doing container checkpoint userspace
> will do:

[...]

>  /*
> + * cgroup freezer state changes made without the aid of the cgroup filesystem
> + * must go through this function to ensure proper locking is observed.
> + */
> +static int freezer_checkpointing(struct task_struct *task,
> +				 enum freezer_state next_state)
> +{
> +	struct freezer *freezer;
> +	enum freezer_state state;
> +
> +	task_lock(task);
> +	freezer = task_freezer(task);
> +	spin_lock_irq(&freezer->lock);
> +	state = freezer->state;

After "echo FROZEN > /freezer/0/freezer_state", it is likely that
freezer->state remains CGROUP_FREEZING --

Then, looking at freezer_read():

...
  if (state == CGROUP_FREEZING) {
	/* We change from FREEZING to FROZEN lazily if the cgroup was
	 * only partially frozen when we exitted write. */
	update_freezer_state(cgroup, freezer);
	state = freezer->state;
  }
...

IOW, usually, only if someone reads the freezer_state file will
the freezer->status reliably reflect the state.

So, I searched for where else freezer->state is consulted, and found:

  int cgroup_frozen(struct task_struct *task)
  {
	struct freezer *freezer;
	enum freezer_state state;

	task_lock(task);
	freezer = task_freezer(task);
	state = freezer->state;
	task_unlock(task);

	return state == CGROUP_FROZEN;
  }

Which is called (only?) from kernel/power/process.c:thaw_tasks().
Here, too, unless some read from the freezer_state file, this
test will incorrectly fail.

Perhaps introduce a freezer_unlazy_state(freezer) helper, or a
a freezer_get_state(freezer) helper to use in those places ?

> +	if ((state == CGROUP_FROZEN && next_state == CGROUP_CHECKPOINTING) ||
> +	    (state == CGROUP_CHECKPOINTING && next_state == CGROUP_FROZEN))
> +		freezer->state = next_state;
> +	spin_unlock_irq(&freezer->lock);
> +	task_unlock(task);
> +
> +	return state;
> +}

[...]

Oren.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH][RFC] freezer: Add CHECKPOINTING state to safeguard container checkpoint
       [not found]     ` <4A201C91.8060706-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-05-29 22:25       ` Matt Helsley
  0 siblings, 0 replies; 6+ messages in thread
From: Matt Helsley @ 2009-05-29 22:25 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Paul Menage

On Fri, May 29, 2009 at 01:34:09PM -0400, Oren Laadan wrote:
> Hi,
> 
> While trying Matt's patch I hit a problem reported lockdep (report
> further below).
> 
> There is a possible deadlock in the cgroup_freezer. The problem is
> a locking order, and actually exists in the code already, and only
> exposed by this patch.
> 
> From the lock-ordering comment in cgroup_freezer.c:
> 
>  * freezer_fork() (preserving fork() performance ...)
>  * task->alloc_lock (to get task's cgroup)
>  * freezer->lock
>  *  sighand->siglock (if the cgroup is freezing)
> 
> ...
> 
>  * freezer_write() (unfreeze):
>  * cgroup_mutex
>  *  freezer->lock
>  *   read_lock css_set_lock (cgroup iterator start)
>  *    task->alloc_lock (to prevent races with freeze_task())
>  *     sighand->siglock
> 
> 'task->alloc_lock' and 'freezer->lock' are taken in different order.

I suspect the new ordering is in the CHECKPOINTING patch and the comment
above is stale. I'll check to confirm it and send a patch to mainline if
the comment is stale.

> Oren.
> 
> -------------
> kernel:
> kernel: =======================================================
> kernel: [ INFO: possible circular locking dependency detected ]
> kernel: 2.6.30-rc7-orenl #366
> kernel: -------------------------------------------------------
> kernel: ckpt/2787 is trying to acquire lock:
> kernel:  (&freezer->lock){......}, at: [<c0157b35>]
> freezer_checkpointing+0x35/0x80
> kernel:
> kernel: but task is already holding lock:
> kernel:  (&p->alloc_lock){+.+...}, at: [<c0157b21>]
> freezer_checkpointing+0x21/0x80
> kernel:
> kernel: which lock already depends on the new lock.
> kernel:
> kernel:
> kernel: the existing dependency chain (in reverse order) is:
> kernel:
> kernel: -> #2 (&p->alloc_lock){+.+...}:
> kernel:        [<c0148f32>] validate_chain+0xa82/0xfc0
> kernel:        [<c0149708>] __lock_acquire+0x298/0x9a0
> kernel:        [<c0149e6e>] lock_acquire+0x5e/0x80
> kernel:        [<c0336633>] _spin_lock+0x33/0x40
> kernel:        [<c0155285>] cgroup_iter_start+0xa5/0xe0
> kernel:        [<c015781a>] update_freezer_state+0x1a/0x70
> kernel:        [<c01578e7>] freezer_write+0x77/0x160
> kernel:        [<c0156576>] cgroup_file_write+0x156/0x210
> kernel:        [<c0186c56>] vfs_write+0x96/0x130
> kernel:        [<c01871bd>] sys_write+0x3d/0x70
> kernel:        [<c0102c38>] sysenter_do_call+0x12/0x36
> kernel:        [<ffffffff>] 0xffffffff
> kernel:
> kernel: -> #1 (css_set_lock){++++..}:
> kernel:        [<c0148f32>] validate_chain+0xa82/0xfc0
> kernel:        [<c0149708>] __lock_acquire+0x298/0x9a0
> kernel:        [<c0149e6e>] lock_acquire+0x5e/0x80
> kernel:        [<c03366c3>] _write_lock+0x33/0x40
> kernel:        [<c015522b>] cgroup_iter_start+0x4b/0xe0
> kernel:        [<c015781a>] update_freezer_state+0x1a/0x70
> kernel:        [<c01578e7>] freezer_write+0x77/0x160
> kernel:        [<c0156576>] cgroup_file_write+0x156/0x210
> kernel:        [<c0186c56>] vfs_write+0x96/0x130
> kernel:        [<c01871bd>] sys_write+0x3d/0x70
> kernel:        [<c0102c38>] sysenter_do_call+0x12/0x36
> kernel:        [<ffffffff>] 0xffffffff
> kernel:
> kernel: -> #0 (&freezer->lock){......}:
> kernel:        [<c0148a21>] validate_chain+0x571/0xfc0
> kernel:        [<c0149708>] __lock_acquire+0x298/0x9a0
> kernel:        [<c0149e6e>] lock_acquire+0x5e/0x80
> kernel:        [<c0336939>] _spin_lock_irq+0x39/0x50
> kernel:        [<c0157b35>] freezer_checkpointing+0x35/0x80
> kernel:        [<c0157bbd>] cgroup_freezer_begin_checkpoint+0xd/0x30
> kernel:        [<c02185c6>] do_checkpoint+0xf6/0x6a0
> kernel:        [<c02172a6>] sys_checkpoint+0x46/0x90
> kernel:        [<c0102c38>] sysenter_do_call+0x12/0x36
> kernel:        [<ffffffff>] 0xffffffff
> kernel:
> 
> Matt Helsley wrote:
> > The CHECKPOINTING state prevents userspace from unfreezing tasks until
> > sys_checkpoint() is finished. When doing container checkpoint userspace
> > will do:
> > 
> > echo FROZEN > /cgroups/my_container/freezer.state
> > ...
> > rc = sys_checkpoint( <pid of container root> );
> > 
> > To ensure a consistent checkpoint image userspace should not be allowed
> > to thaw the cgroup (echo THAWED > /cgroups/my_container/freezer.state)
> > during checkpoint.
> > 
> 
> [...]
> 
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH][RFC] freezer: Add CHECKPOINTING state to safeguard container checkpoint
       [not found]     ` <4A202AE9.1090801-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-05-30  3:04       ` Matt Helsley
  0 siblings, 0 replies; 6+ messages in thread
From: Matt Helsley @ 2009-05-30  3:04 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Rafael J. Wysocki, Pavel Machek, Paul Menage

On Fri, May 29, 2009 at 02:35:21PM -0400, Oren Laadan wrote:
> 
> Matt Helsley wrote:
> > The CHECKPOINTING state prevents userspace from unfreezing tasks until
> > sys_checkpoint() is finished. When doing container checkpoint userspace
> > will do:
> 
> [...]
> 
> >  /*
> > + * cgroup freezer state changes made without the aid of the cgroup filesystem
> > + * must go through this function to ensure proper locking is observed.
> > + */
> > +static int freezer_checkpointing(struct task_struct *task,
> > +				 enum freezer_state next_state)
> > +{
> > +	struct freezer *freezer;
> > +	enum freezer_state state;
> > +
> > +	task_lock(task);
> > +	freezer = task_freezer(task);
> > +	spin_lock_irq(&freezer->lock);
> > +	state = freezer->state;
> 
> After "echo FROZEN > /freezer/0/freezer_state", it is likely that
> freezer->state remains CGROUP_FREEZING --
> 
> Then, looking at freezer_read():
> 
> ...
>   if (state == CGROUP_FREEZING) {
> 	/* We change from FREEZING to FROZEN lazily if the cgroup was
> 	 * only partially frozen when we exitted write. */
> 	update_freezer_state(cgroup, freezer);
> 	state = freezer->state;
>   }
> ...
> 
> IOW, usually, only if someone reads the freezer_state file will
> the freezer->status reliably reflect the state.
> 
> So, I searched for where else freezer->state is consulted, and found:
> 
>   int cgroup_frozen(struct task_struct *task)
>   {
> 	struct freezer *freezer;
> 	enum freezer_state state;
> 
> 	task_lock(task);
> 	freezer = task_freezer(task);
> 	state = freezer->state;
> 	task_unlock(task);
> 
> 	return state == CGROUP_FROZEN;
>   }
> 
> Which is called (only?) from kernel/power/process.c:thaw_tasks().

Currently that is the only place it's called, yes.

> Here, too, unless some read from the freezer_state file, this
> test will incorrectly fail.

True. For kernel/power/process.c:thaw_tasks() we're really trying to
prevent resume from thawing tasks that were frozen via the cgroup
freezer.

So really CGROUP_FREEZING should also be tested for in that code path.
However that doesn't match the name so a name change would be in order
too.


I audited the rest of the cgroup freezer code without the CHECKPOINTING
patch and the remaining tests for CGROUP_FROZEN look fine to me:

can_attach:
	This looks OK. If we're not in the frozen state then it's OK to
	add a new task to the cgroup. So the analogous scenario is we're
	in CGROUP_FREEZING but all of the tasks are frozen. Adding a
	new task -- frozen or otherwise -- would not pose a problem.

BUG_ON() in freezer_fork:
	The forking task isn't frozen and hence its cgroup is not
	can't be FROZEN. But there's a BUG_ON() here "just in case" -- 
	it should never happen. So at worst it won't find bugs here until
	userspace does a read() on freezer.state.

Lastly, this means we'd need a new function for other parts of the
kernel which definitely want to know if the cgroup is frozen. That
function would have to  do a lazy update if the state is
CGROUP_FREEZING.

I'll make a patch which changes cgroup_frozen() and send it off for
mainline.

Then I'll re-add cgroup_frozen() for CHECKPOINTING and ensure that it
works with the lazy update.

Cheers,
	-Matt

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-05-30  3:04 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-28  2:17 [PATCH][RFC] freezer: Add CHECKPOINTING state to safeguard container checkpoint Matt Helsley
     [not found] ` <20090505002620.2173735E178-g5auMkH+3blSq9BJjBFyUp/QNRX+jHPU@public.gmane.org>
2009-05-06 21:05   ` Oren Laadan
2009-05-29 17:34   ` Oren Laadan
     [not found]     ` <4A201C91.8060706-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-29 22:25       ` Matt Helsley
2009-05-29 18:35   ` Oren Laadan
     [not found]     ` <4A202AE9.1090801-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-30  3:04       ` Matt Helsley

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.