All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] kernel: move exit_task_work() past exit_notify()
@ 2013-04-12 19:27 Andrey Vagin
  2013-04-13 14:22 ` Oleg Nesterov
  0 siblings, 1 reply; 9+ messages in thread
From: Andrey Vagin @ 2013-04-12 19:27 UTC (permalink / raw)
  Cc: linux-kernel, Andrey Vagin, Andrew Morton, Eric W. Biederman,
	Al Viro, Oleg Nesterov, David Howells

exit_task_work() must be called after exit_notify, because
exit_task_namespaces() may release a file and fput() enqueues a work.

exit_notify
  exit_task_namespaces
    free_ipc_ns
      shm_destroy
        fput
          task_work_add

so if task works don't run after exit_notify(), a few files may leak.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 kernel/exit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 60bc027..1d1129b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -795,7 +795,6 @@ void do_exit(long code)
 	exit_shm(tsk);
 	exit_files(tsk);
 	exit_fs(tsk);
-	exit_task_work(tsk);
 	check_stack_usage();
 	exit_thread();
 
@@ -822,6 +821,7 @@ void do_exit(long code)
 	ptrace_put_breakpoints(tsk);
 
 	exit_notify(tsk, group_dead);
+	exit_task_work(tsk);
 #ifdef CONFIG_NUMA
 	task_lock(tsk);
 	mpol_put(tsk->mempolicy);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] kernel: move exit_task_work() past exit_notify()
  2013-04-12 19:27 [PATCH] kernel: move exit_task_work() past exit_notify() Andrey Vagin
@ 2013-04-13 14:22 ` Oleg Nesterov
  2013-04-13 15:54   ` [PATCH 0/1] (Was: kernel: move exit_task_work() past exit_notify()) Oleg Nesterov
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2013-04-13 14:22 UTC (permalink / raw)
  To: Andrey Vagin
  Cc: linux-kernel, Andrew Morton, Eric W. Biederman, Al Viro, David Howells

On 04/12, Andrey Vagin wrote:
> exit_task_work() must be called after exit_notify, because
> exit_task_namespaces() may release a file and fput() enqueues a work.
>
> exit_notify
>   exit_task_namespaces
>     free_ipc_ns
>       shm_destroy
>         fput
>           task_work_add
>
> so if task works don't run after exit_notify(), a few files may leak.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: David Howells <dhowells@redhat.com>
> Signed-off-by: Andrey Vagin <avagin@openvz.org>
> ---
>  kernel/exit.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 60bc027..1d1129b 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -795,7 +795,6 @@ void do_exit(long code)
>  	exit_shm(tsk);
>  	exit_files(tsk);
>  	exit_fs(tsk);
> -	exit_task_work(tsk);
>  	check_stack_usage();
>  	exit_thread();
>
> @@ -822,6 +821,7 @@ void do_exit(long code)
>  	ptrace_put_breakpoints(tsk);
>
>  	exit_notify(tsk, group_dead);
> +	exit_task_work(tsk);

I am not comfortable with this change...

The task is "really dead" after exit_notify(), even release_task(current)
can be called.

Let me think a bit... It seems that we have the alternative.

Oleg.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 0/1] (Was: kernel: move exit_task_work() past exit_notify())
  2013-04-13 14:22 ` Oleg Nesterov
@ 2013-04-13 15:54   ` Oleg Nesterov
  2013-04-13 15:55     ` [PATCH 1/1] move exit_task_namespaces() outside of exit_notify() Oleg Nesterov
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2013-04-13 15:54 UTC (permalink / raw)
  To: Andrey Vagin, Eric W. Biederman
  Cc: linux-kernel, Andrew Morton, Al Viro, David Howells

On 04/13, Oleg Nesterov wrote:
>
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -795,7 +795,6 @@ void do_exit(long code)
> >  	exit_shm(tsk);
> >  	exit_files(tsk);
> >  	exit_fs(tsk);
> > -	exit_task_work(tsk);
> >  	check_stack_usage();
> >  	exit_thread();
> >
> > @@ -822,6 +821,7 @@ void do_exit(long code)
> >  	ptrace_put_breakpoints(tsk);
> >
> >  	exit_notify(tsk, group_dead);
> > +	exit_task_work(tsk);
>
> I am not comfortable with this change...
>
> The task is "really dead" after exit_notify(), even release_task(current)
> can be called.
>
> Let me think a bit... It seems that we have the alternative.

Andrey, Eric, how about this patch?

COMPLETELY UNTESTED and I need to recheck, but perhaps you can review?

Oleg.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()
  2013-04-13 15:54   ` [PATCH 0/1] (Was: kernel: move exit_task_work() past exit_notify()) Oleg Nesterov
@ 2013-04-13 15:55     ` Oleg Nesterov
  2013-04-14  1:52       ` Eric W. Biederman
                         ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Oleg Nesterov @ 2013-04-13 15:55 UTC (permalink / raw)
  To: Andrey Vagin, Eric W. Biederman
  Cc: linux-kernel, Andrew Morton, Al Viro, David Howells

exit_notify() does exit_task_namespaces() after
forget_original_parent(). This was needed to ensure that ->nsproxy
can't be cleared prematurely, an exiting child we are going to
reparent can do do_notify_parent() and use the parent's (ours) pid_ns.

However, after 32084504 "pidns: use task_active_pid_ns in
do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
on task_active_pid_ns().

Move exit_task_namespaces() from exit_notify() to do_exit(), after
exit_fs() and before exit_task_work().

This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
does fput() which needs task_work_add(). And this allows us do simplify
exit_notify(), we can avoid unlock/lock(tasklist) and we can change
->exit_state instead of PF_EXITING in forget_original_parent().

Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>

--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -649,7 +649,6 @@ static void exit_notify(struct task_stru
 	 *	jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
 	 */
 	forget_original_parent(tsk);
-	exit_task_namespaces(tsk);
 
 	write_lock_irq(&tasklist_lock);
 	if (group_dead)
@@ -795,6 +794,7 @@ void do_exit(long code)
 	exit_shm(tsk);
 	exit_files(tsk);
 	exit_fs(tsk);
+	exit_task_namespaces(tsk);
 	exit_task_work(tsk);
 	check_stack_usage();
 	exit_thread();


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()
  2013-04-13 15:55     ` [PATCH 1/1] move exit_task_namespaces() outside of exit_notify() Oleg Nesterov
@ 2013-04-14  1:52       ` Eric W. Biederman
  2013-04-15  9:57       ` Andrey Wagin
  2013-06-13  9:24       ` Andrew Vagin
  2 siblings, 0 replies; 9+ messages in thread
From: Eric W. Biederman @ 2013-04-14  1:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrey Vagin, linux-kernel, Andrew Morton, Al Viro, David Howells

Oleg Nesterov <oleg@redhat.com> writes:

> exit_notify() does exit_task_namespaces() after
> forget_original_parent(). This was needed to ensure that ->nsproxy
> can't be cleared prematurely, an exiting child we are going to
> reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
>
> However, after 32084504 "pidns: use task_active_pid_ns in
> do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
> on task_active_pid_ns().
>
> Move exit_task_namespaces() from exit_notify() to do_exit(), after
> exit_fs() and before exit_task_work().
>
> This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
> does fput() which needs task_work_add(). And this allows us do simplify
> exit_notify(), we can avoid unlock/lock(tasklist) and we can change
> ->exit_state instead of PF_EXITING in forget_original_parent().

It feels like this ought to work, certainly the pid namespace should not
need this, and the pid namespace was the motivating case for most of the
movement.  However we haven't called exit_task_namespaces this early
since 2006.

Ugh. I goofed and used that field in scm.c. Sigh.  I will push a patch
to rename that field nsproxy->childrens_pid_ns so it is harder to
make the mistake I just made.

None of the uses of nsproxy->net_ns look like they will be used on the
exit path.

The /proc/<pid>/ns/{uts,ipc,net,mnt,pid} files are fine as nsproxy
itself is what becomes NULL and they test for that.  Well except the pid
file uses task_active_pid_ns.

nsproxy->ipc_ns is isolated to files under ipc so it is probably fine.

Likewise the nsproxy->uts_ns uses look like they will be fine.

Likewise the nsproxy->mnt_ns uses look like they will be fine.

So in a quick skim through the uses no problem cases stick out, nor
can I think of anything that would cause trouble.  This looks like a
good patch.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

> Reported-by: Andrey Vagin <avagin@openvz.org>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -649,7 +649,6 @@ static void exit_notify(struct task_stru
>  	 *	jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
>  	 */
>  	forget_original_parent(tsk);
> -	exit_task_namespaces(tsk);
>  
>  	write_lock_irq(&tasklist_lock);
>  	if (group_dead)
> @@ -795,6 +794,7 @@ void do_exit(long code)
>  	exit_shm(tsk);
>  	exit_files(tsk);
>  	exit_fs(tsk);
> +	exit_task_namespaces(tsk);
>  	exit_task_work(tsk);
>  	check_stack_usage();
>  	exit_thread();

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()
  2013-04-13 15:55     ` [PATCH 1/1] move exit_task_namespaces() outside of exit_notify() Oleg Nesterov
  2013-04-14  1:52       ` Eric W. Biederman
@ 2013-04-15  9:57       ` Andrey Wagin
  2013-04-15 15:33         ` Oleg Nesterov
  2013-06-13  9:24       ` Andrew Vagin
  2 siblings, 1 reply; 9+ messages in thread
From: Andrey Wagin @ 2013-04-15  9:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, LKML, Andrew Morton, Al Viro, David Howells

2013/4/13 Oleg Nesterov <oleg@redhat.com>
>
> exit_notify() does exit_task_namespaces() after
> forget_original_parent(). This was needed to ensure that ->nsproxy
> can't be cleared prematurely, an exiting child we are going to
> reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
>
> However, after 32084504 "pidns: use task_active_pid_ns in
> do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
> on task_active_pid_ns().
>
> Move exit_task_namespaces() from exit_notify() to do_exit(), after
> exit_fs() and before exit_task_work().
>
> This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
> does fput() which needs task_work_add(). And this allows us do simplify
> exit_notify(), we can avoid unlock/lock(tasklist) and we can change
> ->exit_state instead of PF_EXITING in forget_original_parent().
>

It looks good for me. I have tested it a bit and don't find any problem.
Oleg, thank you.

Acked-by: Andrew Vagin <avagin@gmail.com>

> Reported-by: Andrey Vagin <avagin@openvz.org>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -649,7 +649,6 @@ static void exit_notify(struct task_stru
>          *      jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
>          */
>         forget_original_parent(tsk);
> -       exit_task_namespaces(tsk);
>
>         write_lock_irq(&tasklist_lock);
>         if (group_dead)
> @@ -795,6 +794,7 @@ void do_exit(long code)
>         exit_shm(tsk);
>         exit_files(tsk);
>         exit_fs(tsk);
> +       exit_task_namespaces(tsk);
>         exit_task_work(tsk);
>         check_stack_usage();
>         exit_thread();
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()
  2013-04-15  9:57       ` Andrey Wagin
@ 2013-04-15 15:33         ` Oleg Nesterov
  0 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2013-04-15 15:33 UTC (permalink / raw)
  To: Andrey Wagin, Al Viro
  Cc: Eric W. Biederman, LKML, Andrew Morton, David Howells

On 04/15, Andrey Wagin wrote:
>
> It looks good for me. I have tested it a bit and don't find any problem.
> Oleg, thank you.
>
> Acked-by: Andrew Vagin <avagin@gmail.com>

Thanks Andrey and Eric.

> > --- x/kernel/exit.c
> > +++ x/kernel/exit.c
> > @@ -649,7 +649,6 @@ static void exit_notify(struct task_stru
> >          *      jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
> >          */
> >         forget_original_parent(tsk);
> > -       exit_task_namespaces(tsk);
> >
> >         write_lock_irq(&tasklist_lock);
> >         if (group_dead)
> > @@ -795,6 +794,7 @@ void do_exit(long code)
> >         exit_shm(tsk);
> >         exit_files(tsk);
> >         exit_fs(tsk);
> > +       exit_task_namespaces(tsk);
> >         exit_task_work(tsk);

I do not see any problems with this patch too... but still I am worried.

Even if fput() can work correctly after exit_task_namespaces(), this limits
the usage of task_work_add(). Probably this is fine, but can't we at least
discuss another change?

We can change fput() so that it can always work, even after exit_task_work(),

	void fput(struct file *file)
	{
		if (atomic_long_dec_and_test(&file->f_count)) {
			struct task_struct *task = current;
			unsigned long flags;

			file_sb_list_del(file);
			if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
				init_task_work(&file->f_u.fu_rcuhead, ____fput);
				if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
					return;
			}

			spin_lock_irqsave(&delayed_fput_lock, flags);
			list_add(&file->f_u.fu_list, &delayed_fput_list);
			schedule_work(&delayed_fput_work);
			spin_unlock_irqrestore(&delayed_fput_lock, flags);
		}
	}

Al, what do you think?

Untested patch below.

Oleg.

--- x/fs/file_table.c
+++ x/fs/file_table.c
@@ -306,17 +306,19 @@ void fput(struct file *file)
 {
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		struct task_struct *task = current;
+		unsigned long flags;
+
 		file_sb_list_del(file);
-		if (unlikely(in_interrupt() || task->flags & PF_KTHREAD)) {
-			unsigned long flags;
-			spin_lock_irqsave(&delayed_fput_lock, flags);
-			list_add(&file->f_u.fu_list, &delayed_fput_list);
-			schedule_work(&delayed_fput_work);
-			spin_unlock_irqrestore(&delayed_fput_lock, flags);
-			return;
+		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+			init_task_work(&file->f_u.fu_rcuhead, ____fput);
+			if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
+				return;
 		}
-		init_task_work(&file->f_u.fu_rcuhead, ____fput);
-		task_work_add(task, &file->f_u.fu_rcuhead, true);
+
+		spin_lock_irqsave(&delayed_fput_lock, flags);
+		list_add(&file->f_u.fu_list, &delayed_fput_list);
+		schedule_work(&delayed_fput_work);
+		spin_unlock_irqrestore(&delayed_fput_lock, flags);
 	}
 }
 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()
  2013-04-13 15:55     ` [PATCH 1/1] move exit_task_namespaces() outside of exit_notify() Oleg Nesterov
  2013-04-14  1:52       ` Eric W. Biederman
  2013-04-15  9:57       ` Andrey Wagin
@ 2013-06-13  9:24       ` Andrew Vagin
  2013-06-14 13:19         ` Oleg Nesterov
  2 siblings, 1 reply; 9+ messages in thread
From: Andrew Vagin @ 2013-06-13  9:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrey Vagin, Eric W. Biederman, linux-kernel, Andrew Morton,
	Al Viro, David Howells

On Sat, Apr 13, 2013 at 05:55:21PM +0200, Oleg Nesterov wrote:
> exit_notify() does exit_task_namespaces() after
> forget_original_parent(). This was needed to ensure that ->nsproxy
> can't be cleared prematurely, an exiting child we are going to
> reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
> 
> However, after 32084504 "pidns: use task_active_pid_ns in
> do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
> on task_active_pid_ns().
> 
> Move exit_task_namespaces() from exit_notify() to do_exit(), after
> exit_fs() and before exit_task_work().
> 
> This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
> does fput() which needs task_work_add(). And this allows us do simplify
> exit_notify(), we can avoid unlock/lock(tasklist) and we can change
> ->exit_state instead of PF_EXITING in forget_original_parent().
>

It looks good for me. kmemleak doesn't report any leaks. CRIU test
cases, which use namespaces, work without any errors.

Thanks.

Acked-by: Andrey Vagin <avagin@openvz.org>

> Reported-by: Andrey Vagin <avagin@openvz.org>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> 
> --- x/kernel/exit.c
> +++ x/kernel/exit.c
> @@ -649,7 +649,6 @@ static void exit_notify(struct task_stru
>  	 *	jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
>  	 */
>  	forget_original_parent(tsk);
> -	exit_task_namespaces(tsk);
>  
>  	write_lock_irq(&tasklist_lock);
>  	if (group_dead)
> @@ -795,6 +794,7 @@ void do_exit(long code)
>  	exit_shm(tsk);
>  	exit_files(tsk);
>  	exit_fs(tsk);
> +	exit_task_namespaces(tsk);
>  	exit_task_work(tsk);
>  	check_stack_usage();
>  	exit_thread();
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] move exit_task_namespaces() outside of exit_notify()
  2013-06-13  9:24       ` Andrew Vagin
@ 2013-06-14 13:19         ` Oleg Nesterov
  0 siblings, 0 replies; 9+ messages in thread
From: Oleg Nesterov @ 2013-06-14 13:19 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Andrey Vagin, Eric W. Biederman, linux-kernel, Andrew Morton,
	Al Viro, David Howells

On 06/13, Andrew Vagin wrote:
>
> On Sat, Apr 13, 2013 at 05:55:21PM +0200, Oleg Nesterov wrote:
> > exit_notify() does exit_task_namespaces() after
> > forget_original_parent(). This was needed to ensure that ->nsproxy
> > can't be cleared prematurely, an exiting child we are going to
> > reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
> >
> > However, after 32084504 "pidns: use task_active_pid_ns in
> > do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
> > on task_active_pid_ns().
> >
> > Move exit_task_namespaces() from exit_notify() to do_exit(), after
> > exit_fs() and before exit_task_work().
> >
> > This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
> > does fput() which needs task_work_add(). And this allows us do simplify
> > exit_notify(), we can avoid unlock/lock(tasklist) and we can change
> > ->exit_state instead of PF_EXITING in forget_original_parent().
> >
>
> It looks good for me. kmemleak doesn't report any leaks. CRIU test
> cases, which use namespaces, work without any errors.

OK, thanks.

I guess I need to resend this patch with your and Eric's acks.

But in fact I am going to send the 2nd change as well, it should
fix the same problem too and imho it makes sense anyway. And to
me the 2nd one looks like 3.10 material, it is much more simple
and straightforward.

Oleg.


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2013-06-14 13:23 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-12 19:27 [PATCH] kernel: move exit_task_work() past exit_notify() Andrey Vagin
2013-04-13 14:22 ` Oleg Nesterov
2013-04-13 15:54   ` [PATCH 0/1] (Was: kernel: move exit_task_work() past exit_notify()) Oleg Nesterov
2013-04-13 15:55     ` [PATCH 1/1] move exit_task_namespaces() outside of exit_notify() Oleg Nesterov
2013-04-14  1:52       ` Eric W. Biederman
2013-04-15  9:57       ` Andrey Wagin
2013-04-15 15:33         ` Oleg Nesterov
2013-06-13  9:24       ` Andrew Vagin
2013-06-14 13:19         ` Oleg Nesterov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.