* [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
@ 2017-09-13 20:09 Jaegeuk Kim
  2017-09-13 23:04 ` Al Viro
  2017-09-20 17:38 ` [PATCH v2] " Jaegeuk Kim
  0 siblings, 2 replies; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-13 20:09 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-f2fs-devel; +Cc: Jaegeuk Kim, Al Viro

This patch introduces a UMOUNT_WAIT flag for umount(2) which lets the user wait
for completion of the unmount. This would fix a kernel panic caused by the
filesystem accessing the block device after device_shutdown during
kernel_restart. That can happen due to a delayed umount: the reboot process has
already unmounted the filesystem successfully, but its instance is still alive.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
---
 fs/namespace.c     | 12 +++++++++++-
 include/linux/fs.h |  1 +
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index f8893dc6a989..b1ac89915b10 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -21,6 +21,7 @@
 #include <linux/fs_struct.h>	/* get_fs_root et.al. */
 #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
 #include <linux/uaccess.h>
+#include <linux/file.h>
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/bootmem.h>
@@ -1629,7 +1630,8 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	int retval;
 	int lookup_flags = 0;
 
-	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
+	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW |
+			UMOUNT_WAIT))
 		return -EINVAL;
 
 	if (!may_mount())
@@ -1652,12 +1654,20 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	retval = -EPERM;
 	if (flags & MNT_FORCE && !capable(CAP_SYS_ADMIN))
 		goto dput_and_out;
+	if (flags & UMOUNT_WAIT)
+		flush_delayed_fput();
 
 	retval = do_umount(mnt, flags);
 dput_and_out:
 	/* we mustn't call path_put() as that would clear mnt_expiry_mark */
 	dput(path.dentry);
 	mntput_no_expire(mnt);
+	if (!retval && (flags & UMOUNT_WAIT)) {
+		if (likely(!(current->flags & PF_KTHREAD)))
+			task_work_run();
+		else
+			flush_scheduled_work();
+	}
 out:
 	return retval;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 83341a6a553e..cb62af7a03e7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1283,6 +1283,7 @@ struct mm_struct;
 #define MNT_DETACH	0x00000002	/* Just detach from the tree */
 #define MNT_EXPIRE	0x00000004	/* Mark for expiry */
 #define UMOUNT_NOFOLLOW	0x00000008	/* Don't follow symlink on umount */
+#define UMOUNT_WAIT	0x00000010	/* Wait to unmount completely */
 #define UMOUNT_UNUSED	0x80000000	/* Flag guaranteed to be unused */
 
 /* sb->s_iflags */
-- 
2.14.0.rc1.383.gd1ce394fe2-goog

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-13 20:09 [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion Jaegeuk Kim
@ 2017-09-13 23:04 ` Al Viro
  2017-09-13 23:31   ` Jaegeuk Kim
  2017-09-20 17:38 ` [PATCH v2] " Jaegeuk Kim
  1 sibling, 1 reply; 30+ messages in thread
From: Al Viro @ 2017-09-13 23:04 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote:
> +	if (!retval && (flags & UMOUNT_WAIT)) {
> +		if (likely(!(current->flags & PF_KTHREAD)))
> +			task_work_run();

This is complete crap.  The same damn thing will be done by
caller of sys_umount() pretty much immediately afterwards.
I'm not sure what it is that you are trying to paper over,
but this is just plain wrong.

What _is_ the semantics of UMOUNT_WAIT?  What does it guarantee,
and what would be supplying it to umount(2)?


* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-13 23:04 ` Al Viro
@ 2017-09-13 23:31   ` Jaegeuk Kim
  2017-09-13 23:44     ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-13 23:31 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

Hi Al,

On 09/14, Al Viro wrote:
> On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote:
> > +	if (!retval && (flags & UMOUNT_WAIT)) {
> > +		if (likely(!(current->flags & PF_KTHREAD)))
> > +			task_work_run();
> 
> This is complete crap.  The same damn thing will be done by
> caller of sys_umount() pretty much immediately afterwards.
> I'm not sure what it is that you are trying to paper over,
> but this is just plain wrong.

Okay.

> What _is_ the semantics of UMOUNT_WAIT?  What does it guarantee,
> and what would be supplying it to umount(2)?

When Android tries to reboot the system, it calls umount(2) without any flags.
Then, mntput_no_expire() will schedule delayed_mntput_work(), which finally does
cleanup_mnt() later. In the meantime, Android has proceeded to shut down all
the UFS devices, but the filesystem would still be alive and try to trigger
some I/O. At this moment, I'd like to avoid EIO, since ext4 can trigger a
kernel panic because of errors=panic.

So, what I'm trying to do is 1) adding a flag to wait for umount() completion,
and 2) issuing umount(2) with UMOUNT_WAIT in Android. Then, it can guarantee
there'd be no I/O after a successful umount().

Could you please correct me?

Thanks,


* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-13 23:31   ` Jaegeuk Kim
@ 2017-09-13 23:44     ` Al Viro
  2017-09-14  1:10         ` Jaegeuk Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2017-09-13 23:44 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On Wed, Sep 13, 2017 at 04:31:16PM -0700, Jaegeuk Kim wrote:
> Hi Al,
> 
> On 09/14, Al Viro wrote:
> > On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote:
> > > +	if (!retval && (flags & UMOUNT_WAIT)) {
> > > +		if (likely(!(current->flags & PF_KTHREAD)))
> > > +			task_work_run();
> > 
> > This is complete crap.  The same damn thing will be done by
> > caller of sys_umount() pretty much immediately afterwards.
> > I'm not sure what it is that you are trying to paper over,
> > but this is just plain wrong.
> 
> Okay.
> 
> > What _is_ the semantics of UMOUNT_WAIT?  What does it guarantee,
> > and what would be supplying it to umount(2)?
> 
> When android tries to reboot the system, it calls umount(2) without any flag.
> Then, mntput_no_expire() will add delayed_mntput_work() which finally does
> cleanup_mnt() later. In the mean time, android proceeded to shutdown all
> the UFS devices.

Why has task_work_add() failed?  Or is that umount(2) issued by a kernel thread?


* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-13 23:44     ` Al Viro
@ 2017-09-14  1:10         ` Jaegeuk Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-14  1:10 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On 09/14, Al Viro wrote:
> On Wed, Sep 13, 2017 at 04:31:16PM -0700, Jaegeuk Kim wrote:
> > Hi Al,
> > 
> > On 09/14, Al Viro wrote:
> > > On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote:
> > > > +	if (!retval && (flags & UMOUNT_WAIT)) {
> > > > +		if (likely(!(current->flags & PF_KTHREAD)))
> > > > +			task_work_run();
> > > 
> > > This is complete crap.  The same damn thing will be done by
> > > caller of sys_umount() pretty much immediately afterwards.
> > > I'm not sure what it is that you are trying to paper over,
> > > but this is just plain wrong.
> > 
> > Okay.
> > 
> > > What _is_ the semantics of UMOUNT_WAIT?  What does it guarantee,
> > > and what would be supplying it to umount(2)?
> > 
> > When android tries to reboot the system, it calls umount(2) without any flag.
> > Then, mntput_no_expire() will add delayed_mntput_work() which finally does
> > cleanup_mnt() later. In the mean time, android proceeded to shutdown all
> > the UFS devices.
> 
> Why has task_work_add() failed?  Or is that umount(2) issued by a kernel thread?

Android triggers umount(2) from the init process, which is definitely not a
kernel thread. But we've seen some kernel panics which show that umount(2)
succeeded, yet ext4 triggered a kernel panic due to EIO after that, like below.
I'm also not sure task_work_run() would be safe enough. May I ask where
sys_umount() calls task_work_run()?

[254012.860565] c4  12426 [<ffffff909b6a7ebc>] panic+0x184/0x37c
[254012.860589] c4  12426 [<ffffff909b88e724>] __ext4_abort+0x198/0x19c
[254012.860606] c4  12426 [<ffffff909b897468>] ext4_put_super+0x80/0x2b4
[254012.860629] c4  12426 [<ffffff909b7e7274>] generic_shutdown_super+0x68/0xd0
[254012.860646] c4  12426 [<ffffff909b7e87fc>] kill_block_super+0x1c/0x5c
[254012.860663] c4  12426 [<ffffff909b7e70dc>] deactivate_locked_super+0x5c/0xc0
[254012.860679] c4  12426 [<ffffff909b7e71a8>] deactivate_super+0x68/0x74
[254012.860696] c4  12426 [<ffffff909b80a25c>] cleanup_mnt+0xb0/0x12c
[254012.860712] c4  12426 [<ffffff909b80a310>] delayed_mntput+0x38/0x4c
[254012.860737] c4  12426 [<ffffff909b6c6524>] process_one_work+0x1e0/0x490
[254012.860753] c4  12426 [<ffffff909b6c6094>] worker_thread+0x314/0x494
[254012.860771] c4  12426 [<ffffff909b6cb35c>] kthread+0xdc/0xec
[254012.860790] c4  12426 [<ffffff909b683860>] ret_from_fork+0x10/0x30



* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-14  1:10         ` Jaegeuk Kim
@ 2017-09-14  1:30         ` Al Viro
  2017-09-14 18:37           ` Al Viro
  -1 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2017-09-14  1:30 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On Wed, Sep 13, 2017 at 06:10:48PM -0700, Jaegeuk Kim wrote:

> Android triggers umount(2) by init process, which is definitely not a kernel
> thread. But, we've seen some kernel panics which say umount(2) was succeeded,
> but ext4 triggered a kernel panic due to EIO after then like below. I'm also
> not sure task_work_run() would be also safe enoughly. May I ask where I can
> find sys_umount() calls task_work_run()?

ret_{fast,slow}_syscall ->
	slow_work_pending ->
		do_work_pending() ->
			tracehook_notify_resume() ->
				task_work_run()

It's not sys_umount() (or any other sys_...()) - it's syscall dispatcher after
having called one of those and before returning to userland.  What is guaranteed
is that after successful task_work_add() the damn thing will be run in context
of originating process before it returns from syscall.  So any subsequent
syscalls from that process are guaranteed to happen after the work has run.
The same happens if the process exits rather than returns to userland (do_exit() ->
exit_task_work() -> task_work_run()), but for that you would need it to die in
umount(2) (e.g. get kill -9 delivered on the way out).

Please, check if you are seeing task_work_add() failure in there and if you do,
I would like to see a stack trace.  IOW, slap WARN_ON(1); right after
                        if (!task_work_add(task, &mnt->mnt_rcu, true))
                                return;
and see what (if anything) gets printed.


* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-14  1:30         ` Al Viro
@ 2017-09-14 18:37           ` Al Viro
  2017-09-14 19:14             ` Jaegeuk Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2017-09-14 18:37 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On Thu, Sep 14, 2017 at 02:30:17AM +0100, Al Viro wrote:
> On Wed, Sep 13, 2017 at 06:10:48PM -0700, Jaegeuk Kim wrote:
> 
> > Android triggers umount(2) by init process, which is definitely not a kernel
> > thread. But, we've seen some kernel panics which say umount(2) was succeeded,
> > but ext4 triggered a kernel panic due to EIO after then like below. I'm also
> > not sure task_work_run() would be also safe enoughly. May I ask where I can
> > find sys_umount() calls task_work_run()?
> 
> ret_{fast,slow}_syscall ->
> 	slow_work_pending ->
> 		do_work_pending() ->
> 			tracehook_notify_resume() ->
> 				task_work_run()
> 
> It's not sys_umount() (or any other sys_...()) - it's syscall dispatcher after
> having called one of those and before returning to userland.  What is guaranteed
> is that after successful task_work_add() the damn thing will be run in context
> of originating process before it returns from syscall.  So any subsequent
> syscalls from that process are guaranteed to happen after the work has run.
> The same happens if the process exits rather than returns to userland (do_exit() ->
> exit_task_work() -> task_work_run()), but for that you would need it to die in
> umount(2) (e.g. get kill -9 delivered on the way out).
> 
> Please, check if you are seeing task_work_add() failure in there and if you do,
> I would like to see a stack trace.  IOW, slap WARN_ON(1); right after
>                         if (!task_work_add(task, &mnt->mnt_rcu, true))
>                                 return;
> and see what (if anything) gets printed.

AFAICS, for task_work_add() to fail here we need a final mntput() to be run
in context of a thread that already had exit_signals() run *and* subsequent
task_work_run() run to completion (with all pending callbacks executed, along
with all callbacks added by those, etc.)

For that to have happened during umount(2) we would've needed
	* killing signal delivered while going through the syscall
	* final mntput() to have been done *NOT* from sys_umount() (otherwise
the work would've been added before we got to exit_signals())
	* final mntput() to have been done *NOT* from any task_work callbacks
(otherwise it would've been added before we'd observed a combination of empty
list of pending work with PF_EXITING)

I really want to see the stack trace of that failing task_work_add(), if that's
what actually happens there.  What kind of a reproducer do you have for that?


* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-14 18:37           ` Al Viro
@ 2017-09-14 19:14             ` Jaegeuk Kim
  2017-09-15  0:19               ` Jaegeuk Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-14 19:14 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On 09/14, Al Viro wrote:
> On Thu, Sep 14, 2017 at 02:30:17AM +0100, Al Viro wrote:
> > On Wed, Sep 13, 2017 at 06:10:48PM -0700, Jaegeuk Kim wrote:
> > 
> > > Android triggers umount(2) by init process, which is definitely not a kernel
> > > thread. But, we've seen some kernel panics which say umount(2) was succeeded,
> > > but ext4 triggered a kernel panic due to EIO after then like below. I'm also
> > > not sure task_work_run() would be also safe enoughly. May I ask where I can
> > > find sys_umount() calls task_work_run()?
> > 
> > ret_{fast,slow}_syscall ->
> > 	slow_work_pending ->
> > 		do_work_pending() ->
> > 			tracehook_notify_resume() ->
> > 				task_work_run()
> > 
> > It's not sys_umount() (or any other sys_...()) - it's syscall dispatcher after
> > having called one of those and before returning to userland.  What is guaranteed
> > is that after successful task_work_add() the damn thing will be run in context
> > of originating process before it returns from syscall.  So any subsequent
> > syscalls from that process are guaranteed to happen after the work has run.
> > The same happens if the process exits rather than returns to userland (do_exit() ->
> > exit_task_work() -> task_work_run()), but for that you would need it to die in
> > umount(2) (e.g. get kill -9 delivered on the way out).
> > 
> > Please, check if you are seeing task_work_add() failure in there and if you do,
> > I would like to see a stack trace.  IOW, slap WARN_ON(1); right after
> >                         if (!task_work_add(task, &mnt->mnt_rcu, true))
> >                                 return;
> > and see what (if anything) gets printed.
> 
> AFAICS, for task_work_add() to fail here we need a final mntput() to be run
> in context of a thread that already had exit_signals() run *and* subsequent
> task_work_run() run to completion (with all pending callbacks executed, along
> with all callbacks added by those, etc.)
> 
> For that to have happened during umount(2) we would've needed
> 	* killing signal delivered while going through the syscall
> 	* final mntput() to have been done *NOT* from sys_umount() (otherwise
> the work would've been added before we got to exit_signals())
> 	* final mntput() to have been done *NOT* from any task_work callbacks
> (otherwise it would've been added before we'd observed a combination of empty
> list of pending work with PF_EXITING)
> 
> I really want to see the stack trace of that failing task_work_add(), if that's
> what actually happens there.  What kind of a reproducer do you have for that?

I got this error from an Android user, so unfortunately there's no reproducer.
So, I wrote a script capturing the WARN_ON after reboot, running every minute,
but I haven't been able to hit the error since yesterday.


* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-14 19:14             ` Jaegeuk Kim
@ 2017-09-15  0:19               ` Jaegeuk Kim
  2017-09-15  2:06                   ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-15  0:19 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On 09/14, Jaegeuk Kim wrote:
> On 09/14, Al Viro wrote:
> > On Thu, Sep 14, 2017 at 02:30:17AM +0100, Al Viro wrote:
> > > On Wed, Sep 13, 2017 at 06:10:48PM -0700, Jaegeuk Kim wrote:
> > > 
> > > > Android triggers umount(2) by init process, which is definitely not a kernel
> > > > thread. But, we've seen some kernel panics which say umount(2) was succeeded,
> > > > but ext4 triggered a kernel panic due to EIO after then like below. I'm also
> > > > not sure task_work_run() would be also safe enoughly. May I ask where I can
> > > > find sys_umount() calls task_work_run()?
> > > 
> > > ret_{fast,slow}_syscall ->
> > > 	slow_work_pending ->
> > > 		do_work_pending() ->
> > > 			tracehook_notify_resume() ->
> > > 				task_work_run()
> > > 
> > > It's not sys_umount() (or any other sys_...()) - it's syscall dispatcher after
> > > having called one of those and before returning to userland.  What is guaranteed
> > > is that after successful task_work_add() the damn thing will be run in context
> > > of originating process before it returns from syscall.  So any subsequent
> > > syscalls from that process are guaranteed to happen after the work has run.
> > > The same happens if the process exits rather than returns to userland (do_exit() ->
> > > exit_task_work() -> task_work_run()), but for that you would need it to die in
> > > umount(2) (e.g. get kill -9 delivered on the way out).
> > > 
> > > Please, check if you are seeing task_work_add() failure in there and if you do,
> > > I would like to see a stack trace.  IOW, slap WARN_ON(1); right after
> > >                         if (!task_work_add(task, &mnt->mnt_rcu, true))
> > >                                 return;
> > > and see what (if anything) gets printed.
> > 
> > AFAICS, for task_work_add() to fail here we need a final mntput() to be run
> > in context of a thread that already had exit_signals() run *and* subsequent
> > task_work_run() run to completion (with all pending callbacks executed, along
> > with all callbacks added by those, etc.)
> > 
> > For that to have happened during umount(2) we would've needed
> > 	* killing signal delivered while going through the syscall
> > 	* final mntput() to have been done *NOT* from sys_umount() (otherwise
> > the work would've been added before we got to exit_signals())
> > 	* final mntput() to have been done *NOT* from any task_work callbacks
> > (otherwise it would've been added before we'd observed a combination of empty
> > list of pending work with PF_EXITING)
> > 
> > I really want to see the stack trace of that failing task_work_add(), if that's
> > what actually happens there.  What kind of a reproducer do you have for that?
> 
> I've got this error from Android user, so there's no reproducer unfortunately.
> So, I wrote a script capturing WARN_ON after reboot running at every minute, but
> couldn't have got the error since yesterday so far.

Instead, I put more traces in the reboot procedure, and got a clue to suspect
the flow below.

delayed_fput()                 init
                               - umount
 - mntput()
 - mntput_no_expire()            - mntput_no_expire()
                                 - mnt_add_count(-1);
                                 - mnt_get_count() return;
                                 - return 0;
 - mnt_add_count(-1);
 - delayed_mntput_work
                               - device_shutdown
 - ext4_put_super()
 - EIO

Does this make any sense?

Thanks,


* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-15  0:19               ` Jaegeuk Kim
@ 2017-09-15  2:06                   ` Al Viro
  0 siblings, 0 replies; 30+ messages in thread
From: Al Viro @ 2017-09-15  2:06 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On Thu, Sep 14, 2017 at 05:19:39PM -0700, Jaegeuk Kim wrote:
> Instead, I put more traces in the reboot procedure, and got a clue to suspect
> the below flow.
> 
> delayed_fput()                 init
>                                - umount
>  - mntput()
>  - mntput_no_expire()            - mntput_no_expire()
>                                  - mnt_add_count(-1);
>                                  - mnt_get_count() return;
>                                  - return 0;
>  - mnt_add_count(-1);
>  - delayed_mntput_work
>                                - device_shutdown
>  - ext4_put_super()
>  - EIO
> 
> Does this make any sense?

Which filesystem it is?  With root I would've expected remount ro done
by sys_umount(); with anything else...  How has it managed to avoid
-EBUSY?  If it was umount -l (IOW, MNT_DETACH), I can see that happening,
but...  How would flushing prevent the scenario when the same opened
file had remained open until after the umount(2) return?

In other words, where has that fput() come from and how had it managed
to get past the umount(2)?



* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-15  2:06                   ` Al Viro
@ 2017-09-15  3:45                   ` Jaegeuk Kim
  2017-09-15  4:21                     ` Al Viro
  -1 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-15  3:45 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On 09/15, Al Viro wrote:
> On Thu, Sep 14, 2017 at 05:19:39PM -0700, Jaegeuk Kim wrote:
> > Instead, I put more traces in the reboot procedure, and got a clue to suspect
> > the below flow.
> > 
> > delayed_fput()                 init
> >                                - umount
> >  - mntput()
> >  - mntput_no_expire()            - mntput_no_expire()
> >                                  - mnt_add_count(-1);
> >                                  - mnt_get_count() return;
> >                                  - return 0;
> >  - mnt_add_count(-1);
> >  - delayed_mntput_work
> >                                - device_shutdown
> >  - ext4_put_super()
> >  - EIO
> > 
> > Does this make any sense?
> 
> Which filesystem it is?  With root I would've expected remount ro done
> by sys_umount(); with anything else...  How has it managed to avoid
> -EBUSY?  If it was umount -l (IOW, MNT_DETACH), I can see that happening,
> but...  How would flushing prevent the scenario when the same opened
> file had remained open until after the umount(2) return?

It's ext4, and we use umount(0) and retry it several times if -EBUSY happens.
But I don't see any -EBUSY error in the log.

> In other words, where has that fput() come from and how had it managed
> to get past the umount(2)?

A huge number of fput() calls were made by system drivers when init killed all
the processes before umount(2), so most of them were added to delayed_fput_list.
Then, it seems there is a race between delayed_fput() and umount(). Anyway,
even after umount() returns zero, it seems ext4's superblock is still alive,
waiting for delayed_fput(), which will finally call put_super().


* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-15  3:45                   ` Jaegeuk Kim
@ 2017-09-15  4:21                     ` Al Viro
  2017-09-15 18:44                       ` Jaegeuk Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2017-09-15  4:21 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On Thu, Sep 14, 2017 at 08:45:18PM -0700, Jaegeuk Kim wrote:

> > Which filesystem it is?  With root I would've expected remount ro done
> > by sys_umount(); with anything else...  How has it managed to avoid
> > -EBUSY?  If it was umount -l (IOW, MNT_DETACH), I can see that happening,
> > but...  How would flushing prevent the scenario when the same opened
> > file had remained open until after the umount(2) return?
> 
> It's ext4, and we use umount(0) and retry it several times if -EBUSY happens.

????

umount(0) will result in EFAULT; what are you talking about?

> But, I don't see -EBUSY error in the log.

Sorry, I'd been unclear - where is it mounted?  Is that the root filesystem?

> > In other words, where has that fput() come from and how had it managed
> > to get past the umount(2)?
> 
> Huge number of fput() were called by system drivers when init kills all the
> processes before umount(2). So, most of fput() were added in delayed_fput_list.

Umm...  What do you mean by system drivers?  If it was held by userland processes,
then we are back to the same question - why has task_work_add() failed in fput()?
If it had been kernel threads, which files had they been holding open?

> Then, it seems there is a race between delayed_fput() and umount(). Anyway,
> even after umount returns zero, it seems ext4's superblock is still alive
> and waiting for delayed_fput() which will finally call put_super.

That might be more than one mount of the same fs (in different namespaces, for
example) with umount taking out one of those, with the other having been
hit with umount -l before that, with some opened files being the only thing
that used to keep it alive.

I'd like to see /proc/1/mountinfo and fuser output, TBH...  I'm not familiar enough
with Android userland setup, so my apologies for dumb questions ;-/


* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-15  4:21                     ` Al Viro
@ 2017-09-15 18:44                       ` Jaegeuk Kim
  2017-09-15 22:12                           ` Theodore Ts'o
  0 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-15 18:44 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On 09/15, Al Viro wrote:
> On Thu, Sep 14, 2017 at 08:45:18PM -0700, Jaegeuk Kim wrote:
> 
> > > Which filesystem it is?  With root I would've expected remount ro done
> > > by sys_umount(); with anything else...  How has it managed to avoid
> > > -EBUSY?  If it was umount -l (IOW, MNT_DETACH), I can see that happening,
> > > but...  How would flushing prevent the scenario when the same opened
> > > file had remained open until after the umount(2) return?
> > 
> > It's ext4, and we use umount(0) and retry it several times if -EBUSY happens.
> 
> ????
> 
> umount(0) will result in EFAULT; what are you talking about?

Sorry, I meant: umount2("/data", 0);

> 
> > But, I don't see -EBUSY error in the log.
> 
> Sorry, I'd been unclear - where is it mounted?  Is that the root filesystem?

No, it's /userdata in Android.

> > > In other words, where has that fput() come from and how had it managed
> > > to get past the umount(2)?
> > 
> > Huge number of fput() were called by system drivers when init kills all the
> > processes before umount(2). So, most of fput() were added in delayed_fput_list.
> 
> Umm...  What do you mean by system drivers?  If it was held by userland processes,
> then we are back to the same question - why has task_work_add() failed in fput()?
> If it had been kernel threads, which files had they been holding open?

So, I dug into it in more detail and found, in drivers/android/binder.c [1],
- binder_ioctl()
 - create a kernel thread
 - zombie_cleanup_check()
  - binder_defer_work()
    - queue_work(..., &binder_deferred_work);

- binder_deferred_func()
 - binder_clear_zombies()
  - binder_proc_clear_zombies()
   - put_files_struct()
    - close_files()
     - filp_close()
      - fput()

It seems binder holds some proc files. If you think this is an Android-specific
issue, I may need to write a patch for the Android kernel instead. Let me know.

[1] https://android.googlesource.com/kernel/msm/+/android-8.0.0_r0.4/drivers/staging/android/binder.c

> 
> > Then, it seems there is a race between delayed_fput() and umount(). Anyway,
> > even after umount returns zero, it seems ext4's superblock is still alive
> > and waiting for delayed_fput() which will finally call put_super.
> 
> That might be more than one mount of the same fs (in different namespaces, for
> example) with umount taking out one of those, with the other having been
> hit with umount -l before that, with some opened files being the only thing
> that used to keep it alive.
> 
> I'd like to see /proc/1/mountinfo and fuser output, TBH...  I'm not familiar enough
> with Android userland setup, so my apologies for dumb questions ;-/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-15 18:44                       ` Jaegeuk Kim
@ 2017-09-15 22:12                           ` Theodore Ts'o
  0 siblings, 0 replies; 30+ messages in thread
From: Theodore Ts'o @ 2017-09-15 22:12 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: Al Viro, linux-kernel, linux-fsdevel, linux-f2fs-devel

On Fri, Sep 15, 2017 at 11:44:33AM -0700, Jaegeuk Kim wrote:
> 
> So, I digged it in more detail, and found, in drivers/android/binder.c [1],
> - binder_ioctl()
>  - create a kernel thread
>  - zombie_cleanup_check()
>   - binder_defer_work()
>     - queue_work(..., &binder_deferred_work);
> 
> - binder_deferred_func()
>  - binder_clear_zombies()
>   - binder_proc_clear_zombies()
>    - put_files_struct()
>     - close_files()
>      - filp_close()
>       - fput()
> 
> It seems binder holds some proc files.

If binder was holding some files open, then umount should have failed
with EBUSY, no?

Does Android use mount namespaces at all?

					- Ted

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-15 22:12                           ` Theodore Ts'o
@ 2017-09-15 23:29                           ` Jaegeuk Kim
  2017-09-15 23:43                             ` Al Viro
  -1 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-15 23:29 UTC (permalink / raw)
  To: Theodore Ts'o, Al Viro, linux-kernel, linux-fsdevel,
	linux-f2fs-devel

On 09/15, Theodore Ts'o wrote:
> On Fri, Sep 15, 2017 at 11:44:33AM -0700, Jaegeuk Kim wrote:
> > 
> > So, I digged it in more detail, and found, in drivers/android/binder.c [1],
> > - binder_ioctl()
> >  - create a kernel thread
> >  - zombie_cleanup_check()
> >   - binder_defer_work()
> >     - queue_work(..., &binder_deferred_work);
> > 
> > - binder_deferred_func()
> >  - binder_clear_zombies()
> >   - binder_proc_clear_zombies()
> >    - put_files_struct()
> >     - close_files()
> >      - filp_close()
> >       - fput()
> > 
> > It seems binder holds some proc files.
> 
> If binder was holding some files open, then umount should have failed
> with EBUSY, no?

Based on the traces I've gathered so far:

- binder_ioctl
 - create a kernel thread
 - zombie_cleanup_check
  - binder_defer_work
   - queue_work(..., &binder_deferred_work);

- binder_deferred_func
 - binder_clear_zombies
  - binder_proc_clear_zombies
   - put_files_struct
     - close_files
      - filp_close
       - fput

- delayed_fput
 ...
 - file_free
 - dput
                                                init
                                                - umount
 - mntput
  - mntput_no_expire
                                                - do_umount
						 - mnt_get_count() > 2
                                                - mntput_no_expire
                                                 - mnt_add_count(-1);
   - mnt_add_count(-1);
                                                 - mnt_get_count() return;
                                                 - return 0;
   - delayed_mntput_work
                                                - device_shutdown
    - ext4_put_super()
     - EIO, and panic if error=panic


The mntput() in delayed_fput() is the last function call. So before that moment,
sys_umount() may see mnt_get_count() as 2, which avoids the -EBUSY condition. I'm
not sure why it checks for more than 2, though.

> 
> Does Android use mount namespaces at all?
> 
> 					- Ted

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-15 23:29                           ` Jaegeuk Kim
@ 2017-09-15 23:43                             ` Al Viro
  2017-09-19 15:55                               ` Jaegeuk Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2017-09-15 23:43 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Theodore Ts'o, linux-kernel, linux-fsdevel, linux-f2fs-devel

On Fri, Sep 15, 2017 at 04:29:11PM -0700, Jaegeuk Kim wrote:

> The mntput() in delayed_fput() is the last function call. So before that moment,
> sys_umount() may see mnt_get_count() as 2, so it avoids EBUSY condition. I'm not
> sure why it check over 2 tho.

Because it has just grabbed a reference itself, in addition to the one that keeps
the damn thing alive (due to being mounted).  So it bloody well should have
triggered -EBUSY, if they refer to the same vfsmount.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-15 22:12                           ` Theodore Ts'o
@ 2017-09-16  7:11                             ` Amir Goldstein
  -1 siblings, 0 replies; 30+ messages in thread
From: Amir Goldstein @ 2017-09-16  7:11 UTC (permalink / raw)
  To: Theodore Ts'o, Jaegeuk Kim, Al Viro, linux-kernel,
	linux-fsdevel, linux-f2fs-devel

On Sat, Sep 16, 2017 at 1:12 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Fri, Sep 15, 2017 at 11:44:33AM -0700, Jaegeuk Kim wrote:
>>
>> So, I digged it in more detail, and found, in drivers/android/binder.c [1],
>> - binder_ioctl()
>>  - create a kernel thread
>>  - zombie_cleanup_check()
>>   - binder_defer_work()
>>     - queue_work(..., &binder_deferred_work);
>>
>> - binder_deferred_func()
>>  - binder_clear_zombies()
>>   - binder_proc_clear_zombies()
>>    - put_files_struct()
>>     - close_files()
>>      - filp_close()
>>       - fput()
>>
>> It seems binder holds some proc files.
>
> If binder was holding some files open, then umount should have failed
> with EBUSY, no?
>
> Does Android use mount namespaces at all?
>

Extensively.
Every user (i.e. from multi user) has its own mount ns with private /data
Every app has its own mount ns, with /sdcard mounted to one of 3 FUSE
sdcard mounts depending of app storage permission (none, rdonly, rdwr).

Amir.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-15 23:43                             ` Al Viro
@ 2017-09-19 15:55                               ` Jaegeuk Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-19 15:55 UTC (permalink / raw)
  To: Al Viro; +Cc: Theodore Ts'o, linux-kernel, linux-fsdevel, linux-f2fs-devel

On 09/16, Al Viro wrote:
> On Fri, Sep 15, 2017 at 04:29:11PM -0700, Jaegeuk Kim wrote:
> 
> > The mntput() in delayed_fput() is the last function call. So before that moment,
> > sys_umount() may see mnt_get_count() as 2, so it avoids EBUSY condition. I'm not
> > sure why it check over 2 tho.
> 
> Because it has just grabbed a reference itself, in addition to the one that keeps
> the damn thing alive (due to being mounted).  So it bloody well should have
> triggered -EBUSY, if they refer to the same vfsmount.

I've tracked mnt_get_count() and sb->s_active from another angle, across
namespaces, and came up with the following scenario.

Term: namespace(mnt_get_count())

1. create_new_namespaces() creates ns1 and ns2,

  /data(1)    ns1(1)    ns2(1)
    |          |          |
     ---------------------
               |
        sb->s_active = 3

2. after binder_proc_clear_zombies() for ns2 and ns1 triggers
  - delayed_fput()
    - delayed_mntput_work(ns2)

  /data(1)    ns1(1)
    |          |
     ----------
          |
    sb->s_active = 2

3. umount() for /data succeeds.

  ns1(1)
    |
 sb->s_active = 1

4. device_shutdown() by init

5.  - delayed_mntput_work(ns1)
     - put_super(), since sb->s_active = 0
       - -EIO

Please let me know if I'm missing something.

Thanks,

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-13 20:09 [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion Jaegeuk Kim
  2017-09-13 23:04 ` Al Viro
@ 2017-09-20 17:38 ` Jaegeuk Kim
  2017-09-20 18:38   ` Al Viro
  2017-09-21 18:20   ` [PATCH v3] vfs: introduce UMOUNT_WAIT to wait for delayed_fput/mntput completion Jaegeuk Kim
  1 sibling, 2 replies; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-20 17:38 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-f2fs-devel; +Cc: Al Viro

This patch introduces the UMOUNT_WAIT flag for umount(2), which lets the user wait
for umount(2) to complete filesystem shutdown. This should fix a kernel panic
triggered when a living filesystem instance tries to access a dead block device
after device_shutdown(), issued by kernel_restart(), as below.

Term: namespace(mnt_get_count())

1. create_new_namespaces() creates ns1 and ns2,

  /data(1)    ns1(1)    ns2(1)
    |          |          |
     ---------------------
               |
        sb->s_active = 3

2. after binder_proc_clear_zombies() for ns2 and ns1 triggers
  - delayed_fput()
    - delayed_mntput_work(ns2)

  /data(1)    ns1(1)
    |          |
     ----------
          |
    sb->s_active = 2

3. umount() for /data succeeds.

  ns1(1)
    |
 sb->s_active = 1

4. device_shutdown() by init

5.  - delayed_mntput_work(ns1)
     - put_super(), since sb->s_active = 0
       - -EIO

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
---
 fs/namespace.c     | 12 +++++++++++-
 include/linux/fs.h |  1 +
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index f8893dc6a989..f2c15c4f6e23 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -21,6 +21,7 @@
 #include <linux/fs_struct.h>	/* get_fs_root et.al. */
 #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
 #include <linux/uaccess.h>
+#include <linux/file.h>
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/bootmem.h>
@@ -1629,7 +1630,8 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	int retval;
 	int lookup_flags = 0;
 
-	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
+	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW |
+			UMOUNT_WAIT))
 		return -EINVAL;
 
 	if (!may_mount())
@@ -1653,11 +1655,19 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	if (flags & MNT_FORCE && !capable(CAP_SYS_ADMIN))
 		goto dput_and_out;
 
+	/* flush delayed_fput to put mnt_count */
+	if (flags & UMOUNT_WAIT)
+		flush_delayed_fput();
+
 	retval = do_umount(mnt, flags);
 dput_and_out:
 	/* we mustn't call path_put() as that would clear mnt_expiry_mark */
 	dput(path.dentry);
 	mntput_no_expire(mnt);
+
+	/* flush delayed_mntput_work to put sb->s_active */
+	if (!retval && (flags & UMOUNT_WAIT))
+		flush_scheduled_work();
 out:
 	return retval;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6e1fd5d21248..69f0fd53c9c7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1278,6 +1278,7 @@ struct mm_struct;
 #define MNT_DETACH	0x00000002	/* Just detach from the tree */
 #define MNT_EXPIRE	0x00000004	/* Mark for expiry */
 #define UMOUNT_NOFOLLOW	0x00000008	/* Don't follow symlink on umount */
+#define UMOUNT_WAIT	0x00000010	/* Wait to unmount completely */
 #define UMOUNT_UNUSED	0x80000000	/* Flag guaranteed to be unused */
 
 /* sb->s_iflags */
-- 
2.14.0.rc1.383.gd1ce394fe2-goog

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-20 17:38 ` [PATCH v2] " Jaegeuk Kim
@ 2017-09-20 18:38   ` Al Viro
  2017-09-21  0:34     ` Jaegeuk Kim
  2017-09-21 18:20   ` [PATCH v3] vfs: introduce UMOUNT_WAIT to wait for delayed_fput/mntput completion Jaegeuk Kim
  1 sibling, 1 reply; 30+ messages in thread
From: Al Viro @ 2017-09-20 18:38 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On Wed, Sep 20, 2017 at 10:38:31AM -0700, Jaegeuk Kim wrote:
> This patch introduces UMOUNT_WAIT flag for umount(2) which let user wait for
> umount(2) to complete filesystem shutdown. This should fix a kernel panic
> triggered when a living filesystem tries to access dead block device after
> device_shutdown done by kernel_restart as below.

NAK.  This is just papering over the race you've got; it does not fix it.
You count upon the kernel threads in question having already gotten past
scheduling delayed fput, but what's there to guarantee that?  You are
essentially adding a "flush all pending fput that had already been
scheduled" syscall.  It
	a) doesn't belong in umount(2) and
	b) doesn't fix the race.
It might change the timing enough to have your specific reproducer survive,
but that kind of approach is simply wrong.

Incidentally, the name is a misnomer - it does *NOT* wait for completion of
fs shutdown.  Proof: have a filesystem mounted in two namespaces and issue
that thing in one of them.  Then observe how it's still alive, well and
accessible in another.

The only case that gets affected by it is when another mount is heading for
shutdown and is in a very specific part of that.  That is waited for.
If it's just before *OR* just past that stage, you are fucked.

And yes, "just past" is also affected.  Look:
CPU1: delayed_fput()
        struct llist_node *node = llist_del_all(&delayed_fput_list);
delayed_fput_list is empty now
        llist_for_each_entry_safe(f, t, node, f_u.fu_llist)
                __fput(f);
CPU2: your umount UMOUNT_WAIT
	flush_delayed_fput()
		does nothing, the list is empty
	....
	flush_scheduled_work()
		waits for delayed_fput() to finish
CPU1:
	finish __fput()
	call mntput() from it
	schedule_delayed_work(&delayed_mntput_work, 1);
CPU2:
	OK, everything scheduled prior to call of flush_scheduled_work() is completed,
we are done.
	return from umount(2)
	(in bogus userland code) tell it to shut devices down
...
oops, that delayed_mntput_work we'd scheduled there got to run.  Too bad...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-20 18:38   ` Al Viro
@ 2017-09-21  0:34     ` Jaegeuk Kim
  2017-09-21  2:42       ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-21  0:34 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On 09/20, Al Viro wrote:
> On Wed, Sep 20, 2017 at 10:38:31AM -0700, Jaegeuk Kim wrote:
> > This patch introduces UMOUNT_WAIT flag for umount(2) which let user wait for
> > umount(2) to complete filesystem shutdown. This should fix a kernel panic
> > triggered when a living filesystem tries to access dead block device after
> > device_shutdown done by kernel_restart as below.
> 
> NAK.  This is just papering over the race you've got; it does not fix it.
> You count upon the kernel threads in question having already gotten past
> scheduling delayed fput, but what's there to guarantee that?  You are
> essentially adding a "flush all pending fput that had already been
> scheduled" syscall.  It
> 	a) doesn't belong in umount(2) and
> 	b) doesn't fix the race.
>
> It might change the timing enough to have your specific reproducer survive,
> but that kind of approach is simply wrong.
> Incidentally, the name is a misnomer - it does *NOT* wait for completion of
> fs shutdown.  Proof: have a filesystem mounted in two namespaces and issue
> that thing in one of them.  Then observe how it's still alive, well and
> accessible in another.

Yes, I wrote the description incorrectly. Let me try describing UMOUNT_WAIT as
waiting only for any pending delayed work, as you said. In the normal case, where
another namespace is still active, this doesn't take effect at all.

> The only case that gets affected by it is when another mount is heading for
> shutdown and is in a very specific part of that.  That is waited for.
> If it's just before *OR* just past that stage, you are fucked.
> 
> And yes, "just past" is also affected.  Look:
> CPU1: delayed_fput()
>         struct llist_node *node = llist_del_all(&delayed_fput_list);
> delayed_fput_list() is empty now
>         llist_for_each_entry_safe(f, t, node, f_u.fu_llist)
>                 __fput(f);
> CPU2: your umount UMOUNT_WAIT
> 	flush_delayed_fput()
> 		does nothing, the list is empty

		how about waiting for workqueue completion here?

> 	....

	If all the __fput()s are not finished, do_umount() will return -EBUSY.

> 	flush_scheduled_work()
> 		waits for delayed_fput() to finish
> CPU1:
> 	finish __fput()
> 	call mntput() from it
> 	schedule_delayed_work(&delayed_mntput_work, 1);
> CPU2:
> 	OK, everything scheduled prior to call of flush_scheduled_work() is completed,
> we are done.
> 	return from umount(2)
> 	(in bogus userland code) tell it to shut devices down
> ...
> oops, that delayed_mntput_work we'd scheduled there got to run.  Too bad...

Is this doable?

---
 fs/file_table.c      |  6 ++++++
 fs/namespace.c       | 26 +++++++++++++++++++++++++-
 include/linux/file.h |  1 +
 include/linux/fs.h   |  1 +
 4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 72e861a35a7f..35b32ffdb934 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -263,6 +263,12 @@ void flush_delayed_fput(void)
 
 static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
+void flush_delayed_fput_wait(void)
+{
+	delayed_fput(NULL);
+	flush_delayed_work(&delayed_fput_work);
+}
+
 void fput(struct file *file)
 {
 	if (atomic_long_dec_and_test(&file->f_count)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index f8893dc6a989..e2586a38c83c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -21,6 +21,7 @@
 #include <linux/fs_struct.h>	/* get_fs_root et.al. */
 #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
 #include <linux/uaccess.h>
+#include <linux/file.h>
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/bootmem.h>
@@ -1133,6 +1134,12 @@ static void delayed_mntput(struct work_struct *unused)
 }
 static DECLARE_DELAYED_WORK(delayed_mntput_work, delayed_mntput);
 
+void flush_delayed_mntput_wait(void)
+{
+	delayed_mntput(NULL);
+	flush_delayed_work(&delayed_mntput_work);
+}
+
 static void mntput_no_expire(struct mount *mnt)
 {
 	rcu_read_lock();
@@ -1629,7 +1636,8 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	int retval;
 	int lookup_flags = 0;
 
-	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
+	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW |
+			UMOUNT_WAIT))
 		return -EINVAL;
 
 	if (!may_mount())
@@ -1653,11 +1661,27 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	if (flags & MNT_FORCE && !capable(CAP_SYS_ADMIN))
 		goto dput_and_out;
 
+	/* flush delayed_fput to put mnt_count */
+	if (flags & UMOUNT_WAIT)
+		flush_delayed_fput_wait();
+
 	retval = do_umount(mnt, flags);
 dput_and_out:
 	/* we mustn't call path_put() as that would clear mnt_expiry_mark */
 	dput(path.dentry);
 	mntput_no_expire(mnt);
+
+	if (!retval && (flags & UMOUNT_WAIT)) {
+		/*
+		 * If the last delayed_fput() is called during do_umount()
+		 * and makes mnt_count zero, we need to guarantee to register
+		 * delayed_mntput by waiting for delayed_fput work again.
+		 */
+		flush_delayed_fput_wait();
+
+		/* flush delayed_mntput_work to put sb->s_active */
+		flush_delayed_mntput_wait();
+	}
 out:
 	return retval;
 }
diff --git a/include/linux/file.h b/include/linux/file.h
index 61eb82cbafba..ffb4236cde39 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -84,6 +84,7 @@ extern void put_unused_fd(unsigned int fd);
 extern void fd_install(unsigned int fd, struct file *file);
 
 extern void flush_delayed_fput(void);
+extern void flush_delayed_fput_wait(void);
 extern void __fput_sync(struct file *);
 
 #endif /* __LINUX_FILE_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6e1fd5d21248..69f0fd53c9c7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1278,6 +1278,7 @@ struct mm_struct;
 #define MNT_DETACH	0x00000002	/* Just detach from the tree */
 #define MNT_EXPIRE	0x00000004	/* Mark for expiry */
 #define UMOUNT_NOFOLLOW	0x00000008	/* Don't follow symlink on umount */
+#define UMOUNT_WAIT	0x00000010	/* Wait to unmount completely */
 #define UMOUNT_UNUSED	0x80000000	/* Flag guaranteed to be unused */
 
 /* sb->s_iflags */
-- 
2.14.0.rc1.383.gd1ce394fe2-goog

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-21  0:34     ` Jaegeuk Kim
@ 2017-09-21  2:42       ` Al Viro
  2017-09-21  5:02         ` Jaegeuk Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2017-09-21  2:42 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On Wed, Sep 20, 2017 at 05:34:09PM -0700, Jaegeuk Kim wrote:
> > 	flush_delayed_fput()
> > 		does nothing, the list is empty
> 
> 		how about waiting for workqueue completion here?
> 
> > 	....
> 
> 	If all the __fput()s are not finished, do_umount() will return -EBUSY.

Hell, no.  That's only when they are all on the same vfsmount.  And in that
case you don't need any waiting - if any of those mntput() is not past the
unlock_mount_hash() in mntput_no_expire(), you will get -EBUSY.  And if they
all are, the caller of umount(2) will end up dropping the last reference.  
In which case the shutdown will be scheduled via task_work_add() and processed
before umount(2) returns to userland.

The whole problem is that you have several vfsmounts over the same filesystem
(== same struct super_block), some of them held by kernel threads of yours.
umount(2) doesn't affect those and isn't affected by those.  What you do is,
AFAICS,
	ask the kernel threads to start shutting down
	umount()
	shut device down, hoping that all vfsmounts that used
to be held by those threads are gone by that point.

Your patch tries to stick "flush the pending work" in the umount().
With no warranty that it will catch that stuff in the stage where
flushing will affect anything.

> +void flush_delayed_fput_wait(void)
> +{
> +	delayed_fput(NULL);
> +	flush_delayed_work(&delayed_fput_work);
> +}

> +void flush_delayed_mntput_wait(void)
> +{
> +	delayed_mntput(NULL);
> +	flush_delayed_work(&delayed_mntput_work);
> +}

It's still a broken approach.  What I don't understand is why bother
with that sort of brittle logics in the first place.  Why not simply
open the damn thing with O_EXCL before proceeding to device shutdown?
And if you get "busy" from that, wait and retry...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-21  2:42       ` Al Viro
@ 2017-09-21  5:02         ` Jaegeuk Kim
  2017-09-21 14:48           ` Theodore Ts'o
  0 siblings, 1 reply; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-21  5:02 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, linux-f2fs-devel

On 09/21, Al Viro wrote:
> On Wed, Sep 20, 2017 at 05:34:09PM -0700, Jaegeuk Kim wrote:
> > > 	flush_delayed_fput()
> > > 		does nothing, the list is empty
> > 
> > 		how about waiting for workqueue completion here?
> > 
> > > 	....
> > 
> > 	If all the __fput()s are not finished, do_umount() will return -EBUSY.
> 
> Hell, no.  That's only when they are all on the same vfsmount.  And in that
> case you don't need any waiting - if any of those mntput() is not past the
> unlock_mount_hash() in mntput_no_expire(), you will get -EBUSY.  And if they
> all are, the caller of umount(2) will end up dropping the last reference.  
> In which case the shutdown will be scheduled via task_work_add() and processed
> before umount(2) returns to userland.

Yes, what I'm trying to do with this flag is wait for the mnt_count references on
the same vfsmount, as well as sb->s_active across namespaces, to be released.
So here, at first, I wanted to wait for delayed_fput(), which grabs mnt_count on
the same vfsmount, so that do_umount() could succeed in time. If this is the last
remaining namespace, waiting for delayed_mntput() lets us shut the filesystem
down via task_work at the end of umount(2).

> The whole problem is that you have several vfsmounts over the same filesystem
> (== same struct super_block), some of them held by kernel threads of yours.
> umount(2) doesn't affect those and isn't affected by those.  What you do is,
> AFAICS,
> 	ask the kernel threads to start shutting down
> 	umount()
> 	shut device down, hoping that all vfsmounts that used
> to be held by those threads are gone by that point.

Yes, and actually, Android retries umount(2) for several seconds if it fails. So
at first I thought it'd be better to make umount() more deterministic.

> Your patch tries to stick "flush the pending work" in the umount().
> With no warranty that it will catch that stuff in the stage where
> flushing will affect anything.
> 
> > +void flush_delayed_fput_wait(void)
> > +{
> > +	delayed_fput(NULL);
> > +	flush_delayed_work(&delayed_fput_work);
> > +}
> 
> > +void flush_delayed_mntput_wait(void)
> > +{
> > +	delayed_mntput(NULL);
> > +	flush_delayed_work(&delayed_mntput_work);
> > +}
> 
> It's still a broken approach.  What I don't understand is why bother
> with that sort of brittle logics in the first place.  Why not simply
> open the damn thing with O_EXCL before proceeding to device shutdown?
> And if you get "busy" from that, wait and retry...

I'm not sure how many times we would have to retry and wait for that. IMHO, it'd
be better to use it together with the new flag, since the flag can detect an
unclosed namespace given a successful umount(2).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-21  5:02         ` Jaegeuk Kim
@ 2017-09-21 14:48           ` Theodore Ts'o
  2017-09-21 17:16             ` Jaegeuk Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Theodore Ts'o @ 2017-09-21 14:48 UTC (permalink / raw)
  To: Jaegeuk Kim; +Cc: Al Viro, linux-kernel, linux-fsdevel, linux-f2fs-devel

On Wed, Sep 20, 2017 at 10:02:37PM -0700, Jaegeuk Kim wrote:
> 
> Yes, and in fact, Android retries umount(2) for several seconds if it
> fails. That is why I first thought it would be better to make umount()
> more deterministic.
> 
> I'm not sure how many times we would have to retry and wait for that.
> IMHO, it'd be better to use that retry together with the new flag, since
> the flag can detect an unclosed namespace even after umount(2) has
> succeeded.

Even if you could make umount wait for the binder deferred workqueue
to exit, if it takes time, it takes time.  And that's not the core
VFS's fault.  It's binder's.

So... why not fix this by adding a binder ioctl which forces its
workqueue to be flushed, and use that *before* you try calling mount?

If that binder ioctl takes minutes or hours to complete, then that's a
binder bug, and needs to be fixed in binder.

       	    	      	    	  - Ted


* Re: [PATCH v2] vfs: introduce UMOUNT_WAIT which waits for umount completion
  2017-09-21 14:48           ` Theodore Ts'o
@ 2017-09-21 17:16             ` Jaegeuk Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-21 17:16 UTC (permalink / raw)
  To: Theodore Ts'o, Al Viro, linux-kernel, linux-fsdevel,
	linux-f2fs-devel

On 09/21, Theodore Ts'o wrote:
> On Wed, Sep 20, 2017 at 10:02:37PM -0700, Jaegeuk Kim wrote:
> > 
> > Yes, and in fact, Android retries umount(2) for several seconds if it
> > fails. That is why I first thought it would be better to make umount()
> > more deterministic.
> > 
> > I'm not sure how many times we would have to retry and wait for that.
> > IMHO, it'd be better to use that retry together with the new flag, since
> > the flag can detect an unclosed namespace even after umount(2) has
> > succeeded.
> 
> Even if you could make umount wait for the binder deferred workqueue
> to exit, if it takes time, it takes time.  And that's not the core
> VFS's fault.  It's binder's.
> 
> So... why not fix this by adding a binder ioctl which forces its
> workqueue to be flushed, and use that *before* you try calling mount?

Yes, we could, but I'm not 100% sure this is triggered only by binder. It is
caused by delayed_fput()/mntput() issued from any kernel thread, which makes
it a much more generic problem.

> If that binder ioctl takes minutes or hours to complete, then that's a
> binder bug, and needs to be fixed in binder.
> 
>        	    	      	    	  - Ted


* Re: [PATCH v3] vfs: introduce UMOUNT_WAIT to wait for delayed_fput/mntput completion
  2017-09-20 17:38 ` [PATCH v2] " Jaegeuk Kim
  2017-09-20 18:38   ` Al Viro
@ 2017-09-21 18:20   ` Jaegeuk Kim
  1 sibling, 0 replies; 30+ messages in thread
From: Jaegeuk Kim @ 2017-09-21 18:20 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-f2fs-devel; +Cc: Al Viro

This patch introduces the UMOUNT_WAIT flag for umount(2), which lets the user
explicitly wait for the completion of delayed fput()/mntput().

In particular, it can be used in the Android reboot procedure to resolve the
issue below, where a kernel panic happens when a still-live filesystem
instance tries to access an already-dead block device after device_shutdown()
has been run by kernel_restart().

Term: namespace(mnt_get_count())

1. create_new_namespaces() creates ns1 and ns2,

  /data(1)    ns1(1)    ns2(1)
    |          |          |
     ---------------------
               |
        sb->s_active = 3

2. binder_proc_clear_zombies() for ns2 and ns1 triggers
  - delayed_fput()
    - delayed_mntput_work(ns2)

  /data(1)    ns1(1)
    |          |
     ----------
          |
    sb->s_active = 2

3. umount() for /data succeeds.

  ns1(1)
    |
 sb->s_active = 1

4. device_shutdown() by init

5.  - delayed_mntput_work(ns1)
     - put_super(), since sb->s_active becomes 0
       - the dead block device is accessed, returning -EIO

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
---
 fs/file_table.c      |  6 ++++++
 fs/namespace.c       | 26 +++++++++++++++++++++++++-
 include/linux/file.h |  1 +
 include/linux/fs.h   |  1 +
 4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 72e861a35a7f..35b32ffdb934 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -263,6 +263,12 @@ void flush_delayed_fput(void)
 
 static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
+void flush_delayed_fput_wait(void)
+{
+	delayed_fput(NULL);
+	flush_delayed_work(&delayed_fput_work);
+}
+
 void fput(struct file *file)
 {
 	if (atomic_long_dec_and_test(&file->f_count)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index f8893dc6a989..e2586a38c83c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -21,6 +21,7 @@
 #include <linux/fs_struct.h>	/* get_fs_root et.al. */
 #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
 #include <linux/uaccess.h>
+#include <linux/file.h>
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/bootmem.h>
@@ -1133,6 +1134,12 @@ static void delayed_mntput(struct work_struct *unused)
 }
 static DECLARE_DELAYED_WORK(delayed_mntput_work, delayed_mntput);
 
+void flush_delayed_mntput_wait(void)
+{
+	delayed_mntput(NULL);
+	flush_delayed_work(&delayed_mntput_work);
+}
+
 static void mntput_no_expire(struct mount *mnt)
 {
 	rcu_read_lock();
@@ -1629,7 +1636,8 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	int retval;
 	int lookup_flags = 0;
 
-	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
+	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW |
+			UMOUNT_WAIT))
 		return -EINVAL;
 
 	if (!may_mount())
@@ -1653,11 +1661,27 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	if (flags & MNT_FORCE && !capable(CAP_SYS_ADMIN))
 		goto dput_and_out;
 
+	/* flush delayed_fput to put mnt_count */
+	if (flags & UMOUNT_WAIT)
+		flush_delayed_fput_wait();
+
 	retval = do_umount(mnt, flags);
 dput_and_out:
 	/* we mustn't call path_put() as that would clear mnt_expiry_mark */
 	dput(path.dentry);
 	mntput_no_expire(mnt);
+
+	if (!retval && (flags & UMOUNT_WAIT)) {
+		/*
+		 * If the last delayed_fput() is called during do_umount()
+		 * and makes mnt_count zero, we need to guarantee to register
+		 * delayed_mntput by waiting for delayed_fput work again.
+		 */
+		flush_delayed_fput_wait();
+
+		/* flush delayed_mntput_work to put sb->s_active */
+		flush_delayed_mntput_wait();
+	}
 out:
 	return retval;
 }
diff --git a/include/linux/file.h b/include/linux/file.h
index 61eb82cbafba..ffb4236cde39 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -84,6 +84,7 @@ extern void put_unused_fd(unsigned int fd);
 extern void fd_install(unsigned int fd, struct file *file);
 
 extern void flush_delayed_fput(void);
+extern void flush_delayed_fput_wait(void);
 extern void __fput_sync(struct file *);
 
 #endif /* __LINUX_FILE_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6e1fd5d21248..69f0fd53c9c7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1278,6 +1278,7 @@ struct mm_struct;
 #define MNT_DETACH	0x00000002	/* Just detach from the tree */
 #define MNT_EXPIRE	0x00000004	/* Mark for expiry */
 #define UMOUNT_NOFOLLOW	0x00000008	/* Don't follow symlink on umount */
+#define UMOUNT_WAIT	0x00000010	/* Wait to unmount completely */
 #define UMOUNT_UNUSED	0x80000000	/* Flag guaranteed to be unused */
 
 /* sb->s_iflags */
-- 
2.14.0.rc1.383.gd1ce394fe2-goog


end of thread, other threads:[~2017-09-21 18:20 UTC | newest]

Thread overview: 30+ messages
2017-09-13 20:09 [PATCH] vfs: introduce UMOUNT_WAIT which waits for umount completion Jaegeuk Kim
2017-09-13 23:04 ` Al Viro
2017-09-13 23:31   ` Jaegeuk Kim
2017-09-13 23:44     ` Al Viro
2017-09-14  1:10       ` Jaegeuk Kim
2017-09-14  1:30         ` Al Viro
2017-09-14 18:37           ` Al Viro
2017-09-14 19:14             ` Jaegeuk Kim
2017-09-15  0:19               ` Jaegeuk Kim
2017-09-15  2:06                 ` Al Viro
2017-09-15  3:45                   ` Jaegeuk Kim
2017-09-15  4:21                     ` Al Viro
2017-09-15 18:44                       ` Jaegeuk Kim
2017-09-15 22:12                         ` Theodore Ts'o
2017-09-15 23:29                           ` Jaegeuk Kim
2017-09-15 23:43                             ` Al Viro
2017-09-19 15:55                               ` Jaegeuk Kim
2017-09-16  7:11                           ` Amir Goldstein
2017-09-20 17:38 ` [PATCH v2] " Jaegeuk Kim
2017-09-20 18:38   ` Al Viro
2017-09-21  0:34     ` Jaegeuk Kim
2017-09-21  2:42       ` Al Viro
2017-09-21  5:02         ` Jaegeuk Kim
2017-09-21 14:48           ` Theodore Ts'o
2017-09-21 17:16             ` Jaegeuk Kim
2017-09-21 18:20   ` [PATCH v3] vfs: introduce UMOUNT_WAIT to wait for delayed_fput/mntput completion Jaegeuk Kim
